=============
dm-log-writes
=============

This target takes 2 devices, one to pass all IO to normally, and one to log all
of the write operations to. This is intended for file system developers wishing
to verify the integrity of metadata or data as the file system is written to.
There is a log_write_entry written for every WRITE request and the target is
able to take arbitrary data from userspace to insert into the log. The data
that is in the WRITE requests is copied into the log to make the replay happen
exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This is to make it easier for userspace to replay
the log in a way that correlates to what is on disk and not what is in cache,
to make it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures.
Consider the following example (W means write, C means complete):

    W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

    W3,W2,flush,W1....

Again, this is to simulate what is actually on disk. It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, as those requests will obviously bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request. Consider the following example:

    WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

    DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        we're fsck'ing something reasonable, you would do something like
        this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on doing the fsck check in the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.
You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sum's are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.
You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

And that will replay the log until it sees a FUA request, run the fsck command
and if the fsck passes it will replay to the next FUA, until it is completed or
the fsck command exits abnormally.
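
The rules from the Log Ordering section can be made concrete with a small
simulation. What follows is a minimal sketch in plain POSIX shell, not the
kernel implementation. The function name ``simulate_log`` and its event names
(``Wn`` for a write being issued, ``Cn`` for its completion, ``Wflush`` and
``Cflush`` for the flush) are hypothetical, chosen only for this illustration.
The sketch models a single flush in flight and ignores FUA and DISCARD
requests.

```shell
#!/bin/sh
# Toy model of dm-log-writes ordering (illustration only, not kernel code).
# Events: Wn = write n issued, Cn = write n completed,
#         Wflush = flush issued, Cflush = flush completed.
simulate_log() {
    completed=""  # writes completed since the last flush was issued
    batch=""      # writes spliced onto the in-flight flush
    logged=""     # entries in the order they would reach the log
    for ev in "$@"; do
        case "$ev" in
            # Flush issued: take over everything completed so far.
            Wflush) batch="$completed"; completed="" ;;
            # Flush completed: log the spliced writes, then the flush.
            Cflush) logged="${logged}${batch}flush,"; batch="" ;;
            # Write completed: remember it until the next flush is issued.
            C*)     completed="${completed}W${ev#C}," ;;
            # Issuing a write logs nothing by itself.
            W*)     ;;
        esac
    done
    # Strip trailing commas; "pending" entries wait for the next flush.
    echo "logged=${logged%,} pending=${completed%,}"
}

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
```

Run on the sequence from the Log Ordering section, this prints
``logged=W3,W2,flush pending=W1``. W1 completed after the flush was issued,
so in this model it would only reach the log with the next flush.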