=============
dm-log-writes
=============

This target takes two devices: one to pass all IO to normally, and one to log
all of the write operations to.  It is intended for file system developers
wishing to verify the integrity of metadata or data as the file system is
written to.  A log_write_entry is written for every WRITE request, and the
target can take arbitrary data from userspace to insert into the log.  The
data in each WRITE request is copied into the log so that a replay reproduces
the writes exactly as they originally happened.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the
log in a way that correlates to what is on disk rather than what is in cache,
and so easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures.  Consider the following example (W means write, C means complete)::

        W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following::

        W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

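The ordering rule can be tried out with a small toy model.  The sketch below
is not the kernel implementation; it only mimics the rule described above, and
the function name ``simulate_log`` and the comma-separated event format are
assumptions made for this illustration.

```shell
# Toy model of the log ordering rule; NOT the kernel code.
# Input: comma-separated events, e.g. W1,C1,Wflush,Cflush
#   W<n> = write issued, C<n> = write completed,
#   Wflush = flush issued, Cflush = flush completed.
simulate_log() (
    set -f         # no pathname expansion while splitting the event list
    completed=""   # writes completed but not yet covered by a flush
    attached=""    # writes spliced onto the in-flight flush
    log=""         # entries in the order they would reach the log
    IFS=,
    for ev in $1; do
        case $ev in
            (Wflush)  # flush issued: take only the writes completed so far
                attached=$completed
                completed=""
                ;;
            (Cflush)  # flush completed: log its writes, then the flush
                log="${log}${attached}flush,"
                attached=""
                ;;
            (C*)      # write completed: queue it for the next flush
                completed="${completed}W${ev#C},"
                ;;
            (W*)      # write issued: nothing is logged yet
                ;;
        esac
    done
    printf '%s\n' "${log%,}"
)

simulate_log W1,W2,W3,C3,C2,Wflush,C1,Cflush
```

For the sequence from the example this prints ``W3,W2,flush``.  W1 completed
only after the flush was issued, so in this model it stays queued until a later
flush completes.
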
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests: they are held
until the next flush rather than logged as they complete.  Otherwise the log
would contain all the DISCARD requests, then the WRITE requests, and then the
FLUSH request.  Consider the following example::

        WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this::

        DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

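As a rough illustration, the constructor arguments sit at the end of an
ordinary device-mapper table line, after the start sector, the length in
sectors and the target name.  The helper name ``log_writes_table`` and the
sector count used here are made up for the example.

```shell
# Build a log-writes table line (hypothetical helper, for illustration only).
#   $1: length of the mapping in 512-byte sectors
#   $2: dev_path      (device that receives the IO normally)
#   $3: log_dev_path  (device that receives the log entries)
log_writes_table() {
    printf '0 %s log-writes %s %s\n' "$1" "$2" "$3"
}

# The sector count below is an arbitrary example value; on a real system it
# would typically come from "blockdev --getsz <dev_path>".
log_writes_table 2097152 /dev/sdb /dev/sdc
```

The resulting line is what would be passed to ``dmsetup create`` through
``--table``.
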
ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

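A script can pick the two status fields out of a status line along these
lines.  This is only a sketch: it assumes the line has the usual
device-mapper layout of start sector, length and target name followed by the
target's own fields, and both the helper name ``parse_log_writes_status`` and
the sample numbers are invented for the example.

```shell
# Split a status line for a log-writes target into its two fields.
# Assumed layout: <start> <length> log-writes <#logged entries> <highest sector>
parse_log_writes_status() {
    # Re-split the single argument into positional parameters.
    set -- $1
    # Refuse lines that belong to some other target type.
    [ "$3" = "log-writes" ] || return 1
    printf 'entries=%s highest_sector=%s\n' "$4" "$5"
}

# Sample line with made-up values, not real output.
parse_log_writes_status "0 2097152 log-writes 1234 5678"
```

On a live system the argument would presumably come from something like
``"$(dmsetup status log)"``.
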
iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you're fsck'ing something reasonable.  You would do something
        like this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

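When a test script sets several marks, it can help to tie each mark to the
step it follows.  The wrapper below is one possible sketch; the function name
``mark_after`` and the ``DMSETUP`` override variable are assumptions of this
example and are not part of dmsetup or of the target.

```shell
# Allow the dmsetup binary to be overridden, e.g. DMSETUP=echo for a dry run.
DMSETUP=${DMSETUP:-dmsetup}

# Run a command and, only if it succeeds, set a mark in the log.
#   $1: device-mapper name (e.g. "log")
#   $2: mark description
#   remaining arguments: the command to run first
mark_after() {
    dm_name=$1
    mark=$2
    shift 2
    # If the command fails, return its status and set no mark.
    "$@" || return
    "$DMSETUP" message "$dm_name" 0 mark "$mark"
}
```

With this in place, ``mark_after log mkfs mkfs.btrfs -f /dev/mapper/log``
would run the mkfs and then set the ``mkfs`` mark, matching the two commands
shown above.
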
Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sum's are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

And that will replay the log until it sees a FUA request, run the fsck command,
and if the fsck passes it will replay to the next FUA, until it is completed or
the fsck command exits abnormally.