=============
dm-log-writes
=============

This target takes two devices: one that all IO passes through normally, and one
that logs every write operation.  It is intended for file system developers who
want to verify the integrity of metadata or data as the file system is written
to.  A log_write_entry is written for every WRITE request, and the target can
also take arbitrary data from userspace and insert it into the log.  The data in
each WRITE request is copied into the log so that a replay reproduces the writes
exactly as they originally happened.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the log
in a way that correlates to what is on disk rather than what is in cache, and so
makes it easier to detect improper waiting/flushing.

This works by attaching each WRITE request to a list once the write completes.
When we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst-case scenario with regard to power
failures.  Consider the following example (W means write, C means complete)::

	W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following::

	W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
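
The ordering rule above can be illustrated with a small shell model.  This is
only a sketch of the rule as described in this section, not the kernel
implementation; the function name ``simulate_log`` and the event spelling
(``W1``, ``C1``, ``Wflush``, ``Cflush``) are made up for the illustration.

```shell
# Toy model of the log ordering rule; it does not touch any device.
# Events: W<n> = write n issued, C<n> = write n completed,
#         Wflush = flush issued, Cflush = flush completed.
simulate_log() (
    completed=""   # writes completed since the last flush was issued
    attached=""    # writes spliced onto the flush that is in flight
    log=""         # order in which entries reach the log
    for ev in "$@"; do
        case "$ev" in
            Wflush)
                # Only writes that have already completed ride on this flush.
                attached=$completed
                completed=""
                ;;
            Cflush)
                # Log the attached writes first, then the flush itself.
                log="${log:+$log,}${attached:+$attached,}flush"
                attached=""
                ;;
            C*)
                completed="${completed:+$completed,}W${ev#C}"
                ;;
            W*)
                : # issuing a write does not log anything by itself
                ;;
        esac
    done
    # Writes still waiting for a later flush are reported as pending.
    echo "logged=$log pending=$completed"
)

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
# prints: logged=W3,W2,flush pending=W1
```

W1 completes after the flush was issued, so in this model it stays pending and
would only be logged behind the next flush, which matches the trailing ``W1``
in the example log above.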

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, because those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise the log
would contain all of the DISCARD requests, then the WRITE requests, and then the
FLUSH request.  Consider the following example::

	WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this::

	DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================
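
As a sketch of how the constructor arguments fit into a full device-mapper
table line, the helper below assembles one.  The function name
``make_log_writes_table`` is hypothetical, the mapping is assumed to start at
sector 0 and cover the whole device, and the sector count in the example call
is a made-up value that would normally come from ``blockdev --getsz``.

```shell
# Build a table line: <start> <length> log-writes <dev_path> <log_dev_path>
# $1 = length in 512-byte sectors, $2 = dev_path, $3 = log_dev_path
make_log_writes_table() {
    echo "0 $1 log-writes $2 $3"
}

# Example with a made-up size of 2097152 sectors (1 GiB).
make_log_writes_table 2097152 /dev/sdb /dev/sdc
# prints: 0 2097152 log-writes /dev/sdb /dev/sdc
```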

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================
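
A minimal sketch of reading these two fields from a status line follows.  It
assumes the usual ``dmsetup status`` layout, where the start sector, length and
target type come before the target-specific fields.  The function name
``parse_log_writes_status`` and the numbers in the sample line are made up.

```shell
# Split one status line into the two log-writes fields.
# Assumed layout:
#   <start> <length> log-writes <#logged entries> <highest allocated sector>
parse_log_writes_status() (
    set -f        # no globbing while the line is word-split
    set -- $1     # fields become $1..$5
    if [ "$3" != "log-writes" ]; then
        echo "not a log-writes target: $3" >&2
        exit 1
    fi
    echo "entries=$4 highest_sector=$5"
)

# Sample line with invented values; on a live system one would pass
# "$(dmsetup status log)" instead.
parse_log_writes_status "0 2097152 log-writes 1523 409600"
# prints: entries=1523 highest_sector=409600
```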

iii) Messages

    mark <description>

	You can use a dmsetup message to set an arbitrary mark in a log.
	For example, say you want to fsck a file system after every
	write, but first need to replay up to the mkfs so that you are
	fsck'ing something reasonable.  You would do something like
	this::

	  mkfs.btrfs -f /dev/mapper/log
	  dmsetup message log 0 mark mkfs
	  <run test>

	This would allow you to replay the log up to the mkfs mark and
	then replay from that point on, doing the fsck check at the
	interval that you want.

	Every log has a mark at the end labeled "dm-log-writes-end".
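
When a test has several phases, it can help to wrap each phase so that the
mark is only written if the phase succeeded.  The helper below is a sketch of
that idea.  The name ``run_and_mark`` and the ``DMSETUP`` override variable are
hypothetical conventions for this example; only the ``dmsetup message ... mark``
call comes from the text above.

```shell
# Command used to talk to device-mapper; set DMSETUP=echo for a dry run.
DMSETUP=${DMSETUP:-dmsetup}

# Usage: run_and_mark <dm name> <mark> <command> [args...]
# Runs the command and, only if it succeeds, records the mark in the log.
run_and_mark() {
    dm_name=$1
    mark=$2
    shift 2
    "$@" || return
    $DMSETUP message "$dm_name" 0 mark "$mark"
}

# Example (not executed here, it needs root and a "log" device):
#   run_and_mark log mkfs mkfs.btrfs -f /dev/mapper/log
```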

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sum's are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
	--fsck "btrfsck /dev/sdb" --check fua

This will replay the log until it sees a FUA request and then run the fsck
command.  If the fsck passes, it replays to the next FUA, and so on until the
log is fully replayed or the fsck command exits abnormally.
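
The control flow of that last command can be pictured with a toy loop.  This is
a sketch of the behaviour described above, not the replay-log source.  It
assumes the check runs right after a FUA entry has been replayed, and the names
``replay_with_checks``, ``check_ok`` and ``check_bad_from_4`` are invented for
the illustration.

```shell
# Toy model of the check loop; it does not call replay-log or touch a device.
# $1 is the name of a check command (a stand-in for the --fsck command); the
# remaining arguments are log entries, where "fua" marks a FUA request.
replay_with_checks() (
    check=$1
    shift
    n=0
    for entry in "$@"; do
        n=$((n + 1))
        # A real replay would apply the entry to the replay device here.
        if [ "$entry" = fua ] && ! "$check" "$n"; then
            echo "check failed after entry $n"
            exit 1
        fi
    done
    echo "replayed all $n entries"
)

# Hypothetical stand-ins for fsck: one always passes, one fails from entry 4 on.
check_ok() { return 0; }
check_bad_from_4() { [ "$1" -lt 4 ]; }

replay_with_checks check_ok write write fua write fua
# prints: replayed all 5 entries
replay_with_checks check_bad_from_4 write fua write write fua || true
# prints: check failed after entry 5
```

The entry number reported on failure is the point in the log where a power
failure would have left the file system in a state the check rejects.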