The Linux Journalling API
=========================

Overview
--------

Details
~~~~~~~

The journalling layer is easy to use. First you need to create a
journal_t data structure. There are two calls for this, depending on
where the physical media holding the journal resides: use
jbd2_journal_init_inode() for a journal stored in a filesystem inode,
or jbd2_journal_init_dev() for a journal stored on a raw device (in a
contiguous range of blocks). Both calls return a pointer to a
journal_t, which is a typedef for a struct, so when you are finally
finished make sure you call jbd2_journal_destroy() on it to free up
any used kernel memory.

Once you have your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
to have been allocated and initialized properly by the userspace tools.
When loading the journal you must call jbd2_journal_load() to process
the journal contents. If the client file system detects that the journal
contents do not need to be processed (or need not even be valid), it
may call jbd2_journal_wipe() to clear the journal contents before
calling jbd2_journal_load().

Note that jbd2_journal_wipe(..,0) calls
jbd2_journal_skip_recovery() for you if it detects any outstanding
transactions in the journal, and similarly jbd2_journal_load() will
call jbd2_journal_recover() if necessary. I would advise reading
ext4_load_journal() in fs/ext4/super.c for an example of this stage.

Now you can go ahead and start modifying the underlying filesystem.
Almost.

You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. You also need to wrap the
modification of each buffer with calls to the journal layer, so that it
knows which modifications you are actually making. To do this, use
jbd2_journal_start(), which returns a transaction handle.

jbd2_journal_start() and its counterpart jbd2_journal_stop(), which
indicates the end of a transaction, are nestable calls, so you can
reenter a transaction if necessary. Remember, though, that you must call
jbd2_journal_stop() the same number of times as
jbd2_journal_start() before the transaction is completed (or, more
accurately, leaves the update phase). Ext4/VFS makes use of this feature
to simplify handling of inode dirtying, quota support, etc.
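
The nesting rule can be pictured with a small userspace model. This is
only a sketch of the counting behaviour described above, not kernel code:
every ``toy_`` name is made up for illustration, and the single global
``toy_current`` stands in for the one outstanding handle a task may hold.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Toy stand-in for a transaction handle; not the kernel's handle_t. */
   struct toy_handle {
       int h_ref;                      /* nesting depth of start calls */
   };

   struct toy_handle toy_storage;
   struct toy_handle *toy_current;     /* the task's single outstanding handle */
   int toy_completed;                  /* handles that left the update phase */

   struct toy_handle *toy_journal_start(void)
   {
       if (toy_current) {              /* re-entry: reuse the same handle */
           toy_current->h_ref++;
           return toy_current;
       }
       toy_storage.h_ref = 1;
       toy_current = &toy_storage;
       return toy_current;
   }

   void toy_journal_stop(struct toy_handle *h)
   {
       assert(h == toy_current && h->h_ref > 0);
       if (--h->h_ref > 0)
           return;                     /* inner stop: still in the update phase */
       toy_current = NULL;             /* outermost stop: handle is finished */
       toy_completed++;
   }

   /* Returns how many handles completed at the outermost stop only. */
   int toy_nested_demo(void)
   {
       struct toy_handle *outer = toy_journal_start();
       struct toy_handle *inner = toy_journal_start();
       int after_inner;

       assert(inner == outer);         /* nested start returns the same handle */
       toy_journal_stop(inner);
       after_inner = toy_completed;    /* nothing has completed yet */
       toy_journal_stop(outer);
       return toy_completed - after_inner;
   }

In this model only the second, outermost stop completes the handle, which
is why the start and stop calls have to balance.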

Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
need to call jbd2_journal_get_create_access() /
jbd2_journal_get_write_access() /
jbd2_journal_get_undo_access() as appropriate; this allows the
journalling layer to copy the unmodified data if it needs to. After all,
the buffer may be part of a previous, still uncommitted transaction. At
this point you are at last ready to modify the buffer, and once you
have done so you need to call jbd2_journal_dirty_metadata(). Or, if
you have asked for access to a buffer that you now know no longer needs
to be pushed back to the device, you can call jbd2_journal_forget() in
much the same way as you might have used bforget() in the past.
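
As a minimal sketch of the sequence just described, the fragment below
updates part of one metadata block inside a single transaction. It is
kernel-context code and will not build on its own. ``myfs_update_block()``
and its parameters are hypothetical names, and reserving a single block
assumes nothing else is touched under this handle.

.. code-block:: c

   #include <linux/buffer_head.h>
   #include <linux/err.h>
   #include <linux/jbd2.h>
   #include <linux/string.h>

   /*
    * Hypothetical helper: overwrite len bytes at offset off in one
    * journalled block. The caller must ensure off + len <= bh->b_size.
    */
   static int myfs_update_block(journal_t *journal, struct buffer_head *bh,
                                unsigned int off, const void *src,
                                unsigned int len)
   {
       handle_t *handle;
       int err, err2;

       /* Reserve one block: the most this transaction will touch. */
       handle = jbd2_journal_start(journal, 1);
       if (IS_ERR(handle))
           return PTR_ERR(handle);

       /* Ask for access first, so the journal can keep the old contents. */
       err = jbd2_journal_get_write_access(handle, bh);
       if (!err) {
           lock_buffer(bh);
           memcpy(bh->b_data + off, src, len);
           unlock_buffer(bh);

           /* Hand the modified buffer over to the journal. */
           err = jbd2_journal_dirty_metadata(handle, bh);
       }

       /* Every start needs a matching stop, even on the error path. */
       err2 = jbd2_journal_stop(handle);
       return err ? err : err2;
   }

A newly allocated block would use jbd2_journal_get_create_access() in
place of jbd2_journal_get_write_access(); the rest of the shape stays the
same.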

jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.

Then at umount time, in your put_super(), you can call
jbd2_journal_destroy() to clean up your in-core journal object.

Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing
commits until the outermost jbd2_journal_stop(). This means you must
complete the transaction at the end of each file/inode/address etc.
operation you perform, so that the journalling system isn't re-entered
on another journal. This matters because transactions can't be
nested/batched across differing journals, and a filesystem other than
yours (say ext4) may be modified in a later syscall.

The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param). When it blocks it merely(!) needs to wait
for transactions from other tasks to complete and be committed, so
essentially we are waiting for jbd2_journal_stop(). So to avoid
deadlocks you must treat jbd2_journal_start() /
jbd2_journal_stop() as if they were semaphores and include them in
your semaphore ordering rules. Note that jbd2_journal_extend() has
similar blocking behaviour to jbd2_journal_start(), so you can deadlock
here just as easily as on jbd2_journal_start().

Try to reserve the right number of blocks the first time. ;-) This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at least at ext4_jbd.h to see the
basis on which ext4 makes these decisions.

Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks as of
the end of the last fully committed transaction. One simple way of
ensuring this is to mark blocks as free in the internal in-memory block
allocation structures only after the transaction freeing them commits.
Ext4 uses a journal commit callback for this purpose.

With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
``journal->j_commit_callback`` function pointer; that function is then
called after each transaction commit. You can also use
``transaction->t_private_list`` to attach entries to a transaction
that need processing when the transaction commits.
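
The deferred-free idea can be illustrated with a small userspace model.
This is a sketch under simplifying assumptions, not how ext4 implements
it: all ``toy_`` names are invented here, the allocation map is a plain
array, and ``toy_commit_callback()`` merely plays the role that a
function installed in ``journal->j_commit_callback`` would play.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   #define TOY_NBLOCKS 8

   struct toy_fs {
       bool in_use[TOY_NBLOCKS];   /* in-memory allocation map */
       int pending[TOY_NBLOCKS];   /* blocks freed by the running transaction */
       size_t npending;
   };

   /* Delete path: remember the block, but keep it marked in use for now. */
   void toy_free_block(struct toy_fs *fs, int blk)
   {
       assert(blk >= 0 && blk < TOY_NBLOCKS && fs->in_use[blk]);
       assert(fs->npending < TOY_NBLOCKS);
       fs->pending[fs->npending++] = blk;
   }

   /* Allocator: hand out the first block not marked in use, or -1. */
   int toy_alloc_block(struct toy_fs *fs)
   {
       for (int i = 0; i < TOY_NBLOCKS; i++) {
           if (!fs->in_use[i]) {
               fs->in_use[i] = true;
               return i;
           }
       }
       return -1;
   }

   /* Runs once the freeing transaction has committed: now release blocks. */
   void toy_commit_callback(struct toy_fs *fs)
   {
       for (size_t i = 0; i < fs->npending; i++)
           fs->in_use[fs->pending[i]] = false;
       fs->npending = 0;
   }

   /* Returns 1 if a freed block only becomes reusable after the commit. */
   int toy_deferred_free_demo(void)
   {
       struct toy_fs fs = { .npending = 0 };
       int before, after;

       for (int i = 0; i < TOY_NBLOCKS; i++)
           fs.in_use[i] = true;            /* start with a full filesystem */

       toy_free_block(&fs, 3);
       before = toy_alloc_block(&fs);      /* commit has not happened yet */
       toy_commit_callback(&fs);
       after = toy_alloc_block(&fs);       /* block 3 is available now */
       return before == -1 && after == 3;
   }

Because the allocator never sees the block as free before the commit, a
crash in between cannot leave a reallocated block with contents that the
last committed transaction still depends on.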

JBD2 also provides a way to block all transaction updates via
jbd2_journal_lock_updates() /
jbd2_journal_unlock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::

        jbd2_journal_lock_updates()   // stop new stuff happening..
        jbd2_journal_flush()          // checkpoint everything.
        ..do stuff on stable fs
        jbd2_journal_unlock_updates() // carry on with filesystem use.

The opportunities for abuse and DoS attacks with this should be obvious
if you allow unprivileged userspace to trigger codepaths containing
these calls.

Fast commits
~~~~~~~~~~~~

JBD2 also allows you to perform file-system specific delta commits known
as fast commits. In order to use fast commits, you will need to set the
following callbacks, which perform the corresponding work:

``journal->j_fc_cleanup_cb``: Cleanup function called after every full
commit and fast commit.

``journal->j_fc_replay_cb``: Replay function called for replay of fast
commit blocks.

The file system is free to perform fast commits as and when it wants, as
long as it gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform
a full commit immediately after stopping the fast commit, it can do so by
calling :c:func:`jbd2_fc_end_commit_fallback()`. This is useful if the
fast commit operation fails for some reason and the only way to guarantee
consistency is for JBD2 to perform the full traditional commit.

JBD2 also provides helper functions to manage fast commit buffers. The
file system can use :c:func:`jbd2_fc_get_buf()` and
:c:func:`jbd2_fc_wait_bufs()` to allocate fast commit buffers and wait
on their IO completion.

Currently, only Ext4 implements fast commits. For details of its
implementation of fast commits, please refer to the top level comments in
fs/ext4/fast_commit.c.

Summary
~~~~~~~

Using the journal is a matter of wrapping the different context changes,
namely each mount, each modification (transaction) and each changed
buffer, to tell the journalling layer about them.

Data Types
----------

The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced as this is 'C'.

Structures
~~~~~~~~~~

.. kernel-doc:: include/linux/jbd2.h
   :internal:

Functions
---------

The functions here are split into two groups: those that affect a journal
as a whole, and those which are used to manage transactions.

Journal Level
~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/journal.c
   :export:

.. kernel-doc:: fs/jbd2/recovery.c
   :internal:

Transaction Level
~~~~~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/transaction.c

See also
--------

`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__

`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
215*4882a593Smuzhiyun