The Linux Journalling API
=========================

Overview
--------

Details
~~~~~~~

The journalling layer is easy to use. First of all you need to create a
journal_t data structure. There are two calls to do this, depending on
how you decide to allocate the physical media on which the journal
resides. The jbd2_journal_init_inode() call is for journals stored in
filesystem inodes, and the jbd2_journal_init_dev() call is for journals
stored on a raw device (in a contiguous range of blocks). Both calls
give you a pointer to a journal_t (a typedef for a struct), so when you
are finally finished make sure you call jbd2_journal_destroy() on it to
free up any used kernel memory.

Once you have your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
to have been allocated and initialized properly by the userspace tools.
When loading the journal you must call jbd2_journal_load() to process
the journal contents. If the client file system detects that the journal
contents do not need to be processed (or need not even be valid), it
may call jbd2_journal_wipe() to clear the journal contents before
calling jbd2_journal_load().
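
To make the call order above concrete, here is a minimal userspace sketch in
C. It is a toy model and not kernel code: the ``toy_*`` names, the state
machine and the return values are assumptions made purely for illustration,
and each function only stands in for the JBD2 call named in its comment.

```c
#include <assert.h>

/*
 * Toy userspace model of the mount-time call order. This is not JBD2 code:
 * every toy_* name is hypothetical and only mirrors the sequence in the text.
 */
enum toy_state { TOY_CREATED, TOY_LOADED, TOY_DESTROYED };

struct toy_journal {
    enum toy_state state;
    int pending;   /* transactions still sitting in the on-disk log */
    int replayed;  /* transactions replayed while loading */
};

/* Stands in for jbd2_journal_init_inode() / jbd2_journal_init_dev(). */
struct toy_journal toy_journal_init(int pending)
{
    struct toy_journal j = { TOY_CREATED, pending, 0 };
    return j;
}

/* Stands in for jbd2_journal_wipe(): the model allows it only before load. */
int toy_journal_wipe(struct toy_journal *j)
{
    if (j->state != TOY_CREATED)
        return -1;
    j->pending = 0;   /* discard the log instead of replaying it */
    return 0;
}

/* Stands in for jbd2_journal_load(): replays whatever is still pending. */
int toy_journal_load(struct toy_journal *j)
{
    if (j->state != TOY_CREATED)
        return -1;
    j->replayed = j->pending;
    j->pending = 0;
    j->state = TOY_LOADED;
    return 0;
}

/* Stands in for jbd2_journal_destroy(): the model allows it once. */
int toy_journal_destroy(struct toy_journal *j)
{
    if (j->state == TOY_DESTROYED)
        return -1;
    j->state = TOY_DESTROYED;
    return 0;
}
```

In this model, a journal created with three pending transactions replays all
three when loaded, whereas wiping it first means the load replays none. For
the real sequence, ext4_load_journal() in fs/ext4/super.c remains the
reference.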

Note that jbd2_journal_wipe(..,0) calls
jbd2_journal_skip_recovery() for you if it detects any outstanding
transactions in the journal, and similarly jbd2_journal_load() will
call jbd2_journal_recover() if necessary. I would advise reading
ext4_load_journal() in fs/ext4/super.c for examples of this stage.

Now you can go ahead and start modifying the underlying filesystem.
Almost.

You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. You also need to wrap the
modification of each buffer with calls to the journal layer, so that it
knows which modifications you are actually making. To do this, use
jbd2_journal_start(), which returns a transaction handle.

jbd2_journal_start() and its counterpart jbd2_journal_stop(),
which indicates the end of a transaction, are nestable calls, so you can
re-enter a transaction if necessary, but remember you must call
jbd2_journal_stop() the same number of times as
jbd2_journal_start() before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.

Inside each transaction you need to wrap the modifications to the
individual buffers (blocks).
Before you start to modify a buffer you
need to call jbd2_journal_get_create_access() /
jbd2_journal_get_write_access() /
jbd2_journal_get_undo_access() as appropriate; this allows the
journalling layer to copy the unmodified
data if it needs to. After all, the buffer may be part of a previously
uncommitted transaction. At this point you are at last ready to modify a
buffer, and once you have done so you need to call
jbd2_journal_dirty_metadata(). Or, if you've asked for access to a
buffer that you now know no longer needs to be pushed back to the
device, you can call jbd2_journal_forget() in much the same way as you
might have used bforget() in the past.

A jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.

Then at umount time, in your put_super(), you can call
jbd2_journal_destroy() to clean up your in-core journal object.

Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing
commits until the outermost jbd2_journal_stop(). This means you must
complete the transaction at the end of each file/inode/address etc.
operation you perform, so that the journalling system isn't re-entered
on another journal.
This matters because transactions can't be nested/batched across
differing journals, and a filesystem other than yours (say ext4) may be
modified in a later syscall.

The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param). When it blocks it merely(!) needs to wait
for transactions from other tasks to complete and be committed, so
essentially we are waiting for jbd2_journal_stop(). So to avoid
deadlocks you must treat jbd2_journal_start() /
jbd2_journal_stop() as if they were semaphores and include them in
your semaphore ordering rules. Note that jbd2_journal_extend() has
similar blocking behaviour to jbd2_journal_start(), so you can deadlock
here just as easily as on jbd2_journal_start().

Try to reserve the right number of blocks the first time. ;-). This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at least at ext4_jbd.h to see the
basis on which ext4 makes these decisions.

Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks as
they were at the end of the last fully committed transaction.
One simple way of doing
this is to mark blocks as free in internal in-memory block allocation
structures only after the transaction freeing them commits. Ext4 uses a
journal commit callback for this purpose.

With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
``journal->j_commit_callback`` function pointer, and that function is
called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.

JBD2 also provides a way to block all transaction updates via
jbd2_journal_lock_updates() /
jbd2_journal_unlock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::


    jbd2_journal_lock_updates()   // stop new stuff happening..
    jbd2_journal_flush()          // checkpoint everything.
    ..do stuff on stable fs
    jbd2_journal_unlock_updates() // carry on with filesystem use.

The opportunities for abuse and DoS attacks with this should be obvious
if you allow unprivileged userspace to trigger codepaths containing
these calls.
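
The deferred-free strategy described above (marking blocks as free in the
in-memory allocator only after the freeing transaction commits) can be
modelled in a few lines of userspace C. This is a sketch under stated
assumptions, not ext4's implementation: the ``toy_*`` names, the fixed-size
bitmap and the lowest-block-first allocation policy are all hypothetical, and
``toy_commit_callback()`` merely plays the role that a function installed in
``journal->j_commit_callback`` would play.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS 8

/* Toy in-memory allocator state; every name here is hypothetical. */
struct toy_alloc {
    bool in_use[TOY_NBLOCKS];        /* in-memory allocation bitmap */
    bool pending_free[TOY_NBLOCKS];  /* freed by a not-yet-committed transaction */
};

/* Allocate the lowest block not marked in use, or return -1 if none is left. */
int toy_alloc_block(struct toy_alloc *a)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        if (!a->in_use[i]) {
            a->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/*
 * Free a block inside a transaction: only record the request. The block
 * stays marked in use, so it cannot be handed out again before the commit.
 */
int toy_free_block(struct toy_alloc *a, int blk)
{
    if (blk < 0 || blk >= TOY_NBLOCKS || !a->in_use[blk])
        return -1;
    a->pending_free[blk] = true;
    return 0;
}

/* Plays the role of a commit callback: now the blocks really become free. */
void toy_commit_callback(struct toy_alloc *a)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        if (a->pending_free[i]) {
            a->in_use[i] = false;
            a->pending_free[i] = false;
        }
    }
}
```

With blocks 0 and 1 allocated, freeing block 0 and allocating again yields
block 2 in this model; block 0 is handed out again only after
``toy_commit_callback()`` has run, which is the property the text asks for.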

Fast commits
~~~~~~~~~~~~

JBD2 also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you will need to set the following
callbacks that perform the corresponding work:

`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.

`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
blocks.

The file system is free to perform fast commits as and when it wants, as long
as it gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
commit immediately after stopping the fast commit, it can do so by calling
:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if the fast commit
operation fails for some reason and the only way to guarantee consistency is
for JBD2 to perform the full traditional commit.

JBD2 also provides helper functions to manage fast commit buffers. The file
system can use :c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()`
to allocate and wait on IO completion of fast commit buffers.

Currently, only Ext4 implements fast commits.
For details of its implementation
of fast commits, please refer to the top level comments in
fs/ext4/fast_commit.c.

Summary
~~~~~~~

Using the journal is a matter of wrapping the different context changes,
namely each mount, each modification (transaction) and each changed
buffer, to tell the journalling layer about them.

Data Types
----------

The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced as this is 'C'.

Structures
~~~~~~~~~~

.. kernel-doc:: include/linux/jbd2.h
   :internal:

Functions
---------

The functions here are split into two groups: those that affect a journal
as a whole, and those which are used to manage transactions.

Journal Level
~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/journal.c
   :export:

.. kernel-doc:: fs/jbd2/recovery.c
   :internal:

Transaction Level
~~~~~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/transaction.c

See also
--------

`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__

`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__