xref: /OK3568_Linux_fs/kernel/Documentation/filesystems/xfs-delayed-logging-design.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun.. SPDX-License-Identifier: GPL-2.0
2*4882a593Smuzhiyun
3*4882a593Smuzhiyun==========================
4*4882a593SmuzhiyunXFS Delayed Logging Design
5*4882a593Smuzhiyun==========================
6*4882a593Smuzhiyun
7*4882a593SmuzhiyunIntroduction to Re-logging in XFS
8*4882a593Smuzhiyun=================================
9*4882a593Smuzhiyun
10*4882a593SmuzhiyunXFS logging is a combination of logical and physical logging. Some objects,
11*4882a593Smuzhiyunsuch as inodes and dquots, are logged in logical format where the details
12*4882a593Smuzhiyunlogged are made up of the changes to in-core structures rather than on-disk
13*4882a593Smuzhiyunstructures. Other objects - typically buffers - have their physical changes
14*4882a593Smuzhiyunlogged. The reason for these differences is to reduce the amount of log space
15*4882a593Smuzhiyunrequired for objects that are frequently logged. Some parts of inodes are more
16*4882a593Smuzhiyunfrequently logged than others, and inodes are typically more frequently logged
17*4882a593Smuzhiyunthan any other object (except maybe the superblock buffer) so keeping the
18*4882a593Smuzhiyunamount of metadata logged low is of prime importance.
19*4882a593Smuzhiyun
20*4882a593SmuzhiyunThe reason that this is such a concern is that XFS allows multiple separate
21*4882a593Smuzhiyunmodifications to a single object to be carried in the log at any given time.
22*4882a593SmuzhiyunThis allows the log to avoid needing to flush each change to disk before
23*4882a593Smuzhiyunrecording a new change to the object. XFS does this via a method called
24*4882a593Smuzhiyun"re-logging". Conceptually, this is quite simple - all it requires is that any
25*4882a593Smuzhiyunnew change to the object is recorded with a *new copy* of all the existing
26*4882a593Smuzhiyunchanges in the new transaction that is written to the log.
27*4882a593Smuzhiyun
28*4882a593SmuzhiyunThat is, if we have a sequence of changes A through to F, and the object was
29*4882a593Smuzhiyunwritten to disk after change D, we would see in the log the following series
30*4882a593Smuzhiyunof transactions, their contents and the log sequence number (LSN) of the
31*4882a593Smuzhiyuntransaction::
32*4882a593Smuzhiyun
33*4882a593Smuzhiyun	Transaction		Contents	LSN
34*4882a593Smuzhiyun	   A			   A		   X
35*4882a593Smuzhiyun	   B			  A+B		  X+n
36*4882a593Smuzhiyun	   C			 A+B+C		 X+n+m
37*4882a593Smuzhiyun	   D			A+B+C+D		X+n+m+o
38*4882a593Smuzhiyun	    <object written to disk>
39*4882a593Smuzhiyun	   E			   E		   Y (> X+n+m+o)
40*4882a593Smuzhiyun	   F			  E+F		  Y+p
41*4882a593Smuzhiyun
42*4882a593SmuzhiyunIn other words, each time an object is relogged, the new transaction contains
43*4882a593Smuzhiyunthe aggregation of all the previous changes currently held only in the log.
44*4882a593Smuzhiyun
45*4882a593SmuzhiyunThis relogging technique also allows objects to be moved forward in the log so
46*4882a593Smuzhiyunthat an object being relogged does not prevent the tail of the log from ever
47*4882a593Smuzhiyunmoving forward.  This can be seen in the table above by the changing
48*4882a593Smuzhiyun(increasing) LSN of each subsequent transaction - the LSN is effectively a
49*4882a593Smuzhiyundirect encoding of the location in the log of the transaction.
50*4882a593Smuzhiyun
51*4882a593SmuzhiyunThis relogging is also used to implement long-running, multiple-commit
52*4882a593Smuzhiyuntransactions.  These transaction are known as rolling transactions, and require
53*4882a593Smuzhiyuna special log reservation known as a permanent transaction reservation. A
54*4882a593Smuzhiyuntypical example of a rolling transaction is the removal of extents from an
55*4882a593Smuzhiyuninode which can only be done at a rate of two extents per transaction because
56*4882a593Smuzhiyunof reservation size limitations. Hence a rolling extent removal transaction
57*4882a593Smuzhiyunkeeps relogging the inode and btree buffers as they get modified in each
58*4882a593Smuzhiyunremoval operation. This keeps them moving forward in the log as the operation
59*4882a593Smuzhiyunprogresses, ensuring that current operation never gets blocked by itself if the
60*4882a593Smuzhiyunlog wraps around.
61*4882a593Smuzhiyun
62*4882a593SmuzhiyunHence it can be seen that the relogging operation is fundamental to the correct
63*4882a593Smuzhiyunworking of the XFS journalling subsystem. From the above description, most
64*4882a593Smuzhiyunpeople should be able to see why the XFS metadata operations writes so much to
65*4882a593Smuzhiyunthe log - repeated operations to the same objects write the same changes to
66*4882a593Smuzhiyunthe log over and over again. Worse is the fact that objects tend to get
67*4882a593Smuzhiyundirtier as they get relogged, so each subsequent transaction is writing more
68*4882a593Smuzhiyunmetadata into the log.
69*4882a593Smuzhiyun
70*4882a593SmuzhiyunAnother feature of the XFS transaction subsystem is that most transactions are
71*4882a593Smuzhiyunasynchronous. That is, they don't commit to disk until either a log buffer is
72*4882a593Smuzhiyunfilled (a log buffer can hold multiple transactions) or a synchronous operation
73*4882a593Smuzhiyunforces the log buffers holding the transactions to disk. This means that XFS is
74*4882a593Smuzhiyundoing aggregation of transactions in memory - batching them, if you like - to
75*4882a593Smuzhiyunminimise the impact of the log IO on transaction throughput.
76*4882a593Smuzhiyun
77*4882a593SmuzhiyunThe limitation on asynchronous transaction throughput is the number and size of
78*4882a593Smuzhiyunlog buffers made available by the log manager. By default there are 8 log
79*4882a593Smuzhiyunbuffers available and the size of each is 32kB - the size can be increased up
80*4882a593Smuzhiyunto 256kB by use of a mount option.
81*4882a593Smuzhiyun
82*4882a593SmuzhiyunEffectively, this gives us the maximum bound of outstanding metadata changes
83*4882a593Smuzhiyunthat can be made to the filesystem at any point in time - if all the log
84*4882a593Smuzhiyunbuffers are full and under IO, then no more transactions can be committed until
85*4882a593Smuzhiyunthe current batch completes. It is now common for a single current CPU core to
86*4882a593Smuzhiyunbe to able to issue enough transactions to keep the log buffers full and under
87*4882a593SmuzhiyunIO permanently. Hence the XFS journalling subsystem can be considered to be IO
88*4882a593Smuzhiyunbound.
89*4882a593Smuzhiyun
90*4882a593SmuzhiyunDelayed Logging: Concepts
91*4882a593Smuzhiyun=========================
92*4882a593Smuzhiyun
93*4882a593SmuzhiyunThe key thing to note about the asynchronous logging combined with the
94*4882a593Smuzhiyunrelogging technique XFS uses is that we can be relogging changed objects
95*4882a593Smuzhiyunmultiple times before they are committed to disk in the log buffers. If we
96*4882a593Smuzhiyunreturn to the previous relogging example, it is entirely possible that
97*4882a593Smuzhiyuntransactions A through D are committed to disk in the same log buffer.
98*4882a593Smuzhiyun
99*4882a593SmuzhiyunThat is, a single log buffer may contain multiple copies of the same object,
100*4882a593Smuzhiyunbut only one of those copies needs to be there - the last one "D", as it
101*4882a593Smuzhiyuncontains all the changes from the previous changes. In other words, we have one
102*4882a593Smuzhiyunnecessary copy in the log buffer, and three stale copies that are simply
103*4882a593Smuzhiyunwasting space. When we are doing repeated operations on the same set of
104*4882a593Smuzhiyunobjects, these "stale objects" can be over 90% of the space used in the log
105*4882a593Smuzhiyunbuffers. It is clear that reducing the number of stale objects written to the
106*4882a593Smuzhiyunlog would greatly reduce the amount of metadata we write to the log, and this
107*4882a593Smuzhiyunis the fundamental goal of delayed logging.
108*4882a593Smuzhiyun
109*4882a593SmuzhiyunFrom a conceptual point of view, XFS is already doing relogging in memory (where
110*4882a593Smuzhiyunmemory == log buffer), only it is doing it extremely inefficiently. It is using
111*4882a593Smuzhiyunlogical to physical formatting to do the relogging because there is no
112*4882a593Smuzhiyuninfrastructure to keep track of logical changes in memory prior to physically
113*4882a593Smuzhiyunformatting the changes in a transaction to the log buffer. Hence we cannot avoid
114*4882a593Smuzhiyunaccumulating stale objects in the log buffers.
115*4882a593Smuzhiyun
116*4882a593SmuzhiyunDelayed logging is the name we've given to keeping and tracking transactional
117*4882a593Smuzhiyunchanges to objects in memory outside the log buffer infrastructure. Because of
118*4882a593Smuzhiyunthe relogging concept fundamental to the XFS journalling subsystem, this is
119*4882a593Smuzhiyunactually relatively easy to do - all the changes to logged items are already
120*4882a593Smuzhiyuntracked in the current infrastructure. The big problem is how to accumulate
121*4882a593Smuzhiyunthem and get them to the log in a consistent, recoverable manner.
122*4882a593SmuzhiyunDescribing the problems and how they have been solved is the focus of this
123*4882a593Smuzhiyundocument.
124*4882a593Smuzhiyun
125*4882a593SmuzhiyunOne of the key changes that delayed logging makes to the operation of the
126*4882a593Smuzhiyunjournalling subsystem is that it disassociates the amount of outstanding
127*4882a593Smuzhiyunmetadata changes from the size and number of log buffers available. In other
128*4882a593Smuzhiyunwords, instead of there only being a maximum of 2MB of transaction changes not
129*4882a593Smuzhiyunwritten to the log at any point in time, there may be a much greater amount
130*4882a593Smuzhiyunbeing accumulated in memory. Hence the potential for loss of metadata on a
131*4882a593Smuzhiyuncrash is much greater than for the existing logging mechanism.
132*4882a593Smuzhiyun
133*4882a593SmuzhiyunIt should be noted that this does not change the guarantee that log recovery
134*4882a593Smuzhiyunwill result in a consistent filesystem. What it does mean is that as far as the
135*4882a593Smuzhiyunrecovered filesystem is concerned, there may be many thousands of transactions
136*4882a593Smuzhiyunthat simply did not occur as a result of the crash. This makes it even more
137*4882a593Smuzhiyunimportant that applications that care about their data use fsync() where they
138*4882a593Smuzhiyunneed to ensure application level data integrity is maintained.
139*4882a593Smuzhiyun
140*4882a593SmuzhiyunIt should be noted that delayed logging is not an innovative new concept that
141*4882a593Smuzhiyunwarrants rigorous proofs to determine whether it is correct or not. The method
142*4882a593Smuzhiyunof accumulating changes in memory for some period before writing them to the
143*4882a593Smuzhiyunlog is used effectively in many filesystems including ext3 and ext4. Hence
144*4882a593Smuzhiyunno time is spent in this document trying to convince the reader that the
145*4882a593Smuzhiyunconcept is sound. Instead it is simply considered a "solved problem" and as
146*4882a593Smuzhiyunsuch implementing it in XFS is purely an exercise in software engineering.
147*4882a593Smuzhiyun
148*4882a593SmuzhiyunThe fundamental requirements for delayed logging in XFS are simple:
149*4882a593Smuzhiyun
150*4882a593Smuzhiyun	1. Reduce the amount of metadata written to the log by at least
151*4882a593Smuzhiyun	   an order of magnitude.
152*4882a593Smuzhiyun	2. Supply sufficient statistics to validate Requirement #1.
153*4882a593Smuzhiyun	3. Supply sufficient new tracing infrastructure to be able to debug
154*4882a593Smuzhiyun	   problems with the new code.
155*4882a593Smuzhiyun	4. No on-disk format change (metadata or log format).
156*4882a593Smuzhiyun	5. Enable and disable with a mount option.
157*4882a593Smuzhiyun	6. No performance regressions for synchronous transaction workloads.
158*4882a593Smuzhiyun
159*4882a593SmuzhiyunDelayed Logging: Design
160*4882a593Smuzhiyun=======================
161*4882a593Smuzhiyun
162*4882a593SmuzhiyunStoring Changes
163*4882a593Smuzhiyun---------------
164*4882a593Smuzhiyun
165*4882a593SmuzhiyunThe problem with accumulating changes at a logical level (i.e. just using the
166*4882a593Smuzhiyunexisting log item dirty region tracking) is that when it comes to writing the
167*4882a593Smuzhiyunchanges to the log buffers, we need to ensure that the object we are formatting
168*4882a593Smuzhiyunis not changing while we do this. This requires locking the object to prevent
169*4882a593Smuzhiyunconcurrent modification. Hence flushing the logical changes to the log would
170*4882a593Smuzhiyunrequire us to lock every object, format them, and then unlock them again.
171*4882a593Smuzhiyun
172*4882a593SmuzhiyunThis introduces lots of scope for deadlocks with transactions that are already
173*4882a593Smuzhiyunrunning. For example, a transaction has object A locked and modified, but needs
174*4882a593Smuzhiyunthe delayed logging tracking lock to commit the transaction. However, the
175*4882a593Smuzhiyunflushing thread has the delayed logging tracking lock already held, and is
176*4882a593Smuzhiyuntrying to get the lock on object A to flush it to the log buffer. This appears
177*4882a593Smuzhiyunto be an unsolvable deadlock condition, and it was solving this problem that
178*4882a593Smuzhiyunwas the barrier to implementing delayed logging for so long.
179*4882a593Smuzhiyun
180*4882a593SmuzhiyunThe solution is relatively simple - it just took a long time to recognise it.
181*4882a593SmuzhiyunPut simply, the current logging code formats the changes to each item into an
182*4882a593Smuzhiyunvector array that points to the changed regions in the item. The log write code
183*4882a593Smuzhiyunsimply copies the memory these vectors point to into the log buffer during
184*4882a593Smuzhiyuntransaction commit while the item is locked in the transaction. Instead of
185*4882a593Smuzhiyunusing the log buffer as the destination of the formatting code, we can use an
186*4882a593Smuzhiyunallocated memory buffer big enough to fit the formatted vector.
187*4882a593Smuzhiyun
188*4882a593SmuzhiyunIf we then copy the vector into the memory buffer and rewrite the vector to
189*4882a593Smuzhiyunpoint to the memory buffer rather than the object itself, we now have a copy of
190*4882a593Smuzhiyunthe changes in a format that is compatible with the log buffer writing code.
191*4882a593Smuzhiyunthat does not require us to lock the item to access. This formatting and
192*4882a593Smuzhiyunrewriting can all be done while the object is locked during transaction commit,
193*4882a593Smuzhiyunresulting in a vector that is transactionally consistent and can be accessed
194*4882a593Smuzhiyunwithout needing to lock the owning item.
195*4882a593Smuzhiyun
196*4882a593SmuzhiyunHence we avoid the need to lock items when we need to flush outstanding
197*4882a593Smuzhiyunasynchronous transactions to the log. The differences between the existing
198*4882a593Smuzhiyunformatting method and the delayed logging formatting can be seen in the
199*4882a593Smuzhiyundiagram below.
200*4882a593Smuzhiyun
201*4882a593SmuzhiyunCurrent format log vector::
202*4882a593Smuzhiyun
203*4882a593Smuzhiyun    Object    +---------------------------------------------+
204*4882a593Smuzhiyun    Vector 1      +----+
205*4882a593Smuzhiyun    Vector 2                    +----+
206*4882a593Smuzhiyun    Vector 3                                   +----------+
207*4882a593Smuzhiyun
208*4882a593SmuzhiyunAfter formatting::
209*4882a593Smuzhiyun
210*4882a593Smuzhiyun    Log Buffer    +-V1-+-V2-+----V3----+
211*4882a593Smuzhiyun
212*4882a593SmuzhiyunDelayed logging vector::
213*4882a593Smuzhiyun
214*4882a593Smuzhiyun    Object    +---------------------------------------------+
215*4882a593Smuzhiyun    Vector 1      +----+
216*4882a593Smuzhiyun    Vector 2                    +----+
217*4882a593Smuzhiyun    Vector 3                                   +----------+
218*4882a593Smuzhiyun
219*4882a593SmuzhiyunAfter formatting::
220*4882a593Smuzhiyun
221*4882a593Smuzhiyun    Memory Buffer +-V1-+-V2-+----V3----+
222*4882a593Smuzhiyun    Vector 1      +----+
223*4882a593Smuzhiyun    Vector 2           +----+
224*4882a593Smuzhiyun    Vector 3                +----------+
225*4882a593Smuzhiyun
226*4882a593SmuzhiyunThe memory buffer and associated vector need to be passed as a single object,
227*4882a593Smuzhiyunbut still need to be associated with the parent object so if the object is
228*4882a593Smuzhiyunrelogged we can replace the current memory buffer with a new memory buffer that
229*4882a593Smuzhiyuncontains the latest changes.
230*4882a593Smuzhiyun
231*4882a593SmuzhiyunThe reason for keeping the vector around after we've formatted the memory
232*4882a593Smuzhiyunbuffer is to support splitting vectors across log buffer boundaries correctly.
233*4882a593SmuzhiyunIf we don't keep the vector around, we do not know where the region boundaries
234*4882a593Smuzhiyunare in the item, so we'd need a new encapsulation method for regions in the log
235*4882a593Smuzhiyunbuffer writing (i.e. double encapsulation). This would be an on-disk format
236*4882a593Smuzhiyunchange and as such is not desirable.  It also means we'd have to write the log
237*4882a593Smuzhiyunregion headers in the formatting stage, which is problematic as there is per
238*4882a593Smuzhiyunregion state that needs to be placed into the headers during the log write.
239*4882a593Smuzhiyun
240*4882a593SmuzhiyunHence we need to keep the vector, but by attaching the memory buffer to it and
241*4882a593Smuzhiyunrewriting the vector addresses to point at the memory buffer we end up with a
242*4882a593Smuzhiyunself-describing object that can be passed to the log buffer write code to be
243*4882a593Smuzhiyunhandled in exactly the same manner as the existing log vectors are handled.
244*4882a593SmuzhiyunHence we avoid needing a new on-disk format to handle items that have been
245*4882a593Smuzhiyunrelogged in memory.
246*4882a593Smuzhiyun
247*4882a593Smuzhiyun
248*4882a593SmuzhiyunTracking Changes
249*4882a593Smuzhiyun----------------
250*4882a593Smuzhiyun
251*4882a593SmuzhiyunNow that we can record transactional changes in memory in a form that allows
252*4882a593Smuzhiyunthem to be used without limitations, we need to be able to track and accumulate
253*4882a593Smuzhiyunthem so that they can be written to the log at some later point in time.  The
254*4882a593Smuzhiyunlog item is the natural place to store this vector and buffer, and also makes sense
255*4882a593Smuzhiyunto be the object that is used to track committed objects as it will always
256*4882a593Smuzhiyunexist once the object has been included in a transaction.
257*4882a593Smuzhiyun
258*4882a593SmuzhiyunThe log item is already used to track the log items that have been written to
259*4882a593Smuzhiyunthe log but not yet written to disk. Such log items are considered "active"
260*4882a593Smuzhiyunand as such are stored in the Active Item List (AIL) which is a LSN-ordered
261*4882a593Smuzhiyundouble linked list. Items are inserted into this list during log buffer IO
262*4882a593Smuzhiyuncompletion, after which they are unpinned and can be written to disk. An object
263*4882a593Smuzhiyunthat is in the AIL can be relogged, which causes the object to be pinned again
264*4882a593Smuzhiyunand then moved forward in the AIL when the log buffer IO completes for that
265*4882a593Smuzhiyuntransaction.
266*4882a593Smuzhiyun
267*4882a593SmuzhiyunEssentially, this shows that an item that is in the AIL can still be modified
268*4882a593Smuzhiyunand relogged, so any tracking must be separate to the AIL infrastructure. As
269*4882a593Smuzhiyunsuch, we cannot reuse the AIL list pointers for tracking committed items, nor
270*4882a593Smuzhiyuncan we store state in any field that is protected by the AIL lock. Hence the
271*4882a593Smuzhiyuncommitted item tracking needs it's own locks, lists and state fields in the log
272*4882a593Smuzhiyunitem.
273*4882a593Smuzhiyun
274*4882a593SmuzhiyunSimilar to the AIL, tracking of committed items is done through a new list
275*4882a593Smuzhiyuncalled the Committed Item List (CIL).  The list tracks log items that have been
276*4882a593Smuzhiyuncommitted and have formatted memory buffers attached to them. It tracks objects
277*4882a593Smuzhiyunin transaction commit order, so when an object is relogged it is removed from
278*4882a593Smuzhiyunit's place in the list and re-inserted at the tail. This is entirely arbitrary
279*4882a593Smuzhiyunand done to make it easy for debugging - the last items in the list are the
280*4882a593Smuzhiyunones that are most recently modified. Ordering of the CIL is not necessary for
281*4882a593Smuzhiyuntransactional integrity (as discussed in the next section) so the ordering is
282*4882a593Smuzhiyundone for convenience/sanity of the developers.
283*4882a593Smuzhiyun
284*4882a593Smuzhiyun
285*4882a593SmuzhiyunDelayed Logging: Checkpoints
286*4882a593Smuzhiyun----------------------------
287*4882a593Smuzhiyun
288*4882a593SmuzhiyunWhen we have a log synchronisation event, commonly known as a "log force",
289*4882a593Smuzhiyunall the items in the CIL must be written into the log via the log buffers.
290*4882a593SmuzhiyunWe need to write these items in the order that they exist in the CIL, and they
291*4882a593Smuzhiyunneed to be written as an atomic transaction. The need for all the objects to be
292*4882a593Smuzhiyunwritten as an atomic transaction comes from the requirements of relogging and
293*4882a593Smuzhiyunlog replay - all the changes in all the objects in a given transaction must
294*4882a593Smuzhiyuneither be completely replayed during log recovery, or not replayed at all. If
295*4882a593Smuzhiyuna transaction is not replayed because it is not complete in the log, then
296*4882a593Smuzhiyunno later transactions should be replayed, either.
297*4882a593Smuzhiyun
298*4882a593SmuzhiyunTo fulfill this requirement, we need to write the entire CIL in a single log
299*4882a593Smuzhiyuntransaction. Fortunately, the XFS log code has no fixed limit on the size of a
300*4882a593Smuzhiyuntransaction, nor does the log replay code. The only fundamental limit is that
301*4882a593Smuzhiyunthe transaction cannot be larger than just under half the size of the log.  The
302*4882a593Smuzhiyunreason for this limit is that to find the head and tail of the log, there must
303*4882a593Smuzhiyunbe at least one complete transaction in the log at any given time. If a
304*4882a593Smuzhiyuntransaction is larger than half the log, then there is the possibility that a
305*4882a593Smuzhiyuncrash during the write of a such a transaction could partially overwrite the
306*4882a593Smuzhiyunonly complete previous transaction in the log. This will result in a recovery
307*4882a593Smuzhiyunfailure and an inconsistent filesystem and hence we must enforce the maximum
308*4882a593Smuzhiyunsize of a checkpoint to be slightly less than a half the log.
309*4882a593Smuzhiyun
310*4882a593SmuzhiyunApart from this size requirement, a checkpoint transaction looks no different
311*4882a593Smuzhiyunto any other transaction - it contains a transaction header, a series of
312*4882a593Smuzhiyunformatted log items and a commit record at the tail. From a recovery
313*4882a593Smuzhiyunperspective, the checkpoint transaction is also no different - just a lot
314*4882a593Smuzhiyunbigger with a lot more items in it. The worst case effect of this is that we
315*4882a593Smuzhiyunmight need to tune the recovery transaction object hash size.
316*4882a593Smuzhiyun
317*4882a593SmuzhiyunBecause the checkpoint is just another transaction and all the changes to log
318*4882a593Smuzhiyunitems are stored as log vectors, we can use the existing log buffer writing
319*4882a593Smuzhiyuncode to write the changes into the log. To do this efficiently, we need to
320*4882a593Smuzhiyunminimise the time we hold the CIL locked while writing the checkpoint
321*4882a593Smuzhiyuntransaction. The current log write code enables us to do this easily with the
322*4882a593Smuzhiyunway it separates the writing of the transaction contents (the log vectors) from
323*4882a593Smuzhiyunthe transaction commit record, but tracking this requires us to have a
324*4882a593Smuzhiyunper-checkpoint context that travels through the log write process through to
325*4882a593Smuzhiyuncheckpoint completion.
326*4882a593Smuzhiyun
327*4882a593SmuzhiyunHence a checkpoint has a context that tracks the state of the current
328*4882a593Smuzhiyuncheckpoint from initiation to checkpoint completion. A new context is initiated
329*4882a593Smuzhiyunat the same time a checkpoint transaction is started. That is, when we remove
330*4882a593Smuzhiyunall the current items from the CIL during a checkpoint operation, we move all
331*4882a593Smuzhiyunthose changes into the current checkpoint context. We then initialise a new
332*4882a593Smuzhiyuncontext and attach that to the CIL for aggregation of new transactions.
333*4882a593Smuzhiyun
334*4882a593SmuzhiyunThis allows us to unlock the CIL immediately after transfer of all the
335*4882a593Smuzhiyuncommitted items and effectively allow new transactions to be issued while we
336*4882a593Smuzhiyunare formatting the checkpoint into the log. It also allows concurrent
337*4882a593Smuzhiyuncheckpoints to be written into the log buffers in the case of log force heavy
338*4882a593Smuzhiyunworkloads, just like the existing transaction commit code does. This, however,
339*4882a593Smuzhiyunrequires that we strictly order the commit records in the log so that
340*4882a593Smuzhiyuncheckpoint sequence order is maintained during log replay.
341*4882a593Smuzhiyun
342*4882a593SmuzhiyunTo ensure that we can be writing an item into a checkpoint transaction at
343*4882a593Smuzhiyunthe same time another transaction modifies the item and inserts the log item
344*4882a593Smuzhiyuninto the new CIL, then checkpoint transaction commit code cannot use log items
345*4882a593Smuzhiyunto store the list of log vectors that need to be written into the transaction.
346*4882a593SmuzhiyunHence log vectors need to be able to be chained together to allow them to be
347*4882a593Smuzhiyundetached from the log items. That is, when the CIL is flushed the memory
348*4882a593Smuzhiyunbuffer and log vector attached to each log item needs to be attached to the
349*4882a593Smuzhiyuncheckpoint context so that the log item can be released. In diagrammatic form,
350*4882a593Smuzhiyunthe CIL would look like this before the flush::
351*4882a593Smuzhiyun
352*4882a593Smuzhiyun	CIL Head
353*4882a593Smuzhiyun	   |
354*4882a593Smuzhiyun	   V
355*4882a593Smuzhiyun	Log Item <-> log vector 1	-> memory buffer
356*4882a593Smuzhiyun	   |				-> vector array
357*4882a593Smuzhiyun	   V
358*4882a593Smuzhiyun	Log Item <-> log vector 2	-> memory buffer
359*4882a593Smuzhiyun	   |				-> vector array
360*4882a593Smuzhiyun	   V
361*4882a593Smuzhiyun	......
362*4882a593Smuzhiyun	   |
363*4882a593Smuzhiyun	   V
364*4882a593Smuzhiyun	Log Item <-> log vector N-1	-> memory buffer
365*4882a593Smuzhiyun	   |				-> vector array
366*4882a593Smuzhiyun	   V
367*4882a593Smuzhiyun	Log Item <-> log vector N	-> memory buffer
368*4882a593Smuzhiyun					-> vector array
369*4882a593Smuzhiyun
370*4882a593SmuzhiyunAnd after the flush the CIL head is empty, and the checkpoint context log
371*4882a593Smuzhiyunvector list would look like::
372*4882a593Smuzhiyun
373*4882a593Smuzhiyun	Checkpoint Context
374*4882a593Smuzhiyun	   |
375*4882a593Smuzhiyun	   V
376*4882a593Smuzhiyun	log vector 1	-> memory buffer
377*4882a593Smuzhiyun	   |		-> vector array
378*4882a593Smuzhiyun	   |		-> Log Item
379*4882a593Smuzhiyun	   V
380*4882a593Smuzhiyun	log vector 2	-> memory buffer
381*4882a593Smuzhiyun	   |		-> vector array
382*4882a593Smuzhiyun	   |		-> Log Item
383*4882a593Smuzhiyun	   V
384*4882a593Smuzhiyun	......
385*4882a593Smuzhiyun	   |
386*4882a593Smuzhiyun	   V
387*4882a593Smuzhiyun	log vector N-1	-> memory buffer
388*4882a593Smuzhiyun	   |		-> vector array
389*4882a593Smuzhiyun	   |		-> Log Item
390*4882a593Smuzhiyun	   V
391*4882a593Smuzhiyun	log vector N	-> memory buffer
392*4882a593Smuzhiyun			-> vector array
393*4882a593Smuzhiyun			-> Log Item
394*4882a593Smuzhiyun
395*4882a593SmuzhiyunOnce this transfer is done, the CIL can be unlocked and new transactions can
396*4882a593Smuzhiyunstart, while the checkpoint flush code works over the log vector chain to
397*4882a593Smuzhiyuncommit the checkpoint.
398*4882a593Smuzhiyun
399*4882a593SmuzhiyunOnce the checkpoint is written into the log buffers, the checkpoint context is
400*4882a593Smuzhiyunattached to the log buffer that the commit record was written to along with a
401*4882a593Smuzhiyuncompletion callback. Log IO completion will call that callback, which can then
402*4882a593Smuzhiyunrun transaction committed processing for the log items (i.e. insert into AIL
403*4882a593Smuzhiyunand unpin) in the log vector chain and then free the log vector chain and
404*4882a593Smuzhiyuncheckpoint context.
405*4882a593Smuzhiyun
406*4882a593SmuzhiyunDiscussion Point: I am uncertain as to whether the log item is the most
407*4882a593Smuzhiyunefficient way to track vectors, even though it seems like the natural way to do
408*4882a593Smuzhiyunit. The fact that we walk the log items (in the CIL) just to chain the log
409*4882a593Smuzhiyunvectors and break the link between the log item and the log vector means that
410*4882a593Smuzhiyunwe take a cache line hit for the log item list modification, then another for
411*4882a593Smuzhiyunthe log vector chaining. If we track by the log vectors, then we only need to
412*4882a593Smuzhiyunbreak the link between the log item and the log vector, which means we should
413*4882a593Smuzhiyundirty only the log item cachelines. Normally I wouldn't be concerned about one
414*4882a593Smuzhiyunvs two dirty cachelines except for the fact I've seen upwards of 80,000 log
415*4882a593Smuzhiyunvectors in one checkpoint transaction. I'd guess this is a "measure and
416*4882a593Smuzhiyuncompare" situation that can be done after a working and reviewed implementation
417*4882a593Smuzhiyunis in the dev tree....
418*4882a593Smuzhiyun
419*4882a593SmuzhiyunDelayed Logging: Checkpoint Sequencing
420*4882a593Smuzhiyun--------------------------------------
421*4882a593Smuzhiyun
422*4882a593SmuzhiyunOne of the key aspects of the XFS transaction subsystem is that it tags
423*4882a593Smuzhiyuncommitted transactions with the log sequence number of the transaction commit.
424*4882a593SmuzhiyunThis allows transactions to be issued asynchronously even though there may be
425*4882a593Smuzhiyunfuture operations that cannot be completed until that transaction is fully
426*4882a593Smuzhiyuncommitted to the log. In the rare case that a dependent operation occurs (e.g.
427*4882a593Smuzhiyunre-using a freed metadata extent for a data extent), a special, optimised log
428*4882a593Smuzhiyunforce can be issued to force the dependent transaction to disk immediately.
429*4882a593Smuzhiyun
430*4882a593SmuzhiyunTo do this, transactions need to record the LSN of the commit record of the
431*4882a593Smuzhiyuntransaction. This LSN comes directly from the log buffer the transaction is
432*4882a593Smuzhiyunwritten into. While this works just fine for the existing transaction
433*4882a593Smuzhiyunmechanism, it does not work for delayed logging because transactions are not
434*4882a593Smuzhiyunwritten directly into the log buffers. Hence some other method of sequencing
435*4882a593Smuzhiyuntransactions is required.
436*4882a593Smuzhiyun
437*4882a593SmuzhiyunAs discussed in the checkpoint section, delayed logging uses per-checkpoint
438*4882a593Smuzhiyuncontexts, and as such it is simple to assign a sequence number to each
439*4882a593Smuzhiyuncheckpoint. Because the switching of checkpoint contexts must be done
440*4882a593Smuzhiyunatomically, it is simple to ensure that each new context has a monotonically
441*4882a593Smuzhiyunincreasing sequence number assigned to it without the need for an external
442*4882a593Smuzhiyunatomic counter - we can just take the current context sequence number and add
443*4882a593Smuzhiyunone to it for the new context.
444*4882a593Smuzhiyun
445*4882a593SmuzhiyunThen, instead of assigning a log buffer LSN to the transaction commit LSN
446*4882a593Smuzhiyunduring the commit, we can assign the current checkpoint sequence. This allows
447*4882a593Smuzhiyunoperations that track transactions that have not yet completed know what
448*4882a593Smuzhiyuncheckpoint sequence needs to be committed before they can continue. As a
449*4882a593Smuzhiyunresult, the code that forces the log to a specific LSN now needs to ensure that
450*4882a593Smuzhiyunthe log forces to a specific checkpoint.
451*4882a593Smuzhiyun
452*4882a593SmuzhiyunTo ensure that we can do this, we need to track all the checkpoint contexts
453*4882a593Smuzhiyunthat are currently committing to the log. When we flush a checkpoint, the
454*4882a593Smuzhiyuncontext gets added to a "committing" list which can be searched. When a
455*4882a593Smuzhiyuncheckpoint commit completes, it is removed from the committing list. Because
456*4882a593Smuzhiyunthe checkpoint context records the LSN of the commit record for the checkpoint,
457*4882a593Smuzhiyunwe can also wait on the log buffer that contains the commit record, thereby
458*4882a593Smuzhiyunusing the existing log force mechanisms to execute synchronous forces.
459*4882a593Smuzhiyun
460*4882a593SmuzhiyunIt should be noted that the synchronous forces may need to be extended with
461*4882a593Smuzhiyunmitigation algorithms similar to the current log buffer code to allow
462*4882a593Smuzhiyunaggregation of multiple synchronous transactions if there are already
463*4882a593Smuzhiyunsynchronous transactions being flushed. Investigation of the performance of the
464*4882a593Smuzhiyuncurrent design is needed before making any decisions here.
465*4882a593Smuzhiyun
466*4882a593SmuzhiyunThe main concern with log forces is to ensure that all the previous checkpoints
467*4882a593Smuzhiyunare also committed to disk before the one we need to wait for. Therefore we
468*4882a593Smuzhiyunneed to check that all the prior contexts in the committing list are also
469*4882a593Smuzhiyuncomplete before waiting on the one we need to complete. We do this
470*4882a593Smuzhiyunsynchronisation in the log force code so that we don't need to wait anywhere
471*4882a593Smuzhiyunelse for such serialisation - it only matters when we do a log force.
472*4882a593Smuzhiyun
473*4882a593SmuzhiyunThe only remaining complexity is that a log force now also has to handle the
474*4882a593Smuzhiyuncase where the forcing sequence number is the same as the current context. That
475*4882a593Smuzhiyunis, we need to flush the CIL and potentially wait for it to complete. This is a
476*4882a593Smuzhiyunsimple addition to the existing log forcing code to check the sequence numbers
477*4882a593Smuzhiyunand push if required. Indeed, placing the current sequence checkpoint flush in
478*4882a593Smuzhiyunthe log force code enables the current mechanism for issuing synchronous
479*4882a593Smuzhiyuntransactions to remain untouched (i.e. commit an asynchronous transaction, then
480*4882a593Smuzhiyunforce the log at the LSN of that transaction) and so the higher level code
481*4882a593Smuzhiyunbehaves the same regardless of whether delayed logging is being used or not.
482*4882a593Smuzhiyun
483*4882a593SmuzhiyunDelayed Logging: Checkpoint Log Space Accounting
484*4882a593Smuzhiyun------------------------------------------------
485*4882a593Smuzhiyun
486*4882a593SmuzhiyunThe big issue for a checkpoint transaction is the log space reservation for the
487*4882a593Smuzhiyuntransaction. We don't know how big a checkpoint transaction is going to be
488*4882a593Smuzhiyunahead of time, nor how many log buffers it will take to write out, nor the
489*4882a593Smuzhiyunnumber of split log vector regions are going to be used. We can track the
490*4882a593Smuzhiyunamount of log space required as we add items to the commit item list, but we
491*4882a593Smuzhiyunstill need to reserve the space in the log for the checkpoint.
492*4882a593Smuzhiyun
493*4882a593SmuzhiyunA typical transaction reserves enough space in the log for the worst case space
494*4882a593Smuzhiyunusage of the transaction. The reservation accounts for log record headers,
495*4882a593Smuzhiyuntransaction and region headers, headers for split regions, buffer tail padding,
496*4882a593Smuzhiyunetc. as well as the actual space for all the changed metadata in the
497*4882a593Smuzhiyuntransaction. While some of this is fixed overhead, much of it is dependent on
498*4882a593Smuzhiyunthe size of the transaction and the number of regions being logged (the number
499*4882a593Smuzhiyunof log vectors in the transaction).
500*4882a593Smuzhiyun
501*4882a593SmuzhiyunAn example of the differences would be logging directory changes versus logging
502*4882a593Smuzhiyuninode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
503*4882a593Smuzhiyunthere are lots of transactions that only contain an inode core and an inode log
504*4882a593Smuzhiyunformat structure. That is, two vectors totaling roughly 150 bytes. If we modify
505*4882a593Smuzhiyun10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
506*4882a593Smuzhiyunvector is 12 bytes, so the total to be logged is approximately 1.75MB. In
507*4882a593Smuzhiyuncomparison, if we are logging full directory buffers, they are typically 4KB
508*4882a593Smuzhiyuneach, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
509*4882a593Smuzhiyunbuffer format structure for each buffer - roughly 800 vectors or 1.51MB total
510*4882a593Smuzhiyunspace.  From this, it should be obvious that a static log space reservation is
511*4882a593Smuzhiyunnot particularly flexible and is difficult to select the "optimal value" for
512*4882a593Smuzhiyunall workloads.
513*4882a593Smuzhiyun
514*4882a593SmuzhiyunFurther, if we are going to use a static reservation, which bit of the entire
515*4882a593Smuzhiyunreservation does it cover? We account for space used by the transaction
516*4882a593Smuzhiyunreservation by tracking the space currently used by the object in the CIL and
517*4882a593Smuzhiyunthen calculating the increase or decrease in space used as the object is
518*4882a593Smuzhiyunrelogged. This allows for a checkpoint reservation to only have to account for
519*4882a593Smuzhiyunlog buffer metadata used such as log header records.
520*4882a593Smuzhiyun
521*4882a593SmuzhiyunHowever, even using a static reservation for just the log metadata is
522*4882a593Smuzhiyunproblematic. Typically log record headers use at least 16KB of log space per
523*4882a593Smuzhiyun1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
524*4882a593Smuzhiyunlarge enough to handle arbitrary sized checkpoint transactions. This
525*4882a593Smuzhiyunreservation needs to be made before the checkpoint is started, and we need to
526*4882a593Smuzhiyunbe able to reserve the space without sleeping.  For a 8MB checkpoint, we need a
527*4882a593Smuzhiyunreservation of around 150KB, which is a non-trivial amount of space.
528*4882a593Smuzhiyun
529*4882a593SmuzhiyunA static reservation needs to manipulate the log grant counters - we can take a
530*4882a593Smuzhiyunpermanent reservation on the space, but we still need to make sure we refresh
531*4882a593Smuzhiyunthe write reservation (the actual space available to the transaction) after
532*4882a593Smuzhiyunevery checkpoint transaction completion. Unfortunately, if this space is not
533*4882a593Smuzhiyunavailable when required, then the regrant code will sleep waiting for it.
534*4882a593Smuzhiyun
535*4882a593SmuzhiyunThe problem with this is that it can lead to deadlocks as we may need to commit
536*4882a593Smuzhiyuncheckpoints to be able to free up log space (refer back to the description of
537*4882a593Smuzhiyunrolling transactions for an example of this).  Hence we *must* always have
538*4882a593Smuzhiyunspace available in the log if we are to use static reservations, and that is
539*4882a593Smuzhiyunvery difficult and complex to arrange. It is possible to do, but there is a
540*4882a593Smuzhiyunsimpler way.
541*4882a593Smuzhiyun
542*4882a593SmuzhiyunThe simpler way of doing this is tracking the entire log space used by the
543*4882a593Smuzhiyunitems in the CIL and using this to dynamically calculate the amount of log
544*4882a593Smuzhiyunspace required by the log metadata. If this log metadata space changes as a
545*4882a593Smuzhiyunresult of a transaction commit inserting a new memory buffer into the CIL, then
546*4882a593Smuzhiyunthe difference in space required is removed from the transaction that causes
547*4882a593Smuzhiyunthe change. Transactions at this level will *always* have enough space
548*4882a593Smuzhiyunavailable in their reservation for this as they have already reserved the
549*4882a593Smuzhiyunmaximal amount of log metadata space they require, and such a delta reservation
550*4882a593Smuzhiyunwill always be less than or equal to the maximal amount in the reservation.
551*4882a593Smuzhiyun
552*4882a593SmuzhiyunHence we can grow the checkpoint transaction reservation dynamically as items
553*4882a593Smuzhiyunare added to the CIL and avoid the need for reserving and regranting log space
554*4882a593Smuzhiyunup front. This avoids deadlocks and removes a blocking point from the
555*4882a593Smuzhiyuncheckpoint flush code.
556*4882a593Smuzhiyun
557*4882a593SmuzhiyunAs mentioned early, transactions can't grow to more than half the size of the
558*4882a593Smuzhiyunlog. Hence as part of the reservation growing, we need to also check the size
559*4882a593Smuzhiyunof the reservation against the maximum allowed transaction size. If we reach
560*4882a593Smuzhiyunthe maximum threshold, we need to push the CIL to the log. This is effectively
561*4882a593Smuzhiyuna "background flush" and is done on demand. This is identical to
562*4882a593Smuzhiyuna CIL push triggered by a log force, only that there is no waiting for the
563*4882a593Smuzhiyuncheckpoint commit to complete. This background push is checked and executed by
564*4882a593Smuzhiyuntransaction commit code.
565*4882a593Smuzhiyun
566*4882a593SmuzhiyunIf the transaction subsystem goes idle while we still have items in the CIL,
567*4882a593Smuzhiyunthey will be flushed by the periodic log force issued by the xfssyncd. This log
568*4882a593Smuzhiyunforce will push the CIL to disk, and if the transaction subsystem stays idle,
569*4882a593Smuzhiyunallow the idle log to be covered (effectively marked clean) in exactly the same
570*4882a593Smuzhiyunmanner that is done for the existing logging method. A discussion point is
571*4882a593Smuzhiyunwhether this log force needs to be done more frequently than the current rate
572*4882a593Smuzhiyunwhich is once every 30s.
573*4882a593Smuzhiyun
574*4882a593Smuzhiyun
575*4882a593SmuzhiyunDelayed Logging: Log Item Pinning
576*4882a593Smuzhiyun---------------------------------
577*4882a593Smuzhiyun
578*4882a593SmuzhiyunCurrently log items are pinned during transaction commit while the items are
579*4882a593Smuzhiyunstill locked. This happens just after the items are formatted, though it could
580*4882a593Smuzhiyunbe done any time before the items are unlocked. The result of this mechanism is
581*4882a593Smuzhiyunthat items get pinned once for every transaction that is committed to the log
582*4882a593Smuzhiyunbuffers. Hence items that are relogged in the log buffers will have a pin count
583*4882a593Smuzhiyunfor every outstanding transaction they were dirtied in. When each of these
584*4882a593Smuzhiyuntransactions is completed, they will unpin the item once. As a result, the item
585*4882a593Smuzhiyunonly becomes unpinned when all the transactions complete and there are no
586*4882a593Smuzhiyunpending transactions. Thus the pinning and unpinning of a log item is symmetric
587*4882a593Smuzhiyunas there is a 1:1 relationship with transaction commit and log item completion.
588*4882a593Smuzhiyun
589*4882a593SmuzhiyunFor delayed logging, however, we have an asymmetric transaction commit to
590*4882a593Smuzhiyuncompletion relationship. Every time an object is relogged in the CIL it goes
591*4882a593Smuzhiyunthrough the commit process without a corresponding completion being registered.
592*4882a593SmuzhiyunThat is, we now have a many-to-one relationship between transaction commit and
593*4882a593Smuzhiyunlog item completion. The result of this is that pinning and unpinning of the
594*4882a593Smuzhiyunlog items becomes unbalanced if we retain the "pin on transaction commit, unpin
595*4882a593Smuzhiyunon transaction completion" model.
596*4882a593Smuzhiyun
597*4882a593SmuzhiyunTo keep pin/unpin symmetry, the algorithm needs to change to a "pin on
598*4882a593Smuzhiyuninsertion into the CIL, unpin on checkpoint completion". In other words, the
599*4882a593Smuzhiyunpinning and unpinning becomes symmetric around a checkpoint context. We have to
600*4882a593Smuzhiyunpin the object the first time it is inserted into the CIL - if it is already in
601*4882a593Smuzhiyunthe CIL during a transaction commit, then we do not pin it again. Because there
602*4882a593Smuzhiyuncan be multiple outstanding checkpoint contexts, we can still see elevated pin
603*4882a593Smuzhiyuncounts, but as each checkpoint completes the pin count will retain the correct
604*4882a593Smuzhiyunvalue according to it's context.
605*4882a593Smuzhiyun
606*4882a593SmuzhiyunJust to make matters more slightly more complex, this checkpoint level context
607*4882a593Smuzhiyunfor the pin count means that the pinning of an item must take place under the
608*4882a593SmuzhiyunCIL commit/flush lock. If we pin the object outside this lock, we cannot
609*4882a593Smuzhiyunguarantee which context the pin count is associated with. This is because of
610*4882a593Smuzhiyunthe fact pinning the item is dependent on whether the item is present in the
611*4882a593Smuzhiyuncurrent CIL or not. If we don't pin the CIL first before we check and pin the
612*4882a593Smuzhiyunobject, we have a race with CIL being flushed between the check and the pin
613*4882a593Smuzhiyun(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
614*4882a593Smuzhiyunlock to guarantee that we pin the items correctly.
615*4882a593Smuzhiyun
616*4882a593SmuzhiyunDelayed Logging: Concurrent Scalability
617*4882a593Smuzhiyun---------------------------------------
618*4882a593Smuzhiyun
619*4882a593SmuzhiyunA fundamental requirement for the CIL is that accesses through transaction
620*4882a593Smuzhiyuncommits must scale to many concurrent commits. The current transaction commit
621*4882a593Smuzhiyuncode does not break down even when there are transactions coming from 2048
622*4882a593Smuzhiyunprocessors at once. The current transaction code does not go any faster than if
623*4882a593Smuzhiyunthere was only one CPU using it, but it does not slow down either.
624*4882a593Smuzhiyun
625*4882a593SmuzhiyunAs a result, the delayed logging transaction commit code needs to be designed
626*4882a593Smuzhiyunfor concurrency from the ground up. It is obvious that there are serialisation
627*4882a593Smuzhiyunpoints in the design - the three important ones are:
628*4882a593Smuzhiyun
629*4882a593Smuzhiyun	1. Locking out new transaction commits while flushing the CIL
630*4882a593Smuzhiyun	2. Adding items to the CIL and updating item space accounting
631*4882a593Smuzhiyun	3. Checkpoint commit ordering
632*4882a593Smuzhiyun
633*4882a593SmuzhiyunLooking at the transaction commit and CIL flushing interactions, it is clear
634*4882a593Smuzhiyunthat we have a many-to-one interaction here. That is, the only restriction on
635*4882a593Smuzhiyunthe number of concurrent transactions that can be trying to commit at once is
636*4882a593Smuzhiyunthe amount of space available in the log for their reservations. The practical
637*4882a593Smuzhiyunlimit here is in the order of several hundred concurrent transactions for a
638*4882a593Smuzhiyun128MB log, which means that it is generally one per CPU in a machine.
639*4882a593Smuzhiyun
640*4882a593SmuzhiyunThe amount of time a transaction commit needs to hold out a flush is a
641*4882a593Smuzhiyunrelatively long period of time - the pinning of log items needs to be done
642*4882a593Smuzhiyunwhile we are holding out a CIL flush, so at the moment that means it is held
643*4882a593Smuzhiyunacross the formatting of the objects into memory buffers (i.e. while memcpy()s
644*4882a593Smuzhiyunare in progress). Ultimately a two pass algorithm where the formatting is done
645*4882a593Smuzhiyunseparately to the pinning of objects could be used to reduce the hold time of
646*4882a593Smuzhiyunthe transaction commit side.
647*4882a593Smuzhiyun
648*4882a593SmuzhiyunBecause of the number of potential transaction commit side holders, the lock
649*4882a593Smuzhiyunreally needs to be a sleeping lock - if the CIL flush takes the lock, we do not
650*4882a593Smuzhiyunwant every other CPU in the machine spinning on the CIL lock. Given that
651*4882a593Smuzhiyunflushing the CIL could involve walking a list of tens of thousands of log
652*4882a593Smuzhiyunitems, it will get held for a significant time and so spin contention is a
653*4882a593Smuzhiyunsignificant concern. Preventing lots of CPUs spinning doing nothing is the
654*4882a593Smuzhiyunmain reason for choosing a sleeping lock even though nothing in either the
655*4882a593Smuzhiyuntransaction commit or CIL flush side sleeps with the lock held.
656*4882a593Smuzhiyun
657*4882a593SmuzhiyunIt should also be noted that CIL flushing is also a relatively rare operation
658*4882a593Smuzhiyuncompared to transaction commit for asynchronous transaction workloads - only
659*4882a593Smuzhiyuntime will tell if using a read-write semaphore for exclusion will limit
660*4882a593Smuzhiyuntransaction commit concurrency due to cache line bouncing of the lock on the
661*4882a593Smuzhiyunread side.
662*4882a593Smuzhiyun
663*4882a593SmuzhiyunThe second serialisation point is on the transaction commit side where items
664*4882a593Smuzhiyunare inserted into the CIL. Because transactions can enter this code
665*4882a593Smuzhiyunconcurrently, the CIL needs to be protected separately from the above
666*4882a593Smuzhiyuncommit/flush exclusion. It also needs to be an exclusive lock but it is only
667*4882a593Smuzhiyunheld for a very short time and so a spin lock is appropriate here. It is
668*4882a593Smuzhiyunpossible that this lock will become a contention point, but given the short
669*4882a593Smuzhiyunhold time once per transaction I think that contention is unlikely.
670*4882a593Smuzhiyun
671*4882a593SmuzhiyunThe final serialisation point is the checkpoint commit record ordering code
672*4882a593Smuzhiyunthat is run as part of the checkpoint commit and log force sequencing. The code
673*4882a593Smuzhiyunpath that triggers a CIL flush (i.e. whatever triggers the log force) will enter
674*4882a593Smuzhiyunan ordering loop after writing all the log vectors into the log buffers but
675*4882a593Smuzhiyunbefore writing the commit record. This loop walks the list of committing
676*4882a593Smuzhiyuncheckpoints and needs to block waiting for checkpoints to complete their commit
677*4882a593Smuzhiyunrecord write. As a result it needs a lock and a wait variable. Log force
678*4882a593Smuzhiyunsequencing also requires the same lock, list walk, and blocking mechanism to
679*4882a593Smuzhiyunensure completion of checkpoints.
680*4882a593Smuzhiyun
681*4882a593SmuzhiyunThese two sequencing operations can use the mechanism even though the
682*4882a593Smuzhiyunevents they are waiting for are different. The checkpoint commit record
683*4882a593Smuzhiyunsequencing needs to wait until checkpoint contexts contain a commit LSN
684*4882a593Smuzhiyun(obtained through completion of a commit record write) while log force
685*4882a593Smuzhiyunsequencing needs to wait until previous checkpoint contexts are removed from
686*4882a593Smuzhiyunthe committing list (i.e. they've completed). A simple wait variable and
687*4882a593Smuzhiyunbroadcast wakeups (thundering herds) has been used to implement these two
688*4882a593Smuzhiyunserialisation queues. They use the same lock as the CIL, too. If we see too
689*4882a593Smuzhiyunmuch contention on the CIL lock, or too many context switches as a result of
690*4882a593Smuzhiyunthe broadcast wakeups these operations can be put under a new spinlock and
691*4882a593Smuzhiyungiven separate wait lists to reduce lock contention and the number of processes
692*4882a593Smuzhiyunwoken by the wrong event.
693*4882a593Smuzhiyun
694*4882a593Smuzhiyun
695*4882a593SmuzhiyunLifecycle Changes
696*4882a593Smuzhiyun-----------------
697*4882a593Smuzhiyun
698*4882a593SmuzhiyunThe existing log item life cycle is as follows::
699*4882a593Smuzhiyun
700*4882a593Smuzhiyun	1. Transaction allocate
701*4882a593Smuzhiyun	2. Transaction reserve
702*4882a593Smuzhiyun	3. Lock item
703*4882a593Smuzhiyun	4. Join item to transaction
704*4882a593Smuzhiyun		If not already attached,
705*4882a593Smuzhiyun			Allocate log item
706*4882a593Smuzhiyun			Attach log item to owner item
707*4882a593Smuzhiyun		Attach log item to transaction
708*4882a593Smuzhiyun	5. Modify item
709*4882a593Smuzhiyun		Record modifications in log item
710*4882a593Smuzhiyun	6. Transaction commit
711*4882a593Smuzhiyun		Pin item in memory
712*4882a593Smuzhiyun		Format item into log buffer
713*4882a593Smuzhiyun		Write commit LSN into transaction
714*4882a593Smuzhiyun		Unlock item
715*4882a593Smuzhiyun		Attach transaction to log buffer
716*4882a593Smuzhiyun
717*4882a593Smuzhiyun	<log buffer IO dispatched>
718*4882a593Smuzhiyun	<log buffer IO completes>
719*4882a593Smuzhiyun
720*4882a593Smuzhiyun	7. Transaction completion
721*4882a593Smuzhiyun		Mark log item committed
722*4882a593Smuzhiyun		Insert log item into AIL
723*4882a593Smuzhiyun			Write commit LSN into log item
724*4882a593Smuzhiyun		Unpin log item
725*4882a593Smuzhiyun	8. AIL traversal
726*4882a593Smuzhiyun		Lock item
727*4882a593Smuzhiyun		Mark log item clean
728*4882a593Smuzhiyun		Flush item to disk
729*4882a593Smuzhiyun
730*4882a593Smuzhiyun	<item IO completion>
731*4882a593Smuzhiyun
732*4882a593Smuzhiyun	9. Log item removed from AIL
733*4882a593Smuzhiyun		Moves log tail
734*4882a593Smuzhiyun		Item unlocked
735*4882a593Smuzhiyun
736*4882a593SmuzhiyunEssentially, steps 1-6 operate independently from step 7, which is also
737*4882a593Smuzhiyunindependent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
738*4882a593Smuzhiyunat the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
739*4882a593Smuzhiyunat the same time. If the log item is in the AIL or between steps 6 and 7
740*4882a593Smuzhiyunand steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
741*4882a593Smuzhiyunare entered and completed is the object considered clean.
742*4882a593Smuzhiyun
743*4882a593SmuzhiyunWith delayed logging, there are new steps inserted into the life cycle::
744*4882a593Smuzhiyun
745*4882a593Smuzhiyun	1. Transaction allocate
746*4882a593Smuzhiyun	2. Transaction reserve
747*4882a593Smuzhiyun	3. Lock item
748*4882a593Smuzhiyun	4. Join item to transaction
749*4882a593Smuzhiyun		If not already attached,
750*4882a593Smuzhiyun			Allocate log item
751*4882a593Smuzhiyun			Attach log item to owner item
752*4882a593Smuzhiyun		Attach log item to transaction
753*4882a593Smuzhiyun	5. Modify item
754*4882a593Smuzhiyun		Record modifications in log item
755*4882a593Smuzhiyun	6. Transaction commit
756*4882a593Smuzhiyun		Pin item in memory if not pinned in CIL
757*4882a593Smuzhiyun		Format item into log vector + buffer
758*4882a593Smuzhiyun		Attach log vector and buffer to log item
759*4882a593Smuzhiyun		Insert log item into CIL
760*4882a593Smuzhiyun		Write CIL context sequence into transaction
761*4882a593Smuzhiyun		Unlock item
762*4882a593Smuzhiyun
763*4882a593Smuzhiyun	<next log force>
764*4882a593Smuzhiyun
765*4882a593Smuzhiyun	7. CIL push
766*4882a593Smuzhiyun		lock CIL flush
767*4882a593Smuzhiyun		Chain log vectors and buffers together
768*4882a593Smuzhiyun		Remove items from CIL
769*4882a593Smuzhiyun		unlock CIL flush
770*4882a593Smuzhiyun		write log vectors into log
771*4882a593Smuzhiyun		sequence commit records
772*4882a593Smuzhiyun		attach checkpoint context to log buffer
773*4882a593Smuzhiyun
774*4882a593Smuzhiyun	<log buffer IO dispatched>
775*4882a593Smuzhiyun	<log buffer IO completes>
776*4882a593Smuzhiyun
777*4882a593Smuzhiyun	8. Checkpoint completion
778*4882a593Smuzhiyun		Mark log item committed
779*4882a593Smuzhiyun		Insert item into AIL
780*4882a593Smuzhiyun			Write commit LSN into log item
781*4882a593Smuzhiyun		Unpin log item
782*4882a593Smuzhiyun	9. AIL traversal
783*4882a593Smuzhiyun		Lock item
784*4882a593Smuzhiyun		Mark log item clean
785*4882a593Smuzhiyun		Flush item to disk
786*4882a593Smuzhiyun	<item IO completion>
787*4882a593Smuzhiyun	10. Log item removed from AIL
788*4882a593Smuzhiyun		Moves log tail
789*4882a593Smuzhiyun		Item unlocked
790*4882a593Smuzhiyun
791*4882a593SmuzhiyunFrom this, it can be seen that the only life cycle differences between the two
792*4882a593Smuzhiyunlogging methods are in the middle of the life cycle - they still have the same
793*4882a593Smuzhiyunbeginning and end and execution constraints. The only differences are in the
794*4882a593Smuzhiyuncommitting of the log items to the log itself and the completion processing.
795*4882a593SmuzhiyunHence delayed logging should not introduce any constraints on log item
796*4882a593Smuzhiyunbehaviour, allocation or freeing that don't already exist.
797*4882a593Smuzhiyun
798*4882a593SmuzhiyunAs a result of this zero-impact "insertion" of delayed logging infrastructure
799*4882a593Smuzhiyunand the design of the internal structures to avoid on disk format changes, we
800*4882a593Smuzhiyuncan basically switch between delayed logging and the existing mechanism with a
801*4882a593Smuzhiyunmount option. Fundamentally, there is no reason why the log manager would not
802*4882a593Smuzhiyunbe able to swap methods automatically and transparently depending on load
803*4882a593Smuzhiyuncharacteristics, but this should not be necessary if delayed logging works as
804*4882a593Smuzhiyundesigned.
805