xref: /OK3568_Linux_fs/kernel/Documentation/trace/ring-buffer-design.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only
2*4882a593Smuzhiyun
3*4882a593Smuzhiyun===========================
4*4882a593SmuzhiyunLockless Ring Buffer Design
5*4882a593Smuzhiyun===========================
6*4882a593Smuzhiyun
7*4882a593SmuzhiyunCopyright 2009 Red Hat Inc.
8*4882a593Smuzhiyun
9*4882a593Smuzhiyun:Author:   Steven Rostedt <srostedt@redhat.com>
10*4882a593Smuzhiyun:License:  The GNU Free Documentation License, Version 1.2
11*4882a593Smuzhiyun           (dual licensed under the GPL v2)
12*4882a593Smuzhiyun:Reviewers:  Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
13*4882a593Smuzhiyun	     and Frederic Weisbecker.
14*4882a593Smuzhiyun
15*4882a593Smuzhiyun
16*4882a593SmuzhiyunWritten for: 2.6.31
17*4882a593Smuzhiyun
18*4882a593SmuzhiyunTerminology used in this Document
19*4882a593Smuzhiyun---------------------------------
20*4882a593Smuzhiyun
21*4882a593Smuzhiyuntail
22*4882a593Smuzhiyun	- where new writes happen in the ring buffer.
23*4882a593Smuzhiyun
24*4882a593Smuzhiyunhead
25*4882a593Smuzhiyun	- where new reads happen in the ring buffer.
26*4882a593Smuzhiyun
27*4882a593Smuzhiyunproducer
28*4882a593Smuzhiyun	- the task that writes into the ring buffer (same as writer)
29*4882a593Smuzhiyun
30*4882a593Smuzhiyunwriter
31*4882a593Smuzhiyun	- same as producer
32*4882a593Smuzhiyun
33*4882a593Smuzhiyunconsumer
34*4882a593Smuzhiyun	- the task that reads from the buffer (same as reader)
35*4882a593Smuzhiyun
36*4882a593Smuzhiyunreader
37*4882a593Smuzhiyun	- same as consumer.
38*4882a593Smuzhiyun
39*4882a593Smuzhiyunreader_page
40*4882a593Smuzhiyun	- A page outside the ring buffer used solely (for the most part)
41*4882a593Smuzhiyun	  by the reader.
42*4882a593Smuzhiyun
43*4882a593Smuzhiyunhead_page
44*4882a593Smuzhiyun	- a pointer to the page that the reader will use next
45*4882a593Smuzhiyun
46*4882a593Smuzhiyuntail_page
47*4882a593Smuzhiyun	- a pointer to the page that will be written to next
48*4882a593Smuzhiyun
49*4882a593Smuzhiyuncommit_page
50*4882a593Smuzhiyun	- a pointer to the page with the last finished non-nested write.
51*4882a593Smuzhiyun
52*4882a593Smuzhiyuncmpxchg
53*4882a593Smuzhiyun	- hardware-assisted atomic transaction that performs the following::
54*4882a593Smuzhiyun
55*4882a593Smuzhiyun	    A = B if previous A == C
56*4882a593Smuzhiyun
57*4882a593Smuzhiyun	    R = cmpxchg(A, C, B) is saying that we replace A with B if and only
58*4882a593Smuzhiyun		if current A is equal to C, and we put the old (current)
59*4882a593Smuzhiyun		A into R
60*4882a593Smuzhiyun
61*4882a593Smuzhiyun	    R gets the previous A regardless if A is updated with B or not.
62*4882a593Smuzhiyun
63*4882a593Smuzhiyun	  To see if the update was successful a compare of ``R == C``
64*4882a593Smuzhiyun	  may be used.
65*4882a593Smuzhiyun
66*4882a593SmuzhiyunThe Generic Ring Buffer
67*4882a593Smuzhiyun-----------------------
68*4882a593Smuzhiyun
69*4882a593SmuzhiyunThe ring buffer can be used in either an overwrite mode or in
70*4882a593Smuzhiyunproducer/consumer mode.
71*4882a593Smuzhiyun
72*4882a593SmuzhiyunProducer/consumer mode is where if the producer were to fill up the
73*4882a593Smuzhiyunbuffer before the consumer could free up anything, the producer
74*4882a593Smuzhiyunwill stop writing to the buffer. This will lose most recent events.
75*4882a593Smuzhiyun
76*4882a593SmuzhiyunOverwrite mode is where if the producer were to fill up the buffer
77*4882a593Smuzhiyunbefore the consumer could free up anything, the producer will
78*4882a593Smuzhiyunoverwrite the older data. This will lose the oldest events.
79*4882a593Smuzhiyun
80*4882a593SmuzhiyunNo two writers can write at the same time (on the same per-cpu buffer),
81*4882a593Smuzhiyunbut a writer may interrupt another writer, but it must finish writing
82*4882a593Smuzhiyunbefore the previous writer may continue. This is very important to the
83*4882a593Smuzhiyunalgorithm. The writers act like a "stack". The way interrupts works
84*4882a593Smuzhiyunenforces this behavior::
85*4882a593Smuzhiyun
86*4882a593Smuzhiyun
87*4882a593Smuzhiyun  writer1 start
88*4882a593Smuzhiyun     <preempted> writer2 start
89*4882a593Smuzhiyun         <preempted> writer3 start
90*4882a593Smuzhiyun                     writer3 finishes
91*4882a593Smuzhiyun                 writer2 finishes
92*4882a593Smuzhiyun  writer1 finishes
93*4882a593Smuzhiyun
94*4882a593SmuzhiyunThis is very much like a writer being preempted by an interrupt and
95*4882a593Smuzhiyunthe interrupt doing a write as well.
96*4882a593Smuzhiyun
97*4882a593SmuzhiyunReaders can happen at any time. But no two readers may run at the
98*4882a593Smuzhiyunsame time, nor can a reader preempt/interrupt another reader. A reader
99*4882a593Smuzhiyuncannot preempt/interrupt a writer, but it may read/consume from the
100*4882a593Smuzhiyunbuffer at the same time as a writer is writing, but the reader must be
101*4882a593Smuzhiyunon another processor to do so. A reader may read on its own processor
102*4882a593Smuzhiyunand can be preempted by a writer.
103*4882a593Smuzhiyun
104*4882a593SmuzhiyunA writer can preempt a reader, but a reader cannot preempt a writer.
105*4882a593SmuzhiyunBut a reader can read the buffer at the same time (on another processor)
106*4882a593Smuzhiyunas a writer.
107*4882a593Smuzhiyun
108*4882a593SmuzhiyunThe ring buffer is made up of a list of pages held together by a linked list.
109*4882a593Smuzhiyun
110*4882a593SmuzhiyunAt initialization a reader page is allocated for the reader that is not
111*4882a593Smuzhiyunpart of the ring buffer.
112*4882a593Smuzhiyun
113*4882a593SmuzhiyunThe head_page, tail_page and commit_page are all initialized to point
114*4882a593Smuzhiyunto the same page.
115*4882a593Smuzhiyun
116*4882a593SmuzhiyunThe reader page is initialized to have its next pointer pointing to
117*4882a593Smuzhiyunthe head page, and its previous pointer pointing to a page before
118*4882a593Smuzhiyunthe head page.
119*4882a593Smuzhiyun
120*4882a593SmuzhiyunThe reader has its own page to use. At start up time, this page is
121*4882a593Smuzhiyunallocated but is not attached to the list. When the reader wants
122*4882a593Smuzhiyunto read from the buffer, if its page is empty (like it is on start-up),
123*4882a593Smuzhiyunit will swap its page with the head_page. The old reader page will
124*4882a593Smuzhiyunbecome part of the ring buffer and the head_page will be removed.
125*4882a593SmuzhiyunThe page after the inserted page (old reader_page) will become the
126*4882a593Smuzhiyunnew head page.
127*4882a593Smuzhiyun
128*4882a593SmuzhiyunOnce the new page is given to the reader, the reader could do what
129*4882a593Smuzhiyunit wants with it, as long as a writer has left that page.
130*4882a593Smuzhiyun
131*4882a593SmuzhiyunA sample of how the reader page is swapped: Note this does not
132*4882a593Smuzhiyunshow the head page in the buffer, it is for demonstrating a swap
133*4882a593Smuzhiyunonly.
134*4882a593Smuzhiyun
135*4882a593Smuzhiyun::
136*4882a593Smuzhiyun
137*4882a593Smuzhiyun  +------+
138*4882a593Smuzhiyun  |reader|          RING BUFFER
139*4882a593Smuzhiyun  |page  |
140*4882a593Smuzhiyun  +------+
141*4882a593Smuzhiyun                  +---+   +---+   +---+
142*4882a593Smuzhiyun                  |   |-->|   |-->|   |
143*4882a593Smuzhiyun                  |   |<--|   |<--|   |
144*4882a593Smuzhiyun                  +---+   +---+   +---+
145*4882a593Smuzhiyun                   ^ |             ^ |
146*4882a593Smuzhiyun                   | +-------------+ |
147*4882a593Smuzhiyun                   +-----------------+
148*4882a593Smuzhiyun
149*4882a593Smuzhiyun
150*4882a593Smuzhiyun  +------+
151*4882a593Smuzhiyun  |reader|          RING BUFFER
152*4882a593Smuzhiyun  |page  |-------------------+
153*4882a593Smuzhiyun  +------+                   v
154*4882a593Smuzhiyun    |             +---+   +---+   +---+
155*4882a593Smuzhiyun    |             |   |-->|   |-->|   |
156*4882a593Smuzhiyun    |             |   |<--|   |<--|   |<-+
157*4882a593Smuzhiyun    |             +---+   +---+   +---+  |
158*4882a593Smuzhiyun    |              ^ |             ^ |   |
159*4882a593Smuzhiyun    |              | +-------------+ |   |
160*4882a593Smuzhiyun    |              +-----------------+   |
161*4882a593Smuzhiyun    +------------------------------------+
162*4882a593Smuzhiyun
163*4882a593Smuzhiyun  +------+
164*4882a593Smuzhiyun  |reader|          RING BUFFER
165*4882a593Smuzhiyun  |page  |-------------------+
166*4882a593Smuzhiyun  +------+ <---------------+ v
167*4882a593Smuzhiyun    |  ^          +---+   +---+   +---+
168*4882a593Smuzhiyun    |  |          |   |-->|   |-->|   |
169*4882a593Smuzhiyun    |  |          |   |   |   |<--|   |<-+
170*4882a593Smuzhiyun    |  |          +---+   +---+   +---+  |
171*4882a593Smuzhiyun    |  |             |             ^ |   |
172*4882a593Smuzhiyun    |  |             +-------------+ |   |
173*4882a593Smuzhiyun    |  +-----------------------------+   |
174*4882a593Smuzhiyun    +------------------------------------+
175*4882a593Smuzhiyun
176*4882a593Smuzhiyun  +------+
177*4882a593Smuzhiyun  |buffer|          RING BUFFER
178*4882a593Smuzhiyun  |page  |-------------------+
179*4882a593Smuzhiyun  +------+ <---------------+ v
180*4882a593Smuzhiyun    |  ^          +---+   +---+   +---+
181*4882a593Smuzhiyun    |  |          |   |   |   |-->|   |
182*4882a593Smuzhiyun    |  |  New     |   |   |   |<--|   |<-+
183*4882a593Smuzhiyun    |  | Reader   +---+   +---+   +---+  |
184*4882a593Smuzhiyun    |  |  page ----^                 |   |
185*4882a593Smuzhiyun    |  |                             |   |
186*4882a593Smuzhiyun    |  +-----------------------------+   |
187*4882a593Smuzhiyun    +------------------------------------+
188*4882a593Smuzhiyun
189*4882a593Smuzhiyun
190*4882a593Smuzhiyun
191*4882a593SmuzhiyunIt is possible that the page swapped is the commit page and the tail page,
192*4882a593Smuzhiyunif what is in the ring buffer is less than what is held in a buffer page.
193*4882a593Smuzhiyun
194*4882a593Smuzhiyun::
195*4882a593Smuzhiyun
196*4882a593Smuzhiyun            reader page    commit page   tail page
197*4882a593Smuzhiyun                |              |             |
198*4882a593Smuzhiyun                v              |             |
199*4882a593Smuzhiyun               +---+           |             |
200*4882a593Smuzhiyun               |   |<----------+             |
201*4882a593Smuzhiyun               |   |<------------------------+
202*4882a593Smuzhiyun               |   |------+
203*4882a593Smuzhiyun               +---+      |
204*4882a593Smuzhiyun                          |
205*4882a593Smuzhiyun                          v
206*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
207*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
208*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
209*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
210*4882a593Smuzhiyun
211*4882a593SmuzhiyunThis case is still valid for this algorithm.
212*4882a593SmuzhiyunWhen the writer leaves the page, it simply goes into the ring buffer
213*4882a593Smuzhiyunsince the reader page still points to the next location in the ring
214*4882a593Smuzhiyunbuffer.
215*4882a593Smuzhiyun
216*4882a593Smuzhiyun
217*4882a593SmuzhiyunThe main pointers:
218*4882a593Smuzhiyun
219*4882a593Smuzhiyun  reader page
220*4882a593Smuzhiyun	    - The page used solely by the reader and is not part
221*4882a593Smuzhiyun              of the ring buffer (may be swapped in)
222*4882a593Smuzhiyun
223*4882a593Smuzhiyun  head page
224*4882a593Smuzhiyun	    - the next page in the ring buffer that will be swapped
225*4882a593Smuzhiyun              with the reader page.
226*4882a593Smuzhiyun
227*4882a593Smuzhiyun  tail page
228*4882a593Smuzhiyun	    - the page where the next write will take place.
229*4882a593Smuzhiyun
230*4882a593Smuzhiyun  commit page
231*4882a593Smuzhiyun	    - the page that last finished a write.
232*4882a593Smuzhiyun
233*4882a593SmuzhiyunThe commit page only is updated by the outermost writer in the
234*4882a593Smuzhiyunwriter stack. A writer that preempts another writer will not move the
235*4882a593Smuzhiyuncommit page.
236*4882a593Smuzhiyun
237*4882a593SmuzhiyunWhen data is written into the ring buffer, a position is reserved
238*4882a593Smuzhiyunin the ring buffer and passed back to the writer. When the writer
239*4882a593Smuzhiyunis finished writing data into that position, it commits the write.
240*4882a593Smuzhiyun
241*4882a593SmuzhiyunAnother write (or a read) may take place at anytime during this
242*4882a593Smuzhiyuntransaction. If another write happens it must finish before continuing
243*4882a593Smuzhiyunwith the previous write.
244*4882a593Smuzhiyun
245*4882a593Smuzhiyun
246*4882a593Smuzhiyun   Write reserve::
247*4882a593Smuzhiyun
248*4882a593Smuzhiyun       Buffer page
249*4882a593Smuzhiyun      +---------+
250*4882a593Smuzhiyun      |written  |
251*4882a593Smuzhiyun      +---------+  <--- given back to writer (current commit)
252*4882a593Smuzhiyun      |reserved |
253*4882a593Smuzhiyun      +---------+ <--- tail pointer
254*4882a593Smuzhiyun      | empty   |
255*4882a593Smuzhiyun      +---------+
256*4882a593Smuzhiyun
257*4882a593Smuzhiyun   Write commit::
258*4882a593Smuzhiyun
259*4882a593Smuzhiyun       Buffer page
260*4882a593Smuzhiyun      +---------+
261*4882a593Smuzhiyun      |written  |
262*4882a593Smuzhiyun      +---------+
263*4882a593Smuzhiyun      |written  |
264*4882a593Smuzhiyun      +---------+  <--- next position for write (current commit)
265*4882a593Smuzhiyun      | empty   |
266*4882a593Smuzhiyun      +---------+
267*4882a593Smuzhiyun
268*4882a593Smuzhiyun
269*4882a593Smuzhiyun If a write happens after the first reserve::
270*4882a593Smuzhiyun
271*4882a593Smuzhiyun       Buffer page
272*4882a593Smuzhiyun      +---------+
273*4882a593Smuzhiyun      |written  |
274*4882a593Smuzhiyun      +---------+  <-- current commit
275*4882a593Smuzhiyun      |reserved |
276*4882a593Smuzhiyun      +---------+  <--- given back to second writer
277*4882a593Smuzhiyun      |reserved |
278*4882a593Smuzhiyun      +---------+ <--- tail pointer
279*4882a593Smuzhiyun
280*4882a593Smuzhiyun  After second writer commits::
281*4882a593Smuzhiyun
282*4882a593Smuzhiyun
283*4882a593Smuzhiyun       Buffer page
284*4882a593Smuzhiyun      +---------+
285*4882a593Smuzhiyun      |written  |
286*4882a593Smuzhiyun      +---------+  <--(last full commit)
287*4882a593Smuzhiyun      |reserved |
288*4882a593Smuzhiyun      +---------+
289*4882a593Smuzhiyun      |pending  |
290*4882a593Smuzhiyun      |commit   |
291*4882a593Smuzhiyun      +---------+ <--- tail pointer
292*4882a593Smuzhiyun
293*4882a593Smuzhiyun  When the first writer commits::
294*4882a593Smuzhiyun
295*4882a593Smuzhiyun       Buffer page
296*4882a593Smuzhiyun      +---------+
297*4882a593Smuzhiyun      |written  |
298*4882a593Smuzhiyun      +---------+
299*4882a593Smuzhiyun      |written  |
300*4882a593Smuzhiyun      +---------+
301*4882a593Smuzhiyun      |written  |
302*4882a593Smuzhiyun      +---------+  <--(last full commit and tail pointer)
303*4882a593Smuzhiyun
304*4882a593Smuzhiyun
305*4882a593SmuzhiyunThe commit pointer points to the last write location that was
306*4882a593Smuzhiyuncommitted without preempting another write. When a write that
307*4882a593Smuzhiyunpreempted another write is committed, it only becomes a pending commit
308*4882a593Smuzhiyunand will not be a full commit until all writes have been committed.
309*4882a593Smuzhiyun
310*4882a593SmuzhiyunThe commit page points to the page that has the last full commit.
311*4882a593SmuzhiyunThe tail page points to the page with the last write (before
312*4882a593Smuzhiyuncommitting).
313*4882a593Smuzhiyun
314*4882a593SmuzhiyunThe tail page is always equal to or after the commit page. It may
315*4882a593Smuzhiyunbe several pages ahead. If the tail page catches up to the commit
316*4882a593Smuzhiyunpage then no more writes may take place (regardless of the mode
317*4882a593Smuzhiyunof the ring buffer: overwrite and produce/consumer).
318*4882a593Smuzhiyun
319*4882a593SmuzhiyunThe order of pages is::
320*4882a593Smuzhiyun
321*4882a593Smuzhiyun head page
322*4882a593Smuzhiyun commit page
323*4882a593Smuzhiyun tail page
324*4882a593Smuzhiyun
325*4882a593SmuzhiyunPossible scenario::
326*4882a593Smuzhiyun
327*4882a593Smuzhiyun                               tail page
328*4882a593Smuzhiyun    head page         commit page  |
329*4882a593Smuzhiyun        |                 |        |
330*4882a593Smuzhiyun        v                 v        v
331*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
332*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
333*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
334*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
335*4882a593Smuzhiyun
336*4882a593SmuzhiyunThere is a special case that the head page is after either the commit page
337*4882a593Smuzhiyunand possibly the tail page. That is when the commit (and tail) page has been
338*4882a593Smuzhiyunswapped with the reader page. This is because the head page is always
339*4882a593Smuzhiyunpart of the ring buffer, but the reader page is not. Whenever there
340*4882a593Smuzhiyunhas been less than a full page that has been committed inside the ring buffer,
341*4882a593Smuzhiyunand a reader swaps out a page, it will be swapping out the commit page.
342*4882a593Smuzhiyun
343*4882a593Smuzhiyun::
344*4882a593Smuzhiyun
345*4882a593Smuzhiyun            reader page    commit page   tail page
346*4882a593Smuzhiyun                |              |             |
347*4882a593Smuzhiyun                v              |             |
348*4882a593Smuzhiyun               +---+           |             |
349*4882a593Smuzhiyun               |   |<----------+             |
350*4882a593Smuzhiyun               |   |<------------------------+
351*4882a593Smuzhiyun               |   |------+
352*4882a593Smuzhiyun               +---+      |
353*4882a593Smuzhiyun                          |
354*4882a593Smuzhiyun                          v
355*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
356*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
357*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
358*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
359*4882a593Smuzhiyun                          ^
360*4882a593Smuzhiyun                          |
361*4882a593Smuzhiyun                      head page
362*4882a593Smuzhiyun
363*4882a593Smuzhiyun
364*4882a593SmuzhiyunIn this case, the head page will not move when the tail and commit
365*4882a593Smuzhiyunmove back into the ring buffer.
366*4882a593Smuzhiyun
367*4882a593SmuzhiyunThe reader cannot swap a page into the ring buffer if the commit page
368*4882a593Smuzhiyunis still on that page. If the read meets the last commit (real commit
369*4882a593Smuzhiyunnot pending or reserved), then there is nothing more to read.
370*4882a593SmuzhiyunThe buffer is considered empty until another full commit finishes.
371*4882a593Smuzhiyun
372*4882a593SmuzhiyunWhen the tail meets the head page, if the buffer is in overwrite mode,
373*4882a593Smuzhiyunthe head page will be pushed ahead one. If the buffer is in producer/consumer
374*4882a593Smuzhiyunmode, the write will fail.
375*4882a593Smuzhiyun
376*4882a593SmuzhiyunOverwrite mode::
377*4882a593Smuzhiyun
378*4882a593Smuzhiyun              tail page
379*4882a593Smuzhiyun                 |
380*4882a593Smuzhiyun                 v
381*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
382*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
383*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
384*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
385*4882a593Smuzhiyun                          ^
386*4882a593Smuzhiyun                          |
387*4882a593Smuzhiyun                      head page
388*4882a593Smuzhiyun
389*4882a593Smuzhiyun
390*4882a593Smuzhiyun              tail page
391*4882a593Smuzhiyun                 |
392*4882a593Smuzhiyun                 v
393*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
394*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
395*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
396*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
397*4882a593Smuzhiyun                                   ^
398*4882a593Smuzhiyun                                   |
399*4882a593Smuzhiyun                               head page
400*4882a593Smuzhiyun
401*4882a593Smuzhiyun
402*4882a593Smuzhiyun                      tail page
403*4882a593Smuzhiyun                          |
404*4882a593Smuzhiyun                          v
405*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
406*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
407*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
408*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
409*4882a593Smuzhiyun                                   ^
410*4882a593Smuzhiyun                                   |
411*4882a593Smuzhiyun                               head page
412*4882a593Smuzhiyun
413*4882a593SmuzhiyunNote, the reader page will still point to the previous head page.
414*4882a593SmuzhiyunBut when a swap takes place, it will use the most recent head page.
415*4882a593Smuzhiyun
416*4882a593Smuzhiyun
417*4882a593SmuzhiyunMaking the Ring Buffer Lockless:
418*4882a593Smuzhiyun--------------------------------
419*4882a593Smuzhiyun
420*4882a593SmuzhiyunThe main idea behind the lockless algorithm is to combine the moving
421*4882a593Smuzhiyunof the head_page pointer with the swapping of pages with the reader.
422*4882a593SmuzhiyunState flags are placed inside the pointer to the page. To do this,
423*4882a593Smuzhiyuneach page must be aligned in memory by 4 bytes. This will allow the 2
424*4882a593Smuzhiyunleast significant bits of the address to be used as flags, since
425*4882a593Smuzhiyunthey will always be zero for the address. To get the address,
426*4882a593Smuzhiyunsimply mask out the flags::
427*4882a593Smuzhiyun
428*4882a593Smuzhiyun  MASK = ~3
429*4882a593Smuzhiyun
430*4882a593Smuzhiyun  address & MASK
431*4882a593Smuzhiyun
432*4882a593SmuzhiyunTwo flags will be kept by these two bits:
433*4882a593Smuzhiyun
434*4882a593Smuzhiyun   HEADER
435*4882a593Smuzhiyun	- the page being pointed to is a head page
436*4882a593Smuzhiyun
437*4882a593Smuzhiyun   UPDATE
438*4882a593Smuzhiyun	- the page being pointed to is being updated by a writer
439*4882a593Smuzhiyun          and was or is about to be a head page.
440*4882a593Smuzhiyun
441*4882a593Smuzhiyun::
442*4882a593Smuzhiyun
443*4882a593Smuzhiyun	      reader page
444*4882a593Smuzhiyun		  |
445*4882a593Smuzhiyun		  v
446*4882a593Smuzhiyun		+---+
447*4882a593Smuzhiyun		|   |------+
448*4882a593Smuzhiyun		+---+      |
449*4882a593Smuzhiyun			    |
450*4882a593Smuzhiyun			    v
451*4882a593Smuzhiyun	+---+    +---+    +---+    +---+
452*4882a593Smuzhiyun    <---|   |--->|   |-H->|   |--->|   |--->
453*4882a593Smuzhiyun    --->|   |<---|   |<---|   |<---|   |<---
454*4882a593Smuzhiyun	+---+    +---+    +---+    +---+
455*4882a593Smuzhiyun
456*4882a593Smuzhiyun
457*4882a593SmuzhiyunThe above pointer "-H->" would have the HEADER flag set. That is
458*4882a593Smuzhiyunthe next page is the next page to be swapped out by the reader.
459*4882a593SmuzhiyunThis pointer means the next page is the head page.
460*4882a593Smuzhiyun
461*4882a593SmuzhiyunWhen the tail page meets the head pointer, it will use cmpxchg to
462*4882a593Smuzhiyunchange the pointer to the UPDATE state::
463*4882a593Smuzhiyun
464*4882a593Smuzhiyun
465*4882a593Smuzhiyun              tail page
466*4882a593Smuzhiyun                 |
467*4882a593Smuzhiyun                 v
468*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
469*4882a593Smuzhiyun  <---|   |--->|   |-H->|   |--->|   |--->
470*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
471*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
472*4882a593Smuzhiyun
473*4882a593Smuzhiyun              tail page
474*4882a593Smuzhiyun                 |
475*4882a593Smuzhiyun                 v
476*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
477*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |--->
478*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
479*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
480*4882a593Smuzhiyun
481*4882a593Smuzhiyun"-U->" represents a pointer in the UPDATE state.
482*4882a593Smuzhiyun
483*4882a593SmuzhiyunAny access to the reader will need to take some sort of lock to serialize
484*4882a593Smuzhiyunthe readers. But the writers will never take a lock to write to the
485*4882a593Smuzhiyunring buffer. This means we only need to worry about a single reader,
486*4882a593Smuzhiyunand writes only preempt in "stack" formation.
487*4882a593Smuzhiyun
488*4882a593SmuzhiyunWhen the reader tries to swap the page with the ring buffer, it
489*4882a593Smuzhiyunwill also use cmpxchg. If the flag bit in the pointer to the
490*4882a593Smuzhiyunhead page does not have the HEADER flag set, the compare will fail
491*4882a593Smuzhiyunand the reader will need to look for the new head page and try again.
492*4882a593SmuzhiyunNote, the flags UPDATE and HEADER are never set at the same time.
493*4882a593Smuzhiyun
494*4882a593SmuzhiyunThe reader swaps the reader page as follows::
495*4882a593Smuzhiyun
496*4882a593Smuzhiyun  +------+
497*4882a593Smuzhiyun  |reader|          RING BUFFER
498*4882a593Smuzhiyun  |page  |
499*4882a593Smuzhiyun  +------+
500*4882a593Smuzhiyun                  +---+    +---+    +---+
501*4882a593Smuzhiyun                  |   |--->|   |--->|   |
502*4882a593Smuzhiyun                  |   |<---|   |<---|   |
503*4882a593Smuzhiyun                  +---+    +---+    +---+
504*4882a593Smuzhiyun                   ^ |               ^ |
505*4882a593Smuzhiyun                   | +---------------+ |
506*4882a593Smuzhiyun                   +-----H-------------+
507*4882a593Smuzhiyun
508*4882a593SmuzhiyunThe reader sets the reader page next pointer as HEADER to the page after
509*4882a593Smuzhiyunthe head page::
510*4882a593Smuzhiyun
511*4882a593Smuzhiyun
512*4882a593Smuzhiyun  +------+
513*4882a593Smuzhiyun  |reader|          RING BUFFER
514*4882a593Smuzhiyun  |page  |-------H-----------+
515*4882a593Smuzhiyun  +------+                   v
516*4882a593Smuzhiyun    |             +---+    +---+    +---+
517*4882a593Smuzhiyun    |             |   |--->|   |--->|   |
518*4882a593Smuzhiyun    |             |   |<---|   |<---|   |<-+
519*4882a593Smuzhiyun    |             +---+    +---+    +---+  |
520*4882a593Smuzhiyun    |              ^ |               ^ |   |
521*4882a593Smuzhiyun    |              | +---------------+ |   |
522*4882a593Smuzhiyun    |              +-----H-------------+   |
523*4882a593Smuzhiyun    +--------------------------------------+
524*4882a593Smuzhiyun
525*4882a593SmuzhiyunIt does a cmpxchg with the pointer to the previous head page to make it
526*4882a593Smuzhiyunpoint to the reader page. Note that the new pointer does not have the HEADER
527*4882a593Smuzhiyunflag set.  This action atomically moves the head page forward::
528*4882a593Smuzhiyun
529*4882a593Smuzhiyun  +------+
530*4882a593Smuzhiyun  |reader|          RING BUFFER
531*4882a593Smuzhiyun  |page  |-------H-----------+
532*4882a593Smuzhiyun  +------+                   v
533*4882a593Smuzhiyun    |  ^          +---+   +---+   +---+
534*4882a593Smuzhiyun    |  |          |   |-->|   |-->|   |
535*4882a593Smuzhiyun    |  |          |   |<--|   |<--|   |<-+
536*4882a593Smuzhiyun    |  |          +---+   +---+   +---+  |
537*4882a593Smuzhiyun    |  |             |             ^ |   |
538*4882a593Smuzhiyun    |  |             +-------------+ |   |
539*4882a593Smuzhiyun    |  +-----------------------------+   |
540*4882a593Smuzhiyun    +------------------------------------+
541*4882a593Smuzhiyun
542*4882a593SmuzhiyunAfter the new head page is set, the previous pointer of the head page is
543*4882a593Smuzhiyunupdated to the reader page::
544*4882a593Smuzhiyun
545*4882a593Smuzhiyun  +------+
546*4882a593Smuzhiyun  |reader|          RING BUFFER
547*4882a593Smuzhiyun  |page  |-------H-----------+
548*4882a593Smuzhiyun  +------+ <---------------+ v
549*4882a593Smuzhiyun    |  ^          +---+   +---+   +---+
550*4882a593Smuzhiyun    |  |          |   |-->|   |-->|   |
551*4882a593Smuzhiyun    |  |          |   |   |   |<--|   |<-+
552*4882a593Smuzhiyun    |  |          +---+   +---+   +---+  |
553*4882a593Smuzhiyun    |  |             |             ^ |   |
554*4882a593Smuzhiyun    |  |             +-------------+ |   |
555*4882a593Smuzhiyun    |  +-----------------------------+   |
556*4882a593Smuzhiyun    +------------------------------------+
557*4882a593Smuzhiyun
558*4882a593Smuzhiyun  +------+
559*4882a593Smuzhiyun  |buffer|          RING BUFFER
560*4882a593Smuzhiyun  |page  |-------H-----------+  <--- New head page
561*4882a593Smuzhiyun  +------+ <---------------+ v
562*4882a593Smuzhiyun    |  ^          +---+   +---+   +---+
563*4882a593Smuzhiyun    |  |          |   |   |   |-->|   |
564*4882a593Smuzhiyun    |  |  New     |   |   |   |<--|   |<-+
565*4882a593Smuzhiyun    |  | Reader   +---+   +---+   +---+  |
566*4882a593Smuzhiyun    |  |  page ----^                 |   |
567*4882a593Smuzhiyun    |  |                             |   |
568*4882a593Smuzhiyun    |  +-----------------------------+   |
569*4882a593Smuzhiyun    +------------------------------------+
570*4882a593Smuzhiyun
571*4882a593SmuzhiyunAnother important point: The page that the reader page points back to
572*4882a593Smuzhiyunby its previous pointer (the one that now points to the new head page)
573*4882a593Smuzhiyunnever points back to the reader page. That is because the reader page is
574*4882a593Smuzhiyunnot part of the ring buffer. Traversing the ring buffer via the next pointers
575*4882a593Smuzhiyunwill always stay in the ring buffer. Traversing the ring buffer via the
576*4882a593Smuzhiyunprev pointers may not.
577*4882a593Smuzhiyun
578*4882a593SmuzhiyunNote, the way to determine a reader page is simply by examining the previous
579*4882a593Smuzhiyunpointer of the page. If the next pointer of the previous page does not
580*4882a593Smuzhiyunpoint back to the original page, then the original page is a reader page::
581*4882a593Smuzhiyun
582*4882a593Smuzhiyun
583*4882a593Smuzhiyun             +--------+
584*4882a593Smuzhiyun             | reader |  next   +----+
585*4882a593Smuzhiyun             |  page  |-------->|    |<====== (buffer page)
586*4882a593Smuzhiyun             +--------+         +----+
587*4882a593Smuzhiyun                 |                | ^
588*4882a593Smuzhiyun                 |                v | next
589*4882a593Smuzhiyun            prev |              +----+
590*4882a593Smuzhiyun                 +------------->|    |
591*4882a593Smuzhiyun                                +----+
592*4882a593Smuzhiyun
593*4882a593SmuzhiyunThe way the head page moves forward:
594*4882a593Smuzhiyun
595*4882a593SmuzhiyunWhen the tail page meets the head page and the buffer is in overwrite mode
596*4882a593Smuzhiyunand more writes take place, the head page must be moved forward before the
597*4882a593Smuzhiyunwriter may move the tail page. The way this is done is that the writer
598*4882a593Smuzhiyunperforms a cmpxchg to convert the pointer to the head page from the HEADER
599*4882a593Smuzhiyunflag to have the UPDATE flag set. Once this is done, the reader will
600*4882a593Smuzhiyunnot be able to swap the head page from the buffer, nor will it be able to
601*4882a593Smuzhiyunmove the head page, until the writer is finished with the move.
602*4882a593Smuzhiyun
603*4882a593SmuzhiyunThis eliminates any races that the reader can have on the writer. The reader
604*4882a593Smuzhiyunmust spin, and this is why the reader cannot preempt the writer::
605*4882a593Smuzhiyun
606*4882a593Smuzhiyun              tail page
607*4882a593Smuzhiyun                 |
608*4882a593Smuzhiyun                 v
609*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
610*4882a593Smuzhiyun  <---|   |--->|   |-H->|   |--->|   |--->
611*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
612*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
613*4882a593Smuzhiyun
614*4882a593Smuzhiyun              tail page
615*4882a593Smuzhiyun                 |
616*4882a593Smuzhiyun                 v
617*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
618*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |--->
619*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
620*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
621*4882a593Smuzhiyun
622*4882a593SmuzhiyunThe following page will be made into the new head page::
623*4882a593Smuzhiyun
624*4882a593Smuzhiyun             tail page
625*4882a593Smuzhiyun                 |
626*4882a593Smuzhiyun                 v
627*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
628*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |--->
629*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
630*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
631*4882a593Smuzhiyun
632*4882a593SmuzhiyunAfter the new head page has been set, we can set the old head page
633*4882a593Smuzhiyunpointer back to NORMAL::
634*4882a593Smuzhiyun
635*4882a593Smuzhiyun             tail page
636*4882a593Smuzhiyun                 |
637*4882a593Smuzhiyun                 v
638*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
639*4882a593Smuzhiyun  <---|   |--->|   |--->|   |-H->|   |--->
640*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
641*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
642*4882a593Smuzhiyun
643*4882a593SmuzhiyunAfter the head page has been moved, the tail page may now move forward::
644*4882a593Smuzhiyun
645*4882a593Smuzhiyun                      tail page
646*4882a593Smuzhiyun                          |
647*4882a593Smuzhiyun                          v
648*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
649*4882a593Smuzhiyun  <---|   |--->|   |--->|   |-H->|   |--->
650*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
651*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
652*4882a593Smuzhiyun
653*4882a593Smuzhiyun
654*4882a593SmuzhiyunThe above are the trivial updates. Now for the more complex scenarios.
655*4882a593Smuzhiyun
656*4882a593Smuzhiyun
657*4882a593SmuzhiyunAs stated before, if enough writes preempt the first write, the
658*4882a593Smuzhiyuntail page may make it all the way around the buffer and meet the commit
659*4882a593Smuzhiyunpage. At this time, we must start dropping writes (usually with some kind
660*4882a593Smuzhiyunof warning to the user). But what happens if the commit was still on the
661*4882a593Smuzhiyunreader page? The commit page is not part of the ring buffer. The tail page
662*4882a593Smuzhiyunmust account for this::
663*4882a593Smuzhiyun
664*4882a593Smuzhiyun
665*4882a593Smuzhiyun            reader page    commit page
666*4882a593Smuzhiyun                |              |
667*4882a593Smuzhiyun                v              |
668*4882a593Smuzhiyun               +---+           |
669*4882a593Smuzhiyun               |   |<----------+
670*4882a593Smuzhiyun               |   |
671*4882a593Smuzhiyun               |   |------+
672*4882a593Smuzhiyun               +---+      |
673*4882a593Smuzhiyun                          |
674*4882a593Smuzhiyun                          v
675*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
676*4882a593Smuzhiyun  <---|   |--->|   |-H->|   |--->|   |--->
677*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
678*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
679*4882a593Smuzhiyun                 ^
680*4882a593Smuzhiyun                 |
681*4882a593Smuzhiyun             tail page
682*4882a593Smuzhiyun
683*4882a593SmuzhiyunIf the tail page were to simply push the head page forward, the commit when
684*4882a593Smuzhiyunleaving the reader page would not be pointing to the correct page.
685*4882a593Smuzhiyun
686*4882a593SmuzhiyunThe solution to this is to test if the commit page is on the reader page
687*4882a593Smuzhiyunbefore pushing the head page. If it is, then it can be assumed that the
688*4882a593Smuzhiyuntail page wrapped the buffer, and we must drop new writes.
689*4882a593Smuzhiyun
690*4882a593SmuzhiyunThis is not a race condition, because the commit page can only be moved
691*4882a593Smuzhiyunby the outermost writer (the writer that was preempted).
692*4882a593SmuzhiyunThis means that the commit will not move while a writer is moving the
693*4882a593Smuzhiyuntail page. The reader cannot swap the reader page if it is also being
694*4882a593Smuzhiyunused as the commit page. The reader can simply check that the commit
695*4882a593Smuzhiyunis off the reader page. Once the commit page leaves the reader page
696*4882a593Smuzhiyunit will never go back on it unless a reader does another swap with the
697*4882a593Smuzhiyunbuffer page that is also the commit page.
698*4882a593Smuzhiyun
699*4882a593Smuzhiyun
700*4882a593SmuzhiyunNested writes
701*4882a593Smuzhiyun-------------
702*4882a593Smuzhiyun
703*4882a593SmuzhiyunIn the pushing forward of the tail page we must first push forward
704*4882a593Smuzhiyunthe head page if the head page is the next page. If the head page
705*4882a593Smuzhiyunis not the next page, the tail page is simply updated with a cmpxchg.
706*4882a593Smuzhiyun
707*4882a593SmuzhiyunOnly writers move the tail page. This must be done atomically to protect
708*4882a593Smuzhiyunagainst nested writers::
709*4882a593Smuzhiyun
710*4882a593Smuzhiyun  temp_page = tail_page
711*4882a593Smuzhiyun  next_page = temp_page->next
712*4882a593Smuzhiyun  cmpxchg(tail_page, temp_page, next_page)
713*4882a593Smuzhiyun
714*4882a593SmuzhiyunThe above will update the tail page if it is still pointing to the expected
715*4882a593Smuzhiyunpage. If this fails, a nested write pushed it forward, the current write
716*4882a593Smuzhiyundoes not need to push it::
717*4882a593Smuzhiyun
718*4882a593Smuzhiyun
719*4882a593Smuzhiyun             temp page
720*4882a593Smuzhiyun                 |
721*4882a593Smuzhiyun                 v
722*4882a593Smuzhiyun              tail page
723*4882a593Smuzhiyun                 |
724*4882a593Smuzhiyun                 v
725*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
726*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
727*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
728*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
729*4882a593Smuzhiyun
730*4882a593SmuzhiyunNested write comes in and moves the tail page forward::
731*4882a593Smuzhiyun
732*4882a593Smuzhiyun                      tail page (moved by nested writer)
733*4882a593Smuzhiyun              temp page   |
734*4882a593Smuzhiyun                 |        |
735*4882a593Smuzhiyun                 v        v
736*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
737*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |--->
738*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
739*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
740*4882a593Smuzhiyun
741*4882a593SmuzhiyunThe above would fail the cmpxchg, but since the tail page has already
742*4882a593Smuzhiyunbeen moved forward, the writer will just try again to reserve storage
743*4882a593Smuzhiyunon the new tail page.
744*4882a593Smuzhiyun
745*4882a593SmuzhiyunBut the moving of the head page is a bit more complex::
746*4882a593Smuzhiyun
747*4882a593Smuzhiyun              tail page
748*4882a593Smuzhiyun                 |
749*4882a593Smuzhiyun                 v
750*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
751*4882a593Smuzhiyun  <---|   |--->|   |-H->|   |--->|   |--->
752*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
753*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
754*4882a593Smuzhiyun
755*4882a593SmuzhiyunThe write converts the head page pointer to UPDATE::
756*4882a593Smuzhiyun
757*4882a593Smuzhiyun              tail page
758*4882a593Smuzhiyun                 |
759*4882a593Smuzhiyun                 v
760*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
761*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |--->
762*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
763*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
764*4882a593Smuzhiyun
765*4882a593SmuzhiyunBut if a nested writer preempts here, it will see that the next
766*4882a593Smuzhiyunpage is a head page, but it is also nested. It will detect that
767*4882a593Smuzhiyunit is nested and will save that information. The detection is the
768*4882a593Smuzhiyunfact that it sees the UPDATE flag instead of a HEADER or NORMAL
769*4882a593Smuzhiyunpointer.
770*4882a593Smuzhiyun
771*4882a593SmuzhiyunThe nested writer will set the new head page pointer::
772*4882a593Smuzhiyun
773*4882a593Smuzhiyun             tail page
774*4882a593Smuzhiyun                 |
775*4882a593Smuzhiyun                 v
776*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
777*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |--->
778*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
779*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
780*4882a593Smuzhiyun
781*4882a593SmuzhiyunBut it will not reset the update back to normal. Only the writer
782*4882a593Smuzhiyunthat converted a pointer from HEAD to UPDATE will convert it back
783*4882a593Smuzhiyunto NORMAL::
784*4882a593Smuzhiyun
785*4882a593Smuzhiyun                      tail page
786*4882a593Smuzhiyun                          |
787*4882a593Smuzhiyun                          v
788*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
789*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |--->
790*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
791*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
792*4882a593Smuzhiyun
793*4882a593SmuzhiyunAfter the nested writer finishes, the outermost writer will convert
794*4882a593Smuzhiyunthe UPDATE pointer to NORMAL::
795*4882a593Smuzhiyun
796*4882a593Smuzhiyun
797*4882a593Smuzhiyun                      tail page
798*4882a593Smuzhiyun                          |
799*4882a593Smuzhiyun                          v
800*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
801*4882a593Smuzhiyun  <---|   |--->|   |--->|   |-H->|   |--->
802*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
803*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
804*4882a593Smuzhiyun
805*4882a593Smuzhiyun
806*4882a593SmuzhiyunIt can be even more complex if several nested writes came in and moved
807*4882a593Smuzhiyunthe tail page ahead several pages::
808*4882a593Smuzhiyun
809*4882a593Smuzhiyun
810*4882a593Smuzhiyun  (first writer)
811*4882a593Smuzhiyun
812*4882a593Smuzhiyun              tail page
813*4882a593Smuzhiyun                 |
814*4882a593Smuzhiyun                 v
815*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
816*4882a593Smuzhiyun  <---|   |--->|   |-H->|   |--->|   |--->
817*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
818*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
819*4882a593Smuzhiyun
820*4882a593SmuzhiyunThe write converts the head page pointer to UPDATE::
821*4882a593Smuzhiyun
822*4882a593Smuzhiyun              tail page
823*4882a593Smuzhiyun                 |
824*4882a593Smuzhiyun                 v
825*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
826*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |--->
827*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
828*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
829*4882a593Smuzhiyun
830*4882a593SmuzhiyunNext writer comes in, and sees the update and sets up the new
831*4882a593Smuzhiyunhead page::
832*4882a593Smuzhiyun
833*4882a593Smuzhiyun  (second writer)
834*4882a593Smuzhiyun
835*4882a593Smuzhiyun             tail page
836*4882a593Smuzhiyun                 |
837*4882a593Smuzhiyun                 v
838*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
839*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |--->
840*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
841*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
842*4882a593Smuzhiyun
843*4882a593SmuzhiyunThe nested writer moves the tail page forward. But does not set the old
844*4882a593Smuzhiyunupdate page to NORMAL because it is not the outermost writer::
845*4882a593Smuzhiyun
846*4882a593Smuzhiyun                      tail page
847*4882a593Smuzhiyun                          |
848*4882a593Smuzhiyun                          v
849*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
850*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |--->
851*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
852*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
853*4882a593Smuzhiyun
854*4882a593SmuzhiyunAnother writer preempts and sees the page after the tail page is a head page.
855*4882a593SmuzhiyunIt changes it from HEAD to UPDATE::
856*4882a593Smuzhiyun
857*4882a593Smuzhiyun  (third writer)
858*4882a593Smuzhiyun
859*4882a593Smuzhiyun                      tail page
860*4882a593Smuzhiyun                          |
861*4882a593Smuzhiyun                          v
862*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
863*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-U->|   |--->
864*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
865*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
866*4882a593Smuzhiyun
867*4882a593SmuzhiyunThe writer will move the head page forward::
868*4882a593Smuzhiyun
869*4882a593Smuzhiyun
870*4882a593Smuzhiyun  (third writer)
871*4882a593Smuzhiyun
872*4882a593Smuzhiyun                      tail page
873*4882a593Smuzhiyun                          |
874*4882a593Smuzhiyun                          v
875*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
876*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-U->|   |-H->
877*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
878*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
879*4882a593Smuzhiyun
880*4882a593SmuzhiyunBut now that the third writer did change the HEAD flag to UPDATE it
881*4882a593Smuzhiyunwill convert it to normal::
882*4882a593Smuzhiyun
883*4882a593Smuzhiyun
884*4882a593Smuzhiyun  (third writer)
885*4882a593Smuzhiyun
886*4882a593Smuzhiyun                      tail page
887*4882a593Smuzhiyun                          |
888*4882a593Smuzhiyun                          v
889*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
890*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |-H->
891*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
892*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
893*4882a593Smuzhiyun
894*4882a593Smuzhiyun
895*4882a593SmuzhiyunThen it will move the tail page, and return back to the second writer::
896*4882a593Smuzhiyun
897*4882a593Smuzhiyun
898*4882a593Smuzhiyun  (second writer)
899*4882a593Smuzhiyun
900*4882a593Smuzhiyun                               tail page
901*4882a593Smuzhiyun                                   |
902*4882a593Smuzhiyun                                   v
903*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
904*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |-H->
905*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
906*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
907*4882a593Smuzhiyun
908*4882a593Smuzhiyun
909*4882a593SmuzhiyunThe second writer will fail to move the tail page because it was already
910*4882a593Smuzhiyunmoved, so it will try again and add its data to the new tail page.
911*4882a593SmuzhiyunIt will return to the first writer::
912*4882a593Smuzhiyun
913*4882a593Smuzhiyun
914*4882a593Smuzhiyun  (first writer)
915*4882a593Smuzhiyun
916*4882a593Smuzhiyun                               tail page
917*4882a593Smuzhiyun                                   |
918*4882a593Smuzhiyun                                   v
919*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
920*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |-H->
921*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
922*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
923*4882a593Smuzhiyun
924*4882a593SmuzhiyunThe first writer cannot know atomically if the tail page moved
925*4882a593Smuzhiyunwhile it updates the HEAD page. It will then update the head page to
926*4882a593Smuzhiyunwhat it thinks is the new head page::
927*4882a593Smuzhiyun
928*4882a593Smuzhiyun
929*4882a593Smuzhiyun  (first writer)
930*4882a593Smuzhiyun
931*4882a593Smuzhiyun                               tail page
932*4882a593Smuzhiyun                                   |
933*4882a593Smuzhiyun                                   v
934*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
935*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |-H->
936*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
937*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
938*4882a593Smuzhiyun
939*4882a593SmuzhiyunSince the cmpxchg returns the old value of the pointer the first writer
940*4882a593Smuzhiyunwill see it succeeded in updating the pointer from NORMAL to HEAD.
941*4882a593SmuzhiyunBut as we can see, this is not good enough. It must also check to see
942*4882a593Smuzhiyunif the tail page is either where it use to be or on the next page::
943*4882a593Smuzhiyun
944*4882a593Smuzhiyun
945*4882a593Smuzhiyun  (first writer)
946*4882a593Smuzhiyun
947*4882a593Smuzhiyun                 A        B    tail page
948*4882a593Smuzhiyun                 |        |        |
949*4882a593Smuzhiyun                 v        v        v
950*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
951*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |-H->|   |-H->
952*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
953*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
954*4882a593Smuzhiyun
955*4882a593SmuzhiyunIf tail page != A and tail page != B, then it must reset the pointer
956*4882a593Smuzhiyunback to NORMAL. The fact that it only needs to worry about nested
957*4882a593Smuzhiyunwriters means that it only needs to check this after setting the HEAD page::
958*4882a593Smuzhiyun
959*4882a593Smuzhiyun
960*4882a593Smuzhiyun  (first writer)
961*4882a593Smuzhiyun
962*4882a593Smuzhiyun                 A        B    tail page
963*4882a593Smuzhiyun                 |        |        |
964*4882a593Smuzhiyun                 v        v        v
965*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
966*4882a593Smuzhiyun  <---|   |--->|   |-U->|   |--->|   |-H->
967*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
968*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
969*4882a593Smuzhiyun
970*4882a593SmuzhiyunNow the writer can update the head page. This is also why the head page must
971*4882a593Smuzhiyunremain in UPDATE and only reset by the outermost writer. This prevents
972*4882a593Smuzhiyunthe reader from seeing the incorrect head page::
973*4882a593Smuzhiyun
974*4882a593Smuzhiyun
975*4882a593Smuzhiyun  (first writer)
976*4882a593Smuzhiyun
977*4882a593Smuzhiyun                 A        B    tail page
978*4882a593Smuzhiyun                 |        |        |
979*4882a593Smuzhiyun                 v        v        v
980*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
981*4882a593Smuzhiyun  <---|   |--->|   |--->|   |--->|   |-H->
982*4882a593Smuzhiyun  --->|   |<---|   |<---|   |<---|   |<---
983*4882a593Smuzhiyun      +---+    +---+    +---+    +---+
984