Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <stern@rowland.harvard.edu>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
  10. THE READS-FROM RELATION: rf, rfi, and rfe
  11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
  12. THE FROM-READS RELATION: fr, fri, and fre
  13. AN OPERATIONAL MODEL
  14. PROPAGATION ORDER RELATION: cumul-fence
  15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
  16. SEQUENTIAL CONSISTENCY PER VARIABLE
  17. ATOMIC UPDATES: rmw
  18. THE PRESERVED PROGRAM ORDER RELATION: ppo
  19. AND THEN THERE WAS ALPHA
  20. THE HAPPENS-BEFORE RELATION: hb
  21. THE PROPAGATES-BEFORE RELATION: pb
  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
  23. LOCKING
  24. PLAIN ACCESSES AND DATA RACES
  25. ODDS AND ENDS



INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure.  This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.

This document describes the ideas underlying the LKMM.  It is meant
for people who want to understand how the model was designed.  It does
not go into the details of the code in the .bell and .cat files;
rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
the basic notions shared by all such models.  People already familiar
with these concepts can skim or skip over them.  Sections 6 (EVENTS)
through 12 (THE FROM-READS RELATION) describe the fundamental
relations used in many models.  Starting in Section 13 (AN OPERATIONAL
MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the
proper format for litmus tests.  They don't include a header line, the
initializations are not enclosed in braces, the global variables are
not passed by pointers, and they don't have an "exists" clause at the
end.  Converting them to the right format is left as an exercise for
the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is
something which predicts, given a piece of computer code running on a
particular kind of system, what values may be obtained by the code's
load instructions.  The LKMM makes these predictions for code running
as part of the Linux kernel.

In practice, people tend to use memory models the other way around.
That is, given a piece of code and a collection of values specified
for the loads, the model will predict whether it is possible for the
code to run in such a way that the loads will indeed obtain the
specified values.  Of course, this is just another way of expressing
the same idea.

For code running on a uniprocessor system, the predictions are easy:
Each load instruction must obtain the value written by the most recent
store instruction accessing the same location (we ignore complicating
factors such as DMA and mixed-size accesses).  But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux
kernel supports a variety of architectures.  The LKMM has to be fairly
permissive, in the sense that any behavior allowed by one of these
architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts.  Consider
some code running as part of a device driver for an input device.  The
driver might contain an interrupt handler which collects data from the
device, stores it in a buffer, and sets a flag to indicate the buffer
is full.  Running concurrently on a different CPU might be a part of
the driver code being executed by a process in the midst of a read(2)
system call.  This code tests the flag to see whether the buffer is
ready, and if it is, copies the data back to userspace.  The buffer
and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
assignment statements is discussed later):

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}

Here the P0() function represents the interrupt handler running on one
CPU and P1() represents the read() routine running on another.  The
value 1 stored in buf represents input data collected from the device.
Thus, P0 stores the data in buf and then sets flag.  Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
data from buf into a second private variable r2 for copying to
userspace.  (Presumably if flag is not set then the driver will wait a
while and try again.)

This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" or MP
pattern.  It is typical of memory access patterns in the kernel.

Please note that this example code is a simplified abstraction.  Real
buffers are usually larger than a single integer, real device drivers
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
private variables before using them.  All that is beside the point;
the idea here is simply to illustrate the overall pattern of memory
accesses by the CPUs.
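
For reference, here is a rough sketch of how this MP pattern could be
written in the proper litmus-test format mentioned in the
Introduction.  The test name and the "exists" clause are illustrative
additions, and P1's conditional load is made unconditional to keep the
sketch simple:

	C MP

	{}

	P0(int *buf, int *flag)
	{
		WRITE_ONCE(*buf, 1);
		WRITE_ONCE(*flag, 1);
	}

	P1(int *buf, int *flag)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*flag);
		r2 = READ_ONCE(*buf);
	}

	exists (1:r1=1 /\ 1:r2=0)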

A memory model will predict what values P1 might obtain for its loads
from flag and buf, or equivalently, what values r1 and r2 might end up
with after the code has finished running.

Some predictions are trivial.  For instance, no sane memory model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple.  For
instance, P1 might run entirely before P0 begins, in which case r1 and
r2 will both be 0 at the end.  Or P0 might run entirely before P1
begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two
routines run concurrently.  One possibility is that P1 runs after P0's
store to buf but before the store to flag.  In this case, r1 and r2
will again both be 0.  (If P1 had been designed to read buf
unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0.
If this were to occur it would mean the driver contains a bug, because
incorrect data would get sent to the user: 0 instead of 1.  As it
happens, the LKMM does predict this outcome can occur, and the example
driver code shown above is indeed buggy.
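
As a preview of material covered later in this document, one common
way to repair this kind of bug (a sketch, not part of the original
driver example) is to use a release store and an acquire load for
flag, which forbids the r1 = 1, r2 = 0 outcome under the LKMM:

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_store_release(&flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = smp_load_acquire(&flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}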


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand,
is Sequential Consistency.  According to this model, systems behave as
if each CPU executed its instructions in order but with unspecified
timing.  In other words, the instructions from the various CPUs get
interleaved in a nondeterministic way, always according to some single
global order that agrees with the order of the instructions in the
program source for each CPU.  The model says that the value obtained
by each load is simply the value written by the most recently executed
store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts
that the undesired result r1 = 1, r2 = 0 cannot occur.  The reasoning
goes like this:

	Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from
	it, as loads can obtain values only from earlier stores.

	P1 loads from flag before loading from buf, since CPUs execute
	their instructions in order.

	P1 must load 0 from buf before P0 stores 1 to it; otherwise r2
	would be 1 since a load obtains its value from the most recent
	store to the same address.

	P0 stores 1 to buf before storing 1 to flag, since it executes
	its instructions in order.

	Since an instruction (in this case, P0's store to flag) cannot
	execute before itself, the specified outcome is impossible.

However, real computer hardware almost never follows the Sequential
Consistency memory model; doing so would rule out too many valuable
performance optimizations.  On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
r2 = 0.

x86 and SPARC follow yet another memory model: TSO (Total Store
Ordering).  This model predicts that the undesired outcome for the MP
pattern cannot occur, but in other respects it differs from Sequential
Consistency.  One example is the Store Buffer (SB) pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
	}

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is
impossible.  (Exercise: Figure out the reasoning.)  But TSO allows
this outcome to occur, and in fact it does sometimes occur on x86 and
SPARC systems.

The LKMM was inspired by the memory models followed by PowerPC, ARM,
x86, Alpha, and other architectures.  However, it is different in
detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering.  Often this is temporal ordering
(i.e., the order in which certain events occur) but it doesn't have to
be; consider for example the order of instructions in a program's
source code.  We saw above that Sequential Consistency makes an
important assumption that CPUs execute instructions in the same order
as those instructions occur in the code, and there are many other
instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle.  Ordering rules out cycles:
It's not possible to have X ordered before Y, Y ordered before Z, and
Z ordered before X, because this would mean that X is ordered before
itself.  The analysis of the MP example under Sequential Consistency
involved just such an impossible cycle:

	W: P0 stores 1 to flag   executes before
	X: P1 loads 1 from flag  executes before
	Y: P1 loads 0 from buf   executes before
	Z: P0 stores 1 to buf    executes before
	W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered,
and a certain outcome for the loads in a piece of code can happen only
if those accesses would form a cycle, then the memory model predicts
that outcome cannot occur.

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up
kernel source code.  Instead it considers the effects of those
statements in a more abstract form, namely, events.  The model
includes three types of events:

	Read events correspond to loads from shared memory, such as
	calls to READ_ONCE(), smp_load_acquire(), or
	rcu_dereference().

	Write events correspond to stores to shared memory, such as
	calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

	Fence events correspond to memory barriers (also known as
	fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be
a fence.  This happens with functions like smp_load_acquire() or
spin_lock().  However, no single event can be both a read and a write.
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
correspond to a pair of events: a read followed by a write.  (The
write event is omitted for executions where it doesn't occur, such as
a cmpxchg() where the comparison fails.)
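
For instance, given a shared variable x, a sketch like

	r1 = cmpxchg(&x, 0, 1);

always produces a read event for the load of x, but the write event
(storing 1) is present only in executions where the comparison
succeeds, that is, where the value loaded was 0.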

Other parts of the code, those which do not involve interaction with
shared memory, do not give rise to events.  Thus, arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers are not of central interest to the
memory model.  They only affect the model's predictions indirectly.
For example, an arithmetic computation might determine the value that
gets stored to a shared memory location (or in the case of an array
index, the address where the value gets stored), but the memory model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will
describe in the following sections.  The memory model requires certain
of these relations to be orderings, that is, it requires them not to
have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po).  You
can think of it as the order in which statements occur in the source
code after branches are taken into account and loops have been
unrolled.  A better description might be the order in which
instructions are presented to a CPU's execution unit.  Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po.  Also, it is by definition
an ordering so it cannot have any cycles.

po-loc is a sub-relation of po.  It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).

Although this may seem straightforward, there is one subtle aspect to
program order we need to explain.  The LKMM was inspired by low-level
architectural memory models which describe the behavior of machine
code, and it retains their outlook to a considerable extent.  The
read, write, and fence events used by the model are close in spirit to
individual machine instructions.  Nevertheless, the LKMM describes
kernel code written in C, and the mapping from C to machine code can
be extremely complex.

Optimizing compilers have great freedom in the way they translate
source code to object code.  They are allowed to apply transformations
that add memory accesses, eliminate accesses, combine them, split them
into pieces, or move them around.  The use of READ_ONCE(), WRITE_ONCE(),
or one of the other atomic or synchronization primitives prevents a
large number of compiler optimizations.  In particular, it is guaranteed
that the compiler will not remove such accesses from the generated code
(unless it can prove the accesses will never be executed), it will not
change the order in which they occur in the code (within limits imposed
by the C standard), and it will not introduce extraneous accesses.

The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
than ordinary memory accesses.  Thanks to this usage, we can be certain
that in the MP example, the compiler won't reorder P0's write event to
buf and P0's write event to flag, and similarly for the other shared
memory accesses in the examples.

Since private variables are not shared between CPUs, they can be
accessed normally without READ_ONCE() or WRITE_ONCE().  In fact, they
need not even be stored in normal memory at all -- in principle a
private variable could be stored in a CPU register (hence the convention
that these variables have names starting with the letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect, and under some circumstances it is possible for the
compiler to undermine the memory model.  Here is an example.  Suppose
both branches of an "if" statement store the same value to the same
location:

	r1 = READ_ONCE(x);
	if (r1) {
		WRITE_ONCE(y, 2);
		...  /* do something */
	} else {
		WRITE_ONCE(y, 2);
		...  /* do something else */
	}

For this code, the LKMM predicts that the load from x will always be
executed before either of the stores to y.  However, a compiler could
lift the stores out of the conditional, transforming the code into
something resembling:

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, 2);
	if (r1) {
		...  /* do something */
	} else {
		...  /* do something else */
	}

Given this version of the code, the LKMM would predict that the load
from x could be executed after the store to y.  Thus, the memory
model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many
operators and function calls can be evaluated in any order.  For
example:

	r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the
memory model cannot assume there is a fixed program order relation
between them.  (In fact, if the function calls are inlined then the
compiler might even interleave their object code.)


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the
execution of the second event depends in some way on a value obtained
from memory by the first.  The first event must be a read, and the
value it obtains must somehow affect what the second event does.
There are three kinds of dependencies: data, address (addr), and
control (ctrl).

A read and a write event are linked by a data dependency if the value
obtained by the read affects the value stored by the write.  As a very
simple example:

	int x, y;

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value
loaded by the READ_ONCE.  Such dependencies can wind through
arbitrarily complicated computations, and a write can depend on the
values of multiple reads.

A read event and another memory access event are linked by an address
dependency if the value obtained by the read affects the location
accessed by the other event.  The second event can be either a read or
a write.  Here's another simple example:

	int a[20];
	int i;

	r1 = READ_ONCE(i);
	r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the
index value loaded by the first.  Pointer indirection also gives rise
to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.
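
For instance, in the following sketch (using a hypothetical shared
pointer ptr), the address accessed by the second READ_ONCE() is
whatever value the first READ_ONCE() loaded, so the two loads are
linked by an address dependency:

	int *r1;
	int r2;

	r1 = READ_ONCE(ptr);
	r2 = READ_ONCE(*r1);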

Finally, a read event and another memory access event are linked by a
control dependency if the value obtained by the read affects whether
the second event is executed at all.  Simple example:

	int x, y;

	r1 = READ_ONCE(x);
	if (r1)
		WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression
which depends on the value obtained by the READ_ONCE(); hence there is
a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that
come earlier in program order.  Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X.  It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value loaded by the read is the value that was stored by the
write.  In colloquial terms, the load "reads from" the store.  We
write W ->rf R to indicate that the load R reads from the store W.  We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as
though it had been written there by an imaginary initial store that
executes on a separate CPU before the main program runs.

Usage of the rf relation implicitly assumes that loads will always
read from a single store.  It doesn't apply properly in the presence
of load-tearing, where a load obtains some of its bits from one store
and some of them from another store.  Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 0x1234);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
	}

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size
accesses are used.  Consider this example:

	union {
		u32	w;
		u16	h[2];
	} x;

	P0()
	{
		WRITE_ONCE(x.h[0], 0x1234);
		WRITE_ONCE(x.h[1], 0x5678);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x.w);
	}

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores.  It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt to do so.  It requires all accesses to be properly aligned and
of the location's actual size.


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a
multi-processor system, the CPUs must share a consistent view of the
memory contents.  Specifically, it requires that for each location in
shared memory, the stores to that location must form a single global
ordering which all the CPUs agree on (the coherence order), and this
ordering must be consistent with the program order for accesses to
that location.

To put it another way, for any variable x, the coherence order (co) of
the stores to x is simply the order in which the stores overwrite one
another.  The imaginary store which establishes x's initial value
comes first in the coherence order; the store which directly
overwrites the initial value comes second; the store which overwrites
that value comes third, and so on.

You can think of the coherence order as being the order in which the
stores reach x's location in memory (or if you prefer a more
hardware-centric view, the order in which the stores get written to
x's cache line).  We write W ->co W' if W comes before W' in the
coherence order, that is, if the value stored by W gets overwritten,
directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order.  This
requirement takes the form of four coherency rules:

	Write-write coherence: If W ->po-loc W' (i.e., W comes before
	W' in program order and they access the same location), where W
	and W' are two stores, then W ->co W'.

	Write-read coherence: If W ->po-loc R, where W is a store and R
	is a load, then R must read from W or from some other store
	which comes after W in the coherence order.

	Read-write coherence: If R ->po-loc W, where R is a load and W
	is a store, then the store which R reads from must come before
	W in the coherence order.

	Read-read coherence: If R ->po-loc R', where R and R' are two
	loads, then either they read from the same store or else the
	store read by R comes before the store read by R' in the
	coherence order.

This is sometimes referred to as sequential consistency per variable,
because it means that the accesses to any single memory location obey
the rules of the Sequential Consistency memory model.  (According to
Wikipedia, sequential consistency per variable and cache coherence
mean the same thing except that cache coherence includes an extra
requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence.  Indeed, our
expectation of cache coherence is so deeply ingrained that violations
of its requirements look more like hardware bugs than programming
errors:

	int x;

	P0()
	{
		WRITE_ONCE(x, 17);
		WRITE_ONCE(x, 23);
	}

If the final value stored in x after this code ran was 17, you would
think your computer was broken.  It would be a violation of the
write-write coherence rule: Since the store of 23 comes later in
program order, it must also come later in x's coherence order and
thus must overwrite the store of 17.

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 666);
	}

If r1 = 666 at the end, this would violate the read-write coherence
rule: The READ_ONCE() load comes before the WRITE_ONCE() store in
program order, so it must not read from that store but rather from one
coming earlier in the coherence order (in this case, x's initial
value).

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 5);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
imaginary store which establishes x's initial value) at the end, this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so it must not read from a store that
comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of
READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5
and r2 = 0!  This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another motivation for using READ_ONCE() when accessing shared memory
locations.)

Just like the po relation, co is inherently an ordering -- it is not
possible for a store to directly or indirectly overwrite itself!  And
just like with the rf relation, we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never
related by co, just as instructions on different CPUs are never
related by po.  Coherence order is strictly per-location, or if you
prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.  It describes the situation where a load reads a value that gets
overwritten by a store.  In other words, we have R ->fr W when the
value that R reads is overwritten (directly or indirectly) by W, or
equivalently, when R reads from a store which comes earlier than W in
the coherence order.

For example:

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);
	}

The value loaded from x will be 0 (assuming cache coherence!), and it
gets overwritten by the value 2.  Thus there is an fr link from the
READ_ONCE() to the WRITE_ONCE().  If the code contained any later
stores to x, there would also be fr links from the READ_ONCE() to
them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that the fr relation is determined entirely by the rf and co
relations; it is not independent.  Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W.  In symbols,

	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that
the models arise from an abstract view of how a computer system
operates.  Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem.
The CPUs are responsible for executing instructions (not necessarily
in program order), and they communicate with the memory subsystem.
For the most part, executing an instruction requires a CPU to perform
only internal operations.  However, loads, stores, and fences involve
more.

When CPU C executes a store instruction, it tells the memory subsystem
to store a certain value at a certain location.  The memory subsystem
propagates the store to all the other CPUs as well as to RAM.  (As a
special case, we say that the store propagates to its own CPU at the
time it is executed.)  The memory subsystem also determines where the
store falls in the location's coherence order.  In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order.  If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R.  Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory.  The memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.
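
As a simple illustration (a sketch, not taken from the discussion
above):

	int x = 0;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(x);
	}

If the store to x has not yet executed when the load is carried out,
the load obtains the value 1 by forwarding; otherwise the load is
satisfied from memory and again obtains 1, since P0's store has
already propagated to its own CPU.  Either way r1 ends up equal to 1.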

(In fact, the picture needs to be a little more complicated than this.
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache.  A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed.  On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model.  But on Alpha, the local caches
have a partitioned design that results in non-FIFO behavior.  We will
discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be
restarted under certain circumstances.  The memory model ignores these
premature executions; we simply say that the load executes at the
final time it is forwarded or satisfied.

Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence.  However, fences do constrain the way CPUs and the
memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in
program order.  Exactly which instructions are ordered depends on the
type of fence:

	Strong fences, including smp_mb() and synchronize_rcu(), force
	the CPU to execute all po-earlier instructions before any
	po-later instructions;

	smp_rmb() forces the CPU to execute all po-earlier loads
	before any po-later loads;

	smp_wmb() forces the CPU to execute all po-earlier stores
	before any po-later stores;

	Acquire fences, such as smp_load_acquire(), force the CPU to
	execute the load associated with the fence (e.g., the load
	part of an smp_load_acquire()) before any po-later
	instructions;

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).

Second, some types of fence affect the way the memory subsystem
propagates stores.  When a fence instruction is executed on CPU C:

	For each other CPU C', smp_wmb() forces all po-earlier stores
	on C to propagate to C' before any po-later stores do.

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

	Any store which propagates to C before a strong fence is
	executed (including all po-earlier stores on C) is forced to
	propagate to all other CPUs before any instructions po-after
	the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence.  We describe this property by saying that release fences and
strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).
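
A classic illustration of A-cumulativity is a three-CPU variant of the
message-passing pattern (a sketch, not taken from the text above):

	int x = 0, y = 0;

	P0()
	{
		WRITE_ONCE(x, 1);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
		smp_store_release(&y, 1);
	}

	P2()
	{
		int r2, r3;

		r2 = smp_load_acquire(&y);
		r3 = READ_ONCE(x);
	}

If r1 = 1 and r2 = 1 then P0's store to x propagated to P1 before P1's
release fence executed, so A-cumulativity forces it to propagate to P2
before the store to y does; hence r3 must also be 1.  An smp_wmb() in
P1 instead of the release would not, by itself, give this guarantee in
the operational model, because P0's store was not executed on P1.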

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
defined to link memory access events E and F whenever:

	E and F are both stores on the same CPU and an smp_wmb() fence
	event occurs between them in program order; or

	F is a release fence and some X comes before F in program order,
	where either X = E or else E ->rf X; or

	A strong fence event occurs between some X and F in program
	order, where either X = E or else E ->rf X.

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does.  However, for different CPUs C and C', it does not
require W to propagate to C before W' propagates to C'.
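
For example, in the sketch

	P0()
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

the store to x is cumul-fence-linked to the store to y (the first case
in the list above), and so the store to x must propagate to any given
CPU before the store to y does.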


DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design
outlined above.  These restrictions involve the necessity of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is, among other things.

The formal version of the LKMM is defined by six requirements, or
axioms:

	Sequential consistency per variable: This requires that the
	system obey the four coherency rules.

	Atomicity: This requires that atomic read-modify-write
	operations really are atomic, that is, no other stores can
	sneak into the middle of such an update.

	Happens-before: This requires that certain instructions are
	executed in a specific order.

	Propagation: This requires that certain stores propagate to
	CPUs and to RAM in a specific order.

	Rcu: This requires that RCU read-side critical sections and
	grace periods obey the rules of RCU, in particular, the
	Grace-Period Guarantee.

	Plain-coherence: This requires that plain memory accesses
	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
	the operational model's rules regarding cache coherence.

The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11).  The "happens-before" and
"propagation" axioms have analogs in other memory models as well.  The
"rcu" and "plain-coherence" axioms are specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed
shared location in memory form a global ordering.  We can imagine
inserting the loads from that location into this ordering, by placing
each load between the store that it reads from and the following
store.  This leaves the relative positions of loads that read from the
same store unspecified; let's say they are inserted in program order,
first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering.  The LKMM's
"coherence" axiom expresses this by requiring the union of these
relations not to have any cycles.  This means it must not be possible
to find events

	X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc.  This has to
hold if the accesses to the fixed memory location can be ordered as
cache coherence demands.
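
For example, the read-read coherence violation shown earlier (r1 = 5
and r2 = 0) corresponds to just such a cycle.  Writing W for P0's
store of 5, W0 for the imaginary initial store, and R1, R2 for P1's
two loads, that outcome would require

	R1 ->po-loc R2, R2 ->fr W (since W0 ->rf R2 and W0 ->co W),
	and W ->rf R1,

which forms the forbidden cycle R1 ->po-loc R2 ->fr W ->rf R1.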

Although it is not obvious, it can be shown that the converse is also
true: This LKMM axiom implies that the four coherency rules are
obeyed.


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?  It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two.  We should never have
the following sequence of events:

	CPU 0 loads x obtaining 13;
					CPU 1 loads x obtaining 13;
	CPU 0 stores 14 to x;
					CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.  To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.
945*4882a593Smuzhiyun
946*4882a593SmuzhiyunThe same analysis applies to all atomic update operations.  Therefore,
947*4882a593Smuzhiyunto enforce atomicity the LKMM requires that atomic updates follow this
948*4882a593Smuzhiyunrule: Whenever R and W are the read and write events composing an
949*4882a593Smuzhiyunatomic read-modify-write and W' is the write event which R reads from,
950*4882a593Smuzhiyunthere must not be any stores coming between W' and W in the coherence
951*4882a593Smuzhiyunorder.  Equivalently,
952*4882a593Smuzhiyun
953*4882a593Smuzhiyun	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),
954*4882a593Smuzhiyun
955*4882a593Smuzhiyunwhere the rmw relation links the read and write events making up each
956*4882a593Smuzhiyunatomic update.  This is what the LKMM's "atomic" axiom says.
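
For instance, treating x as an atomic_t, the scenario described at
the start of this section corresponds to a sketch along these lines;
because no store may intrude between either CPU's read and write of
x, the final value of x must be 2, never 1:

	atomic_t x = ATOMIC_INIT(0);

	P0()
	{
		atomic_inc(&x);
	}

	P1()
	{
		atomic_inc(&x);
	}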
957*4882a593Smuzhiyun
958*4882a593Smuzhiyun
959*4882a593SmuzhiyunTHE PRESERVED PROGRAM ORDER RELATION: ppo
960*4882a593Smuzhiyun-----------------------------------------
961*4882a593Smuzhiyun
962*4882a593SmuzhiyunThere are many situations where a CPU is obliged to execute two
963*4882a593Smuzhiyuninstructions in program order.  We amalgamate them into the ppo (for
964*4882a593Smuzhiyun"preserved program order") relation, which links the po-earlier
965*4882a593Smuzhiyuninstruction to the po-later instruction and is thus a sub-relation of
966*4882a593Smuzhiyunpo.
967*4882a593Smuzhiyun
968*4882a593SmuzhiyunThe operational model already includes a description of one such
969*4882a593Smuzhiyunsituation: Fences are a source of ppo links.  Suppose X and Y are
970*4882a593Smuzhiyunmemory accesses with X ->po Y; then the CPU must execute X before Y if
971*4882a593Smuzhiyunany of the following hold:
972*4882a593Smuzhiyun
973*4882a593Smuzhiyun	A strong (smp_mb() or synchronize_rcu()) fence occurs between
974*4882a593Smuzhiyun	X and Y;
975*4882a593Smuzhiyun
976*4882a593Smuzhiyun	X and Y are both stores and an smp_wmb() fence occurs between
977*4882a593Smuzhiyun	them;
978*4882a593Smuzhiyun
979*4882a593Smuzhiyun	X and Y are both loads and an smp_rmb() fence occurs between
980*4882a593Smuzhiyun	them;
981*4882a593Smuzhiyun
982*4882a593Smuzhiyun	X is also an acquire fence, such as smp_load_acquire();
983*4882a593Smuzhiyun
984*4882a593Smuzhiyun	Y is also a release fence, such as smp_store_release().
985*4882a593Smuzhiyun
986*4882a593SmuzhiyunAnother possibility, not mentioned earlier but discussed in the next
987*4882a593Smuzhiyunsection, is:
988*4882a593Smuzhiyun
989*4882a593Smuzhiyun	X and Y are both loads, X ->addr Y (i.e., there is an address
990*4882a593Smuzhiyun	dependency from X to Y), and X is a READ_ONCE() or an atomic
991*4882a593Smuzhiyun	access.
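
As a quick illustration of the acquire and release cases listed
above, here is a rough sketch of the MP pattern written with a
release store and an acquire load.  The release guarantees that the
store to buf propagates to P1 before the store to flag does, and the
acquire guarantees that P1's load of flag executes before its load of
buf, so the r1 = 1, r2 = 0 outcome is forbidden:

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_store_release(&flag, 1);
	}

	P1()
	{
		int r1, r2;

		r1 = smp_load_acquire(&flag);
		r2 = READ_ONCE(buf);
	}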
992*4882a593Smuzhiyun
993*4882a593SmuzhiyunDependencies can also cause instructions to be executed in program
994*4882a593Smuzhiyunorder.  This is uncontroversial when the second instruction is a
995*4882a593Smuzhiyunstore; either a data, address, or control dependency from a load R to
996*4882a593Smuzhiyuna store W will force the CPU to execute R before W.  This is very
997*4882a593Smuzhiyunsimply because the CPU cannot tell the memory subsystem about W's
998*4882a593Smuzhiyunstore before it knows what value should be stored (in the case of a
999*4882a593Smuzhiyundata dependency), what location it should be stored into (in the case
1000*4882a593Smuzhiyunof an address dependency), or whether the store should actually take
1001*4882a593Smuzhiyunplace (in the case of a control dependency).
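
To make this concrete, here is a sketch of the load-buffering pattern
with a data dependency on each CPU.  Because each dependency forces
the load to execute before the po-later store, the r0 = 1, r1 = 1
outcome (each CPU reading the value stored by the other) is
forbidden:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		r0 = READ_ONCE(x);
		WRITE_ONCE(y, r0);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(y);
		WRITE_ONCE(x, r1);
	}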
1002*4882a593Smuzhiyun
1003*4882a593SmuzhiyunDependencies to load instructions are more problematic.  To begin with,
1004*4882a593Smuzhiyunthere is no such thing as a data dependency to a load.  Next, a CPU
1005*4882a593Smuzhiyunhas no reason to respect a control dependency to a load, because it
1006*4882a593Smuzhiyuncan always satisfy the second load speculatively before the first, and
1007*4882a593Smuzhiyunthen ignore the result if it turns out that the second load shouldn't
1008*4882a593Smuzhiyunbe executed after all.  And lastly, the real difficulties begin when
1009*4882a593Smuzhiyunwe consider address dependencies to loads.
1010*4882a593Smuzhiyun
1011*4882a593SmuzhiyunTo be fair about it, all Linux-supported architectures do execute
1012*4882a593Smuzhiyunloads in program order if there is an address dependency between them.
1013*4882a593SmuzhiyunAfter all, a CPU cannot ask the memory subsystem to load a value from
1014*4882a593Smuzhiyuna particular location before it knows what that location is.  However,
1015*4882a593Smuzhiyunthe split-cache design used by Alpha can cause it to behave in a way
1016*4882a593Smuzhiyunthat looks as if the loads were executed out of order (see the next
1017*4882a593Smuzhiyunsection for more details).  The kernel includes a workaround for this
1018*4882a593Smuzhiyunproblem when the loads come from READ_ONCE(), and therefore the LKMM
1019*4882a593Smuzhiyunincludes address dependencies to loads in the ppo relation.
1020*4882a593Smuzhiyun
1021*4882a593SmuzhiyunOn the other hand, dependencies can indirectly affect the ordering of
1022*4882a593Smuzhiyuntwo loads.  This happens when there is a dependency from a load to a
1023*4882a593Smuzhiyunstore and a second, po-later load reads from that store:
1024*4882a593Smuzhiyun
1025*4882a593Smuzhiyun	R ->dep W ->rfi R',
1026*4882a593Smuzhiyun
1027*4882a593Smuzhiyunwhere the dep link can be either an address or a data dependency.  In
1028*4882a593Smuzhiyunthis situation we know it is possible for the CPU to execute R' before
1029*4882a593SmuzhiyunW, because it can forward the value that W will store to R'.  But it
1030*4882a593Smuzhiyuncannot execute R' before R, because it cannot forward the value before
1031*4882a593Smuzhiyunit knows what that value is, or that W and R' do access the same
1032*4882a593Smuzhiyunlocation.  However, if there is merely a control dependency between R
1033*4882a593Smuzhiyunand W then the CPU can speculatively forward W to R' before executing
1034*4882a593SmuzhiyunR; if the speculation turns out to be wrong then the CPU merely has to
1035*4882a593Smuzhiyunrestart or abandon R'.
1036*4882a593Smuzhiyun
1037*4882a593Smuzhiyun(In theory, a CPU might forward a store to a load when it runs across
1038*4882a593Smuzhiyunan address dependency like this:
1039*4882a593Smuzhiyun
1040*4882a593Smuzhiyun	r1 = READ_ONCE(ptr);
1041*4882a593Smuzhiyun	WRITE_ONCE(*r1, 17);
1042*4882a593Smuzhiyun	r2 = READ_ONCE(*r1);
1043*4882a593Smuzhiyun
1044*4882a593Smuzhiyunbecause it could tell that the store and the second load access the
1045*4882a593Smuzhiyunsame location even before it knows what the location's address is.
1046*4882a593SmuzhiyunHowever, none of the architectures supported by the Linux kernel do
1047*4882a593Smuzhiyunthis.)
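
Returning to the R ->dep W ->rfi R' pattern, a bare-bones sketch of
it might look like this, with a data dependency playing the role of
dep.  The CPU may satisfy the final load by forwarding the value from
the store to y, but it cannot do so before the load of x has
executed, since until then it does not know what value will be
stored:

	int x, y;

	P0()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		WRITE_ONCE(y, r1);
		r2 = READ_ONCE(y);
	}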
1048*4882a593Smuzhiyun
1049*4882a593SmuzhiyunTwo memory accesses of the same location must always be executed in
1050*4882a593Smuzhiyunprogram order if the second access is a store.  Thus, if we have
1051*4882a593Smuzhiyun
1052*4882a593Smuzhiyun	R ->po-loc W
1053*4882a593Smuzhiyun
1054*4882a593Smuzhiyun(the po-loc link says that R comes before W in program order and they
1055*4882a593Smuzhiyunaccess the same location), the CPU is obliged to execute W after R.
1056*4882a593SmuzhiyunIf it executed W first then the memory subsystem would respond to R's
1057*4882a593Smuzhiyunread request with the value stored by W (or an even later store), in
1058*4882a593Smuzhiyunviolation of the read-write coherence rule.  Similarly, if we had
1059*4882a593Smuzhiyun
1060*4882a593Smuzhiyun	W ->po-loc W'
1061*4882a593Smuzhiyun
1062*4882a593Smuzhiyunand the CPU executed W' before W, then the memory subsystem would put
1063*4882a593SmuzhiyunW' before W in the coherence order.  It would effectively cause W to
1064*4882a593Smuzhiyunoverwrite W', in violation of the write-write coherence rule.
1065*4882a593Smuzhiyun(Interestingly, an early ARMv8 memory model, now obsolete, proposed
1066*4882a593Smuzhiyunallowing out-of-order writes like this to occur.  The model avoided
1067*4882a593Smuzhiyunviolating the write-write coherence rule by requiring the CPU not to
1068*4882a593Smuzhiyunsend the W write to the memory subsystem at all!)
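
Seen in terms of test outcomes, the R ->po-loc W requirement means a
load can never observe a po-later store performed by its own CPU.  In
a sketch like the following, r1 may be 0 or 1 but never 2:

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);
	}

	P1()
	{
		WRITE_ONCE(x, 1);
	}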
1069*4882a593Smuzhiyun
1070*4882a593Smuzhiyun
1071*4882a593SmuzhiyunAND THEN THERE WAS ALPHA
1072*4882a593Smuzhiyun------------------------
1073*4882a593Smuzhiyun
1074*4882a593SmuzhiyunAs mentioned above, the Alpha architecture is unique in that it does
1075*4882a593Smuzhiyunnot appear to respect address dependencies to loads.  This means that
1076*4882a593Smuzhiyuncode such as the following:
1077*4882a593Smuzhiyun
1078*4882a593Smuzhiyun	int x = 0;
1079*4882a593Smuzhiyun	int y = -1;
1080*4882a593Smuzhiyun	int *ptr = &y;
1081*4882a593Smuzhiyun
1082*4882a593Smuzhiyun	P0()
1083*4882a593Smuzhiyun	{
1084*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1085*4882a593Smuzhiyun		smp_wmb();
1086*4882a593Smuzhiyun		WRITE_ONCE(ptr, &x);
1087*4882a593Smuzhiyun	}
1088*4882a593Smuzhiyun
1089*4882a593Smuzhiyun	P1()
1090*4882a593Smuzhiyun	{
1091*4882a593Smuzhiyun		int *r1;
1092*4882a593Smuzhiyun		int r2;
1093*4882a593Smuzhiyun
1094*4882a593Smuzhiyun		r1 = ptr;
1095*4882a593Smuzhiyun		r2 = READ_ONCE(*r1);
1096*4882a593Smuzhiyun	}
1097*4882a593Smuzhiyun
1098*4882a593Smuzhiyuncan malfunction on Alpha systems (notice that P1 uses an ordinary load
1099*4882a593Smuzhiyunto read ptr instead of READ_ONCE()).  It is quite possible that r1 = &x
1100*4882a593Smuzhiyunand r2 = 0 at the end, in spite of the address dependency.
1101*4882a593Smuzhiyun
1102*4882a593SmuzhiyunAt first glance this doesn't seem to make sense.  We know that the
1103*4882a593Smuzhiyunsmp_wmb() forces P0's store to x to propagate to P1 before the store
1104*4882a593Smuzhiyunto ptr does.  And since P1 can't execute its second load
1105*4882a593Smuzhiyununtil it knows what location to load from, i.e., after executing its
1106*4882a593Smuzhiyunfirst load, the value x = 1 must have propagated to P1 before the
1107*4882a593Smuzhiyunsecond load executed.  So why doesn't r2 end up equal to 1?
1108*4882a593Smuzhiyun
1109*4882a593SmuzhiyunThe answer lies in the Alpha's split local caches.  Although the two
1110*4882a593Smuzhiyunstores do reach P1's local cache in the proper order, it can happen
1111*4882a593Smuzhiyunthat the first store is processed by a busy part of the cache while
1112*4882a593Smuzhiyunthe second store is processed by an idle part.  As a result, the x = 1
1113*4882a593Smuzhiyunvalue may not become available for P1's CPU to read until after the
1114*4882a593Smuzhiyunptr = &x value does, leading to the undesirable result above.  The
1115*4882a593Smuzhiyunfinal effect is that even though the two loads really are executed in
1116*4882a593Smuzhiyunprogram order, it appears that they aren't.
1117*4882a593Smuzhiyun
1118*4882a593SmuzhiyunThis could not have happened if the local cache had processed the
1119*4882a593Smuzhiyunincoming stores in FIFO order.  By contrast, other architectures
1120*4882a593Smuzhiyunmaintain at least the appearance of FIFO order.
1121*4882a593Smuzhiyun
1122*4882a593SmuzhiyunIn practice, this difficulty is solved by inserting a special fence
1123*4882a593Smuzhiyunbetween P1's two loads when the kernel is compiled for the Alpha
1124*4882a593Smuzhiyunarchitecture.  In fact, as of version 4.15, the kernel automatically
1125*4882a593Smuzhiyunadds this fence after every READ_ONCE() and atomic load on Alpha.  The
1126*4882a593Smuzhiyuneffect of the fence is to cause the CPU not to execute any po-later
1127*4882a593Smuzhiyuninstructions until after the local cache has finished processing all
1128*4882a593Smuzhiyunthe stores it has already received.  Thus, if the code was changed to:
1129*4882a593Smuzhiyun
1130*4882a593Smuzhiyun	P1()
1131*4882a593Smuzhiyun	{
1132*4882a593Smuzhiyun		int *r1;
1133*4882a593Smuzhiyun		int r2;
1134*4882a593Smuzhiyun
1135*4882a593Smuzhiyun		r1 = READ_ONCE(ptr);
1136*4882a593Smuzhiyun		r2 = READ_ONCE(*r1);
1137*4882a593Smuzhiyun	}
1138*4882a593Smuzhiyun
1139*4882a593Smuzhiyunthen we would never get r1 = &x and r2 = 0.  By the time P1 executed
1140*4882a593Smuzhiyunits second load, the x = 1 store would already be fully processed by
1141*4882a593Smuzhiyunthe local cache and available for satisfying the read request.  Thus
1142*4882a593Smuzhiyunwe have yet another reason why shared data should always be read with
1143*4882a593SmuzhiyunREAD_ONCE() or another synchronization primitive rather than accessed
1144*4882a593Smuzhiyundirectly.
1145*4882a593Smuzhiyun
1146*4882a593SmuzhiyunThe LKMM requires that smp_rmb(), acquire fences, and strong fences
1147*4882a593Smuzhiyunshare this property: They do not allow the CPU to execute any po-later
1148*4882a593Smuzhiyuninstructions (or po-later loads in the case of smp_rmb()) until all
1149*4882a593Smuzhiyunoutstanding stores have been processed by the local cache.  In the
1150*4882a593Smuzhiyuncase of a strong fence, the CPU first has to wait for all of its
1151*4882a593Smuzhiyunpo-earlier stores to propagate to every other CPU in the system; then
1152*4882a593Smuzhiyunit has to wait for the local cache to process all the stores received
1153*4882a593Smuzhiyunas of that time -- not just the stores received when the strong fence
1154*4882a593Smuzhiyunbegan.
1155*4882a593Smuzhiyun
1156*4882a593SmuzhiyunAnd of course, none of this matters for any architecture other than
1157*4882a593SmuzhiyunAlpha.
1158*4882a593Smuzhiyun
1159*4882a593Smuzhiyun
1160*4882a593SmuzhiyunTHE HAPPENS-BEFORE RELATION: hb
1161*4882a593Smuzhiyun-------------------------------
1162*4882a593Smuzhiyun
1163*4882a593SmuzhiyunThe happens-before relation (hb) links memory accesses that have to
1164*4882a593Smuzhiyunexecute in a certain order.  hb includes the ppo relation and two
1165*4882a593Smuzhiyunothers, one of which is rfe.
1166*4882a593Smuzhiyun
1167*4882a593SmuzhiyunW ->rfe R implies that W and R are on different CPUs.  It also means
1168*4882a593Smuzhiyunthat W's store must have propagated to R's CPU before R executed;
1169*4882a593Smuzhiyunotherwise R could not have read the value stored by W.  Therefore W
1170*4882a593Smuzhiyunmust have executed before R, and so we have W ->hb R.
1171*4882a593Smuzhiyun
1172*4882a593SmuzhiyunThe equivalent fact need not hold if W ->rfi R (i.e., W and R are on
1173*4882a593Smuzhiyunthe same CPU).  As we have already seen, the operational model allows
1174*4882a593SmuzhiyunW's value to be forwarded to R in such cases, meaning that R may well
1175*4882a593Smuzhiyunexecute before W does.
1176*4882a593Smuzhiyun
1177*4882a593SmuzhiyunIt's important to understand that neither coe nor fre is included in
1178*4882a593Smuzhiyunhb, despite their similarities to rfe.  For example, suppose we have
1179*4882a593SmuzhiyunW ->coe W'.  This means that W and W' are stores to the same location,
1180*4882a593Smuzhiyunthey execute on different CPUs, and W comes before W' in the coherence
1181*4882a593Smuzhiyunorder (i.e., W' overwrites W).  Nevertheless, it is possible for W' to
1182*4882a593Smuzhiyunexecute before W, because the decision as to which store overwrites
1183*4882a593Smuzhiyunthe other is made later by the memory subsystem.  When the stores are
1184*4882a593Smuzhiyunnearly simultaneous, either one can come out on top.  Similarly,
1185*4882a593SmuzhiyunR ->fre W means that W overwrites the value which R reads, but it
1186*4882a593Smuzhiyundoesn't mean that W has to execute after R.  All that's necessary is
1187*4882a593Smuzhiyunfor the memory subsystem not to propagate W to R's CPU until after R
1188*4882a593Smuzhiyunhas executed, which is possible if W executes shortly before R.
1189*4882a593Smuzhiyun
1190*4882a593SmuzhiyunThe third relation included in hb is like ppo, in that it only links
1191*4882a593Smuzhiyunevents that are on the same CPU.  However, it is more difficult to
1192*4882a593Smuzhiyunexplain, because it arises only indirectly from the requirement of
1193*4882a593Smuzhiyuncache coherence.  The relation is called prop, and it links two events
1194*4882a593Smuzhiyunon CPU C in situations where a store from some other CPU comes after
1195*4882a593Smuzhiyunthe first event in the coherence order and propagates to C before the
1196*4882a593Smuzhiyunsecond event executes.
1197*4882a593Smuzhiyun
1198*4882a593SmuzhiyunThis is best explained with some examples.  The simplest case looks
1199*4882a593Smuzhiyunlike this:
1200*4882a593Smuzhiyun
1201*4882a593Smuzhiyun	int x;
1202*4882a593Smuzhiyun
1203*4882a593Smuzhiyun	P0()
1204*4882a593Smuzhiyun	{
1205*4882a593Smuzhiyun		int r1;
1206*4882a593Smuzhiyun
1207*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1208*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1209*4882a593Smuzhiyun	}
1210*4882a593Smuzhiyun
1211*4882a593Smuzhiyun	P1()
1212*4882a593Smuzhiyun	{
1213*4882a593Smuzhiyun		WRITE_ONCE(x, 8);
1214*4882a593Smuzhiyun	}
1215*4882a593Smuzhiyun
1216*4882a593SmuzhiyunIf r1 = 8 at the end then P0's accesses must have executed in program
1217*4882a593Smuzhiyunorder.  We can deduce this from the operational model; if P0's load
1218*4882a593Smuzhiyunhad executed before its store then the value of the store would have
1219*4882a593Smuzhiyunbeen forwarded to the load, so r1 would have ended up equal to 1, not
1220*4882a593Smuzhiyun8.  In this case there is a prop link from P0's write event to its read
1221*4882a593Smuzhiyunevent, because P1's store came after P0's store in x's coherence
1222*4882a593Smuzhiyunorder, and P1's store propagated to P0 before P0's load executed.
1223*4882a593Smuzhiyun
1224*4882a593SmuzhiyunAn equally simple case involves two loads of the same location that
1225*4882a593Smuzhiyunread from different stores:
1226*4882a593Smuzhiyun
1227*4882a593Smuzhiyun	int x = 0;
1228*4882a593Smuzhiyun
1229*4882a593Smuzhiyun	P0()
1230*4882a593Smuzhiyun	{
1231*4882a593Smuzhiyun		int r1, r2;
1232*4882a593Smuzhiyun
1233*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1234*4882a593Smuzhiyun		r2 = READ_ONCE(x);
1235*4882a593Smuzhiyun	}
1236*4882a593Smuzhiyun
1237*4882a593Smuzhiyun	P1()
1238*4882a593Smuzhiyun	{
1239*4882a593Smuzhiyun		WRITE_ONCE(x, 9);
1240*4882a593Smuzhiyun	}
1241*4882a593Smuzhiyun
1242*4882a593SmuzhiyunIf r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
1243*4882a593Smuzhiyunin program order.  If the second load had executed before the first
1244*4882a593Smuzhiyunthen the x = 9 store must have been propagated to P0 before the first
1245*4882a593Smuzhiyunload executed, and so r1 would have been 9 rather than 0.  In this
1246*4882a593Smuzhiyuncase there is a prop link from P0's first read event to its second,
1247*4882a593Smuzhiyunbecause P1's store overwrote the value read by P0's first load, and
1248*4882a593SmuzhiyunP1's store propagated to P0 before P0's second load executed.
1249*4882a593Smuzhiyun
1250*4882a593SmuzhiyunLess trivial examples of prop all involve fences.  Unlike the simple
1251*4882a593Smuzhiyunexamples above, they can require that some instructions are executed
1252*4882a593Smuzhiyunout of program order.  This next one should look familiar:
1253*4882a593Smuzhiyun
1254*4882a593Smuzhiyun	int buf = 0, flag = 0;
1255*4882a593Smuzhiyun
1256*4882a593Smuzhiyun	P0()
1257*4882a593Smuzhiyun	{
1258*4882a593Smuzhiyun		WRITE_ONCE(buf, 1);
1259*4882a593Smuzhiyun		smp_wmb();
1260*4882a593Smuzhiyun		WRITE_ONCE(flag, 1);
1261*4882a593Smuzhiyun	}
1262*4882a593Smuzhiyun
1263*4882a593Smuzhiyun	P1()
1264*4882a593Smuzhiyun	{
1265*4882a593Smuzhiyun		int r1;
1266*4882a593Smuzhiyun		int r2;
1267*4882a593Smuzhiyun
1268*4882a593Smuzhiyun		r1 = READ_ONCE(flag);
1269*4882a593Smuzhiyun		r2 = READ_ONCE(buf);
1270*4882a593Smuzhiyun	}
1271*4882a593Smuzhiyun
1272*4882a593SmuzhiyunThis is the MP pattern again, with an smp_wmb() fence between the two
1273*4882a593Smuzhiyunstores.  If r1 = 1 and r2 = 0 at the end then there is a prop link
1274*4882a593Smuzhiyunfrom P1's second load to its first (backwards!).  The reason is
1275*4882a593Smuzhiyunsimilar to the previous examples: The value P1 loads from buf gets
1276*4882a593Smuzhiyunoverwritten by P0's store to buf, the fence guarantees that the store
1277*4882a593Smuzhiyunto buf will propagate to P1 before the store to flag does, and the
1278*4882a593Smuzhiyunstore to flag propagates to P1 before P1 reads flag.
1279*4882a593Smuzhiyun
1280*4882a593SmuzhiyunThe prop link says that in order to obtain the r1 = 1, r2 = 0 result,
1281*4882a593SmuzhiyunP1 must execute its second load before the first.  Indeed, if the load
1282*4882a593Smuzhiyunfrom flag were executed first, then the buf = 1 store would already
1283*4882a593Smuzhiyunhave propagated to P1 by the time P1's load from buf executed, so r2
1284*4882a593Smuzhiyunwould have been 1 at the end, not 0.  (The reasoning holds even for
1285*4882a593SmuzhiyunAlpha, although the details are more complicated and we will not go
1286*4882a593Smuzhiyuninto them.)
1287*4882a593Smuzhiyun
1288*4882a593SmuzhiyunBut what if we put an smp_rmb() fence between P1's loads?  The fence
1289*4882a593Smuzhiyunwould force the two loads to be executed in program order, and it
1290*4882a593Smuzhiyunwould generate a cycle in the hb relation: The fence would create a ppo
1291*4882a593Smuzhiyunlink (hence an hb link) from the first load to the second, and the
1292*4882a593Smuzhiyunprop relation would give an hb link from the second load to the first.
1293*4882a593SmuzhiyunSince an instruction can't execute before itself, we are forced to
1294*4882a593Smuzhiyunconclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0
1295*4882a593Smuzhiyunoutcome is impossible -- as it should be.
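
For reference, with the fence added P1 would read:

	P1()
	{
		int r1;
		int r2;

		r1 = READ_ONCE(flag);
		smp_rmb();
		r2 = READ_ONCE(buf);
	}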
1296*4882a593Smuzhiyun
1297*4882a593SmuzhiyunThe formal definition of the prop relation involves a coe or fre link,
1298*4882a593Smuzhiyunfollowed by an arbitrary number of cumul-fence links, ending with an
1299*4882a593Smuzhiyunrfe link.  You can concoct more exotic examples, containing more than
1300*4882a593Smuzhiyunone fence, although this quickly leads to diminishing returns in terms
1301*4882a593Smuzhiyunof complexity.  For instance, here's an example containing a coe link
1302*4882a593Smuzhiyunfollowed by two cumul-fences and an rfe link, utilizing the fact that
1303*4882a593Smuzhiyunrelease fences are A-cumulative:
1304*4882a593Smuzhiyun
1305*4882a593Smuzhiyun	int x, y, z;
1306*4882a593Smuzhiyun
1307*4882a593Smuzhiyun	P0()
1308*4882a593Smuzhiyun	{
1309*4882a593Smuzhiyun		int r0;
1310*4882a593Smuzhiyun
1311*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1312*4882a593Smuzhiyun		r0 = READ_ONCE(z);
1313*4882a593Smuzhiyun	}
1314*4882a593Smuzhiyun
1315*4882a593Smuzhiyun	P1()
1316*4882a593Smuzhiyun	{
1317*4882a593Smuzhiyun		WRITE_ONCE(x, 2);
1318*4882a593Smuzhiyun		smp_wmb();
1319*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1320*4882a593Smuzhiyun	}
1321*4882a593Smuzhiyun
1322*4882a593Smuzhiyun	P2()
1323*4882a593Smuzhiyun	{
1324*4882a593Smuzhiyun		int r2;
1325*4882a593Smuzhiyun
1326*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1327*4882a593Smuzhiyun		smp_store_release(&z, 1);
1328*4882a593Smuzhiyun	}
1329*4882a593Smuzhiyun
1330*4882a593SmuzhiyunIf x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
1331*4882a593Smuzhiyunlink from P0's store to its load.  This is because P0's store gets
1332*4882a593Smuzhiyunoverwritten by P1's store since x = 2 at the end (a coe link), the
1333*4882a593Smuzhiyunsmp_wmb() ensures that P1's store to x propagates to P2 before the
1334*4882a593Smuzhiyunstore to y does (the first cumul-fence), the store to y propagates to P2
1335*4882a593Smuzhiyunbefore P2's load and store execute, P2's smp_store_release()
1336*4882a593Smuzhiyunguarantees that the stores to x and y both propagate to P0 before the
1337*4882a593Smuzhiyunstore to z does (the second cumul-fence), and P0's load executes after the
1338*4882a593Smuzhiyunstore to z has propagated to P0 (an rfe link).
1339*4882a593Smuzhiyun
1340*4882a593SmuzhiyunIn summary, the fact that the hb relation links memory access events
1341*4882a593Smuzhiyunin the order they execute means that it must not have cycles.  This
1342*4882a593Smuzhiyunrequirement is the content of the LKMM's "happens-before" axiom.
1343*4882a593Smuzhiyun
1344*4882a593SmuzhiyunThe LKMM defines yet another relation connected to times of
1345*4882a593Smuzhiyuninstruction execution, but it is not included in hb.  It relies on the
1346*4882a593Smuzhiyunparticular properties of strong fences, which we cover in the next
1347*4882a593Smuzhiyunsection.
1348*4882a593Smuzhiyun
1349*4882a593Smuzhiyun
1350*4882a593SmuzhiyunTHE PROPAGATES-BEFORE RELATION: pb
1351*4882a593Smuzhiyun----------------------------------
1352*4882a593Smuzhiyun
1353*4882a593SmuzhiyunThe propagates-before (pb) relation capitalizes on the special
1354*4882a593Smuzhiyunfeatures of strong fences.  It links two events E and F whenever some
1355*4882a593Smuzhiyunstore is coherence-later than E and propagates to every CPU and to RAM
1356*4882a593Smuzhiyunbefore F executes.  The formal definition requires that E be linked to
1357*4882a593SmuzhiyunF via a coe or fre link, an arbitrary number of cumul-fences, an
1358*4882a593Smuzhiyunoptional rfe link, a strong fence, and an arbitrary number of hb
1359*4882a593Smuzhiyunlinks.  Let's see how this definition works out.
1360*4882a593Smuzhiyun
1361*4882a593SmuzhiyunConsider first the case where E is a store (implying that the sequence
1362*4882a593Smuzhiyunof links begins with coe).  Then there are events W, X, Y, and Z such
1363*4882a593Smuzhiyunthat:
1364*4882a593Smuzhiyun
1365*4882a593Smuzhiyun	E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,
1366*4882a593Smuzhiyun
1367*4882a593Smuzhiyunwhere the * suffix indicates an arbitrary number of links of the
1368*4882a593Smuzhiyunspecified type, and the ? suffix indicates the link is optional (Y may
1369*4882a593Smuzhiyunbe equal to X).  Because of the cumul-fence links, we know that W will
1370*4882a593Smuzhiyunpropagate to Y's CPU before X does, hence before Y executes and hence
1371*4882a593Smuzhiyunbefore the strong fence executes.  Because this fence is strong, we
1372*4882a593Smuzhiyunknow that W will propagate to every CPU and to RAM before Z executes.
1373*4882a593SmuzhiyunAnd because of the hb links, we know that Z will execute before F.
1374*4882a593SmuzhiyunThus W, which comes later than E in the coherence order, will
1375*4882a593Smuzhiyunpropagate to every CPU and to RAM before F executes.
1376*4882a593Smuzhiyun
1377*4882a593SmuzhiyunThe case where E is a load is exactly the same, except that the first
1378*4882a593Smuzhiyunlink in the sequence is fre instead of coe.
1379*4882a593Smuzhiyun
1380*4882a593SmuzhiyunThe existence of a pb link from E to F implies that E must execute
1381*4882a593Smuzhiyunbefore F.  To see why, suppose that F executed first.  Then W would
1382*4882a593Smuzhiyunhave propagated to E's CPU before E executed.  If E was a store, the
1383*4882a593Smuzhiyunmemory subsystem would then be forced to make E come after W in the
1384*4882a593Smuzhiyuncoherence order, contradicting the fact that E ->coe W.  If E was a
1385*4882a593Smuzhiyunload, the memory subsystem would then be forced to satisfy E's read
1386*4882a593Smuzhiyunrequest with the value stored by W or an even later store,
1387*4882a593Smuzhiyuncontradicting the fact that E ->fre W.
1388*4882a593Smuzhiyun
1389*4882a593SmuzhiyunA good example illustrating how pb works is the SB pattern with strong
1390*4882a593Smuzhiyunfences:
1391*4882a593Smuzhiyun
1392*4882a593Smuzhiyun	int x = 0, y = 0;
1393*4882a593Smuzhiyun
1394*4882a593Smuzhiyun	P0()
1395*4882a593Smuzhiyun	{
1396*4882a593Smuzhiyun		int r0;
1397*4882a593Smuzhiyun
1398*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1399*4882a593Smuzhiyun		smp_mb();
1400*4882a593Smuzhiyun		r0 = READ_ONCE(y);
1401*4882a593Smuzhiyun	}
1402*4882a593Smuzhiyun
1403*4882a593Smuzhiyun	P1()
1404*4882a593Smuzhiyun	{
1405*4882a593Smuzhiyun		int r1;
1406*4882a593Smuzhiyun
1407*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1408*4882a593Smuzhiyun		smp_mb();
1409*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1410*4882a593Smuzhiyun	}
1411*4882a593Smuzhiyun
1412*4882a593SmuzhiyunIf r0 = 0 at the end then there is a pb link from P0's load to P1's
1413*4882a593Smuzhiyunload: an fre link from P0's load to P1's store (which overwrites the
1414*4882a593Smuzhiyunvalue read by P0), and a strong fence between P1's store and its load.
1415*4882a593SmuzhiyunIn this example, the sequences of cumul-fence and hb links are empty.
1416*4882a593SmuzhiyunNote that this pb link is not included in hb as an instance of prop,
1417*4882a593Smuzhiyunbecause it does not start and end on the same CPU.
1418*4882a593Smuzhiyun
1419*4882a593SmuzhiyunSimilarly, if r1 = 0 at the end then there is a pb link from P1's load
1420*4882a593Smuzhiyunto P0's.  This means that if both r0 and r1 were 0 there would be a
1421*4882a593Smuzhiyuncycle in pb, which is not possible since an instruction cannot execute
1422*4882a593Smuzhiyunbefore itself.  Thus, adding smp_mb() fences to the SB pattern
1423*4882a593Smuzhiyunprevents the r0 = 0, r1 = 0 outcome.
1424*4882a593Smuzhiyun
1425*4882a593SmuzhiyunIn summary, the fact that the pb relation links events in the order
1426*4882a593Smuzhiyunthey execute means that it cannot have cycles.  This requirement is
1427*4882a593Smuzhiyunthe content of the LKMM's "propagation" axiom.
1428*4882a593Smuzhiyun
1429*4882a593Smuzhiyun
1430*4882a593SmuzhiyunRCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
1431*4882a593Smuzhiyun------------------------------------------------------------------------
1432*4882a593Smuzhiyun
1433*4882a593SmuzhiyunRCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
1434*4882a593Smuzhiyunrests on two concepts: grace periods and read-side critical sections.
1435*4882a593Smuzhiyun
1436*4882a593SmuzhiyunA grace period is the span of time occupied by a call to
1437*4882a593Smuzhiyunsynchronize_rcu().  A read-side critical section (or just critical
1438*4882a593Smuzhiyunsection, for short) is a region of code delimited by rcu_read_lock()
1439*4882a593Smuzhiyunat the start and rcu_read_unlock() at the end.  Critical sections can
1440*4882a593Smuzhiyunbe nested, although we won't make use of this fact.
1441*4882a593Smuzhiyun
1442*4882a593SmuzhiyunAs far as memory models are concerned, RCU's main feature is its
1443*4882a593SmuzhiyunGrace-Period Guarantee, which states that a critical section can never
1444*4882a593Smuzhiyunspan a full grace period.  In more detail, the Guarantee says:
1445*4882a593Smuzhiyun
1446*4882a593Smuzhiyun	For any critical section C and any grace period G, at least
1447*4882a593Smuzhiyun	one of the following statements must hold:
1448*4882a593Smuzhiyun
1449*4882a593Smuzhiyun(1)	C ends before G does, and in addition, every store that
1450*4882a593Smuzhiyun	propagates to C's CPU before the end of C must propagate to
1451*4882a593Smuzhiyun	every CPU before G ends.
1452*4882a593Smuzhiyun
1453*4882a593Smuzhiyun(2)	G starts before C does, and in addition, every store that
1454*4882a593Smuzhiyun	propagates to G's CPU before the start of G must propagate
1455*4882a593Smuzhiyun	to every CPU before C starts.
1456*4882a593Smuzhiyun
1457*4882a593SmuzhiyunIn particular, it is not possible for a critical section to both start
1458*4882a593Smuzhiyunbefore and end after a grace period.
1459*4882a593Smuzhiyun
1460*4882a593SmuzhiyunHere is a simple example of RCU in action:
1461*4882a593Smuzhiyun
1462*4882a593Smuzhiyun	int x, y;
1463*4882a593Smuzhiyun
1464*4882a593Smuzhiyun	P0()
1465*4882a593Smuzhiyun	{
1466*4882a593Smuzhiyun		rcu_read_lock();
1467*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1468*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1469*4882a593Smuzhiyun		rcu_read_unlock();
1470*4882a593Smuzhiyun	}
1471*4882a593Smuzhiyun
1472*4882a593Smuzhiyun	P1()
1473*4882a593Smuzhiyun	{
1474*4882a593Smuzhiyun		int r1, r2;
1475*4882a593Smuzhiyun
1476*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1477*4882a593Smuzhiyun		synchronize_rcu();
1478*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1479*4882a593Smuzhiyun	}
1480*4882a593Smuzhiyun
1481*4882a593SmuzhiyunThe Grace Period Guarantee tells us that when this code runs, it will
1482*4882a593Smuzhiyunnever end with r1 = 1 and r2 = 0.  The reasoning is as follows.  r1 = 1
1483*4882a593Smuzhiyunmeans that P0's store to x propagated to P1 before P1 called
1484*4882a593Smuzhiyunsynchronize_rcu(), so P0's critical section must have started before
1485*4882a593SmuzhiyunP1's grace period, contrary to part (2) of the Guarantee.  On the
1486*4882a593Smuzhiyunother hand, r2 = 0 means that P0's store to y, which occurs before the
1487*4882a593Smuzhiyunend of the critical section, did not propagate to P1 before the end of
1488*4882a593Smuzhiyunthe grace period, contrary to part (1).  Together the results violate
1489*4882a593Smuzhiyunthe Guarantee.
1490*4882a593Smuzhiyun
1491*4882a593SmuzhiyunIn the kernel's implementations of RCU, the requirements for stores
1492*4882a593Smuzhiyunto propagate to every CPU are fulfilled by placing strong fences at
1493*4882a593Smuzhiyunsuitable places in the RCU-related code.  Thus, if a critical section
1494*4882a593Smuzhiyunstarts before a grace period does then the critical section's CPU will
1495*4882a593Smuzhiyunexecute an smp_mb() fence after the end of the critical section and
1496*4882a593Smuzhiyunsome time before the grace period's synchronize_rcu() call returns.
1497*4882a593SmuzhiyunAnd if a critical section ends after a grace period does then the
1498*4882a593Smuzhiyunsynchronize_rcu() routine will execute an smp_mb() fence at its start
1499*4882a593Smuzhiyunand some time before the critical section's opening rcu_read_lock()
1500*4882a593Smuzhiyunexecutes.
1501*4882a593Smuzhiyun
1502*4882a593SmuzhiyunWhat exactly do we mean by saying that a critical section "starts
1503*4882a593Smuzhiyunbefore" or "ends after" a grace period?  Some aspects of the meaning
1504*4882a593Smuzhiyunare pretty obvious, as in the example above, but the details aren't
1505*4882a593Smuzhiyunentirely clear.  The LKMM formalizes this notion by means of the
1506*4882a593Smuzhiyunrcu-link relation.  rcu-link encompasses a very general notion of
1507*4882a593Smuzhiyun"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
1508*4882a593Smuzhiyunrcu_read_unlock(), or synchronize_rcu()) then among other things,
1509*4882a593SmuzhiyunE ->rcu-link F includes cases where E is po-before some memory-access
1510*4882a593Smuzhiyunevent X, F is po-after some memory-access event Y, and we have any of
1511*4882a593SmuzhiyunX ->rfe Y, X ->co Y, or X ->fr Y.
1512*4882a593Smuzhiyun
1513*4882a593SmuzhiyunThe formal definition of the rcu-link relation is more than a little
1514*4882a593Smuzhiyunobscure, and we won't give it here.  It is closely related to the pb
1515*4882a593Smuzhiyunrelation, and the details don't matter unless you want to comb through
1516*4882a593Smuzhiyuna somewhat lengthy formal proof.  Pretty much all you need to know
1517*4882a593Smuzhiyunabout rcu-link is the information in the preceding paragraph.
1518*4882a593Smuzhiyun
1519*4882a593SmuzhiyunThe LKMM also defines the rcu-gp and rcu-rscsi relations.  They bring
1520*4882a593Smuzhiyungrace periods and read-side critical sections into the picture, in the
1521*4882a593Smuzhiyunfollowing way:
1522*4882a593Smuzhiyun
1523*4882a593Smuzhiyun	E ->rcu-gp F means that E and F are in fact the same event,
1524*4882a593Smuzhiyun	and that event is a synchronize_rcu() fence (i.e., a grace
1525*4882a593Smuzhiyun	period).
1526*4882a593Smuzhiyun
1527*4882a593Smuzhiyun	E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
1528*4882a593Smuzhiyun	and rcu_read_lock() fence events delimiting some read-side
1529*4882a593Smuzhiyun	critical section.  (The 'i' at the end of the name emphasizes
1530*4882a593Smuzhiyun	that this relation is "inverted": It links the end of the
1531*4882a593Smuzhiyun	critical section to the start.)
1532*4882a593Smuzhiyun
1533*4882a593SmuzhiyunIf we think of the rcu-link relation as standing for an extended
1534*4882a593Smuzhiyun"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
1535*4882a593Smuzhiyungrace period which ends before Z begins.  (In fact it covers more than
1536*4882a593Smuzhiyunthis, because it also includes cases where some store propagates to
1537*4882a593SmuzhiyunZ's CPU before Z begins but doesn't propagate to some other CPU until
1538*4882a593Smuzhiyunafter X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
1539*4882a593Smuzhiyunthe end of a critical section which starts before Z begins.
1540*4882a593Smuzhiyun
1541*4882a593SmuzhiyunThe LKMM goes on to define the rcu-order relation as a sequence of
1542*4882a593Smuzhiyunrcu-gp and rcu-rscsi links separated by rcu-link links, in which the
1543*4882a593Smuzhiyunnumber of rcu-gp links is >= the number of rcu-rscsi links.  For
1544*4882a593Smuzhiyunexample:
1545*4882a593Smuzhiyun
1546*4882a593Smuzhiyun	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
1547*4882a593Smuzhiyun
1548*4882a593Smuzhiyunwould imply that X ->rcu-order V, because this sequence contains two
1549*4882a593Smuzhiyunrcu-gp links and one rcu-rscsi link.  (It also implies that
1550*4882a593SmuzhiyunX ->rcu-order T and Z ->rcu-order V.)  On the other hand:
1551*4882a593Smuzhiyun
1552*4882a593Smuzhiyun	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
1553*4882a593Smuzhiyun
1554*4882a593Smuzhiyundoes not imply X ->rcu-order V, because the sequence contains only
1555*4882a593Smuzhiyunone rcu-gp link but two rcu-rscsi links.
1556*4882a593Smuzhiyun
1557*4882a593SmuzhiyunThe rcu-order relation is important because the Grace Period Guarantee
1558*4882a593Smuzhiyunmeans that rcu-order links act kind of like strong fences.  In
1559*4882a593Smuzhiyunparticular, E ->rcu-order F implies not only that E begins before F
1560*4882a593Smuzhiyunends, but also that any write po-before E will propagate to every CPU
1561*4882a593Smuzhiyunbefore any instruction po-after F can execute.  (However, it does not
1562*4882a593Smuzhiyunimply that E must execute before F; in fact, each synchronize_rcu()
1563*4882a593Smuzhiyunfence event is linked to itself by rcu-order as a degenerate case.)
1564*4882a593Smuzhiyun
1565*4882a593SmuzhiyunTo prove this in full generality requires some intellectual effort.
1566*4882a593SmuzhiyunWe'll consider just a very simple case:
1567*4882a593Smuzhiyun
1568*4882a593Smuzhiyun	G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.
1569*4882a593Smuzhiyun
1570*4882a593SmuzhiyunThis formula means that G and W are the same event (a grace period),
1571*4882a593Smuzhiyunand there are events X, Y and a read-side critical section C such that:
1572*4882a593Smuzhiyun
1573*4882a593Smuzhiyun	1. G = W is po-before or equal to X;
1574*4882a593Smuzhiyun
1575*4882a593Smuzhiyun	2. X comes "before" Y in some sense (including rfe, co and fr);
1576*4882a593Smuzhiyun
1577*4882a593Smuzhiyun	3. Y is po-before Z;
1578*4882a593Smuzhiyun
1579*4882a593Smuzhiyun	4. Z is the rcu_read_unlock() event marking the end of C;
1580*4882a593Smuzhiyun
1581*4882a593Smuzhiyun	5. F is the rcu_read_lock() event marking the start of C.
1582*4882a593Smuzhiyun
1583*4882a593SmuzhiyunFrom 1 - 4 we deduce that the grace period G ends before the critical
1584*4882a593Smuzhiyunsection C.  Then part (2) of the Grace Period Guarantee says not only
1585*4882a593Smuzhiyunthat G starts before C does, but also that any write which executes on
1586*4882a593SmuzhiyunG's CPU before G starts must propagate to every CPU before C starts.
1587*4882a593SmuzhiyunIn particular, the write propagates to every CPU before F finishes
1588*4882a593Smuzhiyunexecuting and hence before any instruction po-after F can execute.
1589*4882a593SmuzhiyunThis sort of reasoning can be extended to handle all the situations
1590*4882a593Smuzhiyuncovered by rcu-order.
1591*4882a593Smuzhiyun
1592*4882a593SmuzhiyunThe rcu-fence relation is a simple extension of rcu-order.  While
1593*4882a593Smuzhiyunrcu-order only links certain fence events (calls to synchronize_rcu(),
1594*4882a593Smuzhiyunrcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
1595*4882a593Smuzhiyunthat are separated by an rcu-order link.  This is analogous to the way
1596*4882a593Smuzhiyunthe strong-fence relation links events that are separated by an
1597*4882a593Smuzhiyunsmp_mb() fence event (as mentioned above, rcu-order links act kind of
1598*4882a593Smuzhiyunlike strong fences).  Written symbolically, X ->rcu-fence Y means
1599*4882a593Smuzhiyunthere are fence events E and F such that:
1600*4882a593Smuzhiyun
1601*4882a593Smuzhiyun	X ->po E ->rcu-order F ->po Y.
1602*4882a593Smuzhiyun
1603*4882a593SmuzhiyunFrom the discussion above, we see this implies not only that X
1604*4882a593Smuzhiyunexecutes before Y, but also (if X is a store) that X propagates to
1605*4882a593Smuzhiyunevery CPU before Y executes.  Thus rcu-fence is sort of a
1606*4882a593Smuzhiyun"super-strong" fence: Unlike the original strong fences (smp_mb() and
1607*4882a593Smuzhiyunsynchronize_rcu()), rcu-fence is able to link events on different
1608*4882a593SmuzhiyunCPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
1609*4882a593Smuzhiyunreally a fence at all!)
1610*4882a593Smuzhiyun
1611*4882a593SmuzhiyunFinally, the LKMM defines the RCU-before (rb) relation in terms of
1612*4882a593Smuzhiyunrcu-fence.  This is done in essentially the same way as the pb
1613*4882a593Smuzhiyunrelation was defined in terms of strong-fence.  We will omit the
1614*4882a593Smuzhiyundetails; the end result is that E ->rb F implies E must execute
1615*4882a593Smuzhiyunbefore F, just as E ->pb F does (and for much the same reasons).
1616*4882a593Smuzhiyun
1617*4882a593SmuzhiyunPutting this all together, the LKMM expresses the Grace Period
1618*4882a593SmuzhiyunGuarantee by requiring that the rb relation does not contain a cycle.
1619*4882a593SmuzhiyunEquivalently, this "rcu" axiom requires that there are no events E
1620*4882a593Smuzhiyunand F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
1621*4882a593Smuzhiyunthe axiom requires that there are no cycles consisting of rcu-gp and
1622*4882a593Smuzhiyunrcu-rscsi alternating with rcu-link, where the number of rcu-gp links
1623*4882a593Smuzhiyunis >= the number of rcu-rscsi links.
1624*4882a593Smuzhiyun
1625*4882a593SmuzhiyunJustifying the axiom isn't easy, but it is in fact a valid
1626*4882a593Smuzhiyunformalization of the Grace Period Guarantee.  We won't attempt to go
1627*4882a593Smuzhiyunthrough the detailed argument, but the following analysis gives a
1628*4882a593Smuzhiyuntaste of what is involved.  Suppose both parts of the Guarantee are
1629*4882a593Smuzhiyunviolated: A critical section starts before a grace period, and some
1630*4882a593Smuzhiyunstore propagates to the critical section's CPU before the end of the
1631*4882a593Smuzhiyuncritical section but doesn't propagate to some other CPU until after
1632*4882a593Smuzhiyunthe end of the grace period.
1633*4882a593Smuzhiyun
1634*4882a593SmuzhiyunPutting symbols to these ideas, let L and U be the rcu_read_lock() and
1635*4882a593Smuzhiyunrcu_read_unlock() fence events delimiting the critical section in
1636*4882a593Smuzhiyunquestion, and let S be the synchronize_rcu() fence event for the grace
1637*4882a593Smuzhiyunperiod.  Saying that the critical section starts before S means there
1638*4882a593Smuzhiyunare events Q and R where Q is po-after L (which marks the start of the
1639*4882a593Smuzhiyuncritical section), Q is "before" R in the sense used by the rcu-link
1640*4882a593Smuzhiyunrelation, and R is po-before the grace period S.  Thus we have:
1641*4882a593Smuzhiyun
1642*4882a593Smuzhiyun	L ->rcu-link S.
1643*4882a593Smuzhiyun
1644*4882a593SmuzhiyunLet W be the store mentioned above, let Y come before the end of the
1645*4882a593Smuzhiyuncritical section and witness that W propagates to the critical
1646*4882a593Smuzhiyunsection's CPU by reading from W, and let Z on some arbitrary CPU be a
1647*4882a593Smuzhiyunwitness that W has not propagated to that CPU, where Z happens after
1648*4882a593Smuzhiyunsome event X which is po-after S.  Symbolically, this amounts to:
1649*4882a593Smuzhiyun
1650*4882a593Smuzhiyun	S ->po X ->hb* Z ->fr W ->rf Y ->po U.
1651*4882a593Smuzhiyun
1652*4882a593SmuzhiyunThe fr link from Z to W indicates that W has not propagated to Z's CPU
1653*4882a593Smuzhiyunat the time that Z executes.  From this, it can be shown (see the
1654*4882a593Smuzhiyundiscussion of the rcu-link relation earlier) that S and U are related
1655*4882a593Smuzhiyunby rcu-link:
1656*4882a593Smuzhiyun
1657*4882a593Smuzhiyun	S ->rcu-link U.
1658*4882a593Smuzhiyun
1659*4882a593SmuzhiyunSince S is a grace period we have S ->rcu-gp S, and since L and U are
1660*4882a593Smuzhiyunthe start and end of the critical section C we have U ->rcu-rscsi L.
1661*4882a593SmuzhiyunFrom this we obtain:
1662*4882a593Smuzhiyun
1663*4882a593Smuzhiyun	S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,
1664*4882a593Smuzhiyun
1665*4882a593Smuzhiyuna forbidden cycle.  Thus the "rcu" axiom rules out this violation of
1666*4882a593Smuzhiyunthe Grace Period Guarantee.
1667*4882a593Smuzhiyun
1668*4882a593SmuzhiyunFor something a little more down-to-earth, let's see how the axiom
1669*4882a593Smuzhiyunworks out in practice.  Consider the RCU code example from above, this
1670*4882a593Smuzhiyuntime with statement labels added:
1671*4882a593Smuzhiyun
1672*4882a593Smuzhiyun	int x, y;
1673*4882a593Smuzhiyun
1674*4882a593Smuzhiyun	P0()
1675*4882a593Smuzhiyun	{
1676*4882a593Smuzhiyun		L: rcu_read_lock();
1677*4882a593Smuzhiyun		X: WRITE_ONCE(x, 1);
1678*4882a593Smuzhiyun		Y: WRITE_ONCE(y, 1);
1679*4882a593Smuzhiyun		U: rcu_read_unlock();
1680*4882a593Smuzhiyun	}
1681*4882a593Smuzhiyun
1682*4882a593Smuzhiyun	P1()
1683*4882a593Smuzhiyun	{
1684*4882a593Smuzhiyun		int r1, r2;
1685*4882a593Smuzhiyun
1686*4882a593Smuzhiyun		Z: r1 = READ_ONCE(x);
1687*4882a593Smuzhiyun		S: synchronize_rcu();
1688*4882a593Smuzhiyun		W: r2 = READ_ONCE(y);
1689*4882a593Smuzhiyun	}
1690*4882a593Smuzhiyun
1691*4882a593Smuzhiyun
1692*4882a593SmuzhiyunIf r2 = 0 at the end then P0's store at Y overwrites the value that
1693*4882a593SmuzhiyunP1's load at W reads from, so we have W ->fre Y.  Since S ->po W and
1694*4882a593Smuzhiyunalso Y ->po U, we get S ->rcu-link U.  In addition, S ->rcu-gp S
1695*4882a593Smuzhiyunbecause S is a grace period.
1696*4882a593Smuzhiyun
1697*4882a593SmuzhiyunIf r1 = 1 at the end then P1's load at Z reads from P0's store at X,
1698*4882a593Smuzhiyunso we have X ->rfe Z.  Together with L ->po X and Z ->po S, this
1699*4882a593Smuzhiyunyields L ->rcu-link S.  And since L and U are the start and end of a
1700*4882a593Smuzhiyuncritical section, we have U ->rcu-rscsi L.
1701*4882a593Smuzhiyun
1702*4882a593SmuzhiyunThen U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
1703*4882a593Smuzhiyunforbidden cycle, violating the "rcu" axiom.  Hence the outcome is not
1704*4882a593Smuzhiyunallowed by the LKMM, as we would expect.
1705*4882a593Smuzhiyun
1706*4882a593SmuzhiyunFor contrast, let's see what can happen in a more complicated example:
1707*4882a593Smuzhiyun
1708*4882a593Smuzhiyun	int x, y, z;
1709*4882a593Smuzhiyun
1710*4882a593Smuzhiyun	P0()
1711*4882a593Smuzhiyun	{
1712*4882a593Smuzhiyun		int r0;
1713*4882a593Smuzhiyun
1714*4882a593Smuzhiyun		L0: rcu_read_lock();
1715*4882a593Smuzhiyun		    r0 = READ_ONCE(x);
1716*4882a593Smuzhiyun		    WRITE_ONCE(y, 1);
1717*4882a593Smuzhiyun		U0: rcu_read_unlock();
1718*4882a593Smuzhiyun	}
1719*4882a593Smuzhiyun
1720*4882a593Smuzhiyun	P1()
1721*4882a593Smuzhiyun	{
1722*4882a593Smuzhiyun		int r1;
1723*4882a593Smuzhiyun
1724*4882a593Smuzhiyun		    r1 = READ_ONCE(y);
1725*4882a593Smuzhiyun		S1: synchronize_rcu();
1726*4882a593Smuzhiyun		    WRITE_ONCE(z, 1);
1727*4882a593Smuzhiyun	}
1728*4882a593Smuzhiyun
1729*4882a593Smuzhiyun	P2()
1730*4882a593Smuzhiyun	{
1731*4882a593Smuzhiyun		int r2;
1732*4882a593Smuzhiyun
1733*4882a593Smuzhiyun		L2: rcu_read_lock();
1734*4882a593Smuzhiyun		    r2 = READ_ONCE(z);
1735*4882a593Smuzhiyun		    WRITE_ONCE(x, 1);
1736*4882a593Smuzhiyun		U2: rcu_read_unlock();
1737*4882a593Smuzhiyun	}
1738*4882a593Smuzhiyun
1739*4882a593SmuzhiyunIf r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
1740*4882a593Smuzhiyunthat U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi
1741*4882a593SmuzhiyunL2 ->rcu-link U0.  However this cycle is not forbidden, because the
1742*4882a593Smuzhiyunsequence of relations contains fewer instances of rcu-gp (one) than of
1743*4882a593Smuzhiyunrcu-rscsi (two).  Consequently the outcome is allowed by the LKMM.
1744*4882a593SmuzhiyunThe following instruction timing diagram shows how it might actually
1745*4882a593Smuzhiyunoccur:
1746*4882a593Smuzhiyun
1747*4882a593SmuzhiyunP0			P1			P2
1748*4882a593Smuzhiyun--------------------	--------------------	--------------------
1749*4882a593Smuzhiyunrcu_read_lock()
1750*4882a593SmuzhiyunWRITE_ONCE(y, 1)
1751*4882a593Smuzhiyun			r1 = READ_ONCE(y)
1752*4882a593Smuzhiyun			synchronize_rcu() starts
1753*4882a593Smuzhiyun			.			rcu_read_lock()
1754*4882a593Smuzhiyun			.			WRITE_ONCE(x, 1)
1755*4882a593Smuzhiyunr0 = READ_ONCE(x)	.
1756*4882a593Smuzhiyunrcu_read_unlock()	.
1757*4882a593Smuzhiyun			synchronize_rcu() ends
1758*4882a593Smuzhiyun			WRITE_ONCE(z, 1)
1759*4882a593Smuzhiyun						r2 = READ_ONCE(z)
1760*4882a593Smuzhiyun						rcu_read_unlock()
1761*4882a593Smuzhiyun
1762*4882a593SmuzhiyunThis requires P0 and P2 to execute their loads and stores out of
1763*4882a593Smuzhiyunprogram order, but of course they are allowed to do so.  And as you
1764*4882a593Smuzhiyuncan see, the Grace Period Guarantee is not violated: The critical
1765*4882a593Smuzhiyunsection in P0 both starts before P1's grace period does and ends
1766*4882a593Smuzhiyunbefore it does, and the critical section in P2 both starts after P1's
1767*4882a593Smuzhiyungrace period does and ends after it does.
1768*4882a593Smuzhiyun
1769*4882a593SmuzhiyunAddendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in
1770*4882a593Smuzhiyunaddition to normal RCU.  The ideas involved are much the same as
1771*4882a593Smuzhiyunabove, with new relations srcu-gp and srcu-rscsi added to represent
1772*4882a593SmuzhiyunSRCU grace periods and read-side critical sections.  There is a
1773*4882a593Smuzhiyunrestriction on the srcu-gp and srcu-rscsi links that can appear in an
1774*4882a593Smuzhiyunrcu-order sequence (the srcu-rscsi links must be paired with srcu-gp
1775*4882a593Smuzhiyunlinks having the same SRCU domain with proper nesting); the details
1776*4882a593Smuzhiyunare relatively unimportant.
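
For instance, an SRCU analog of the earlier simple RCU example might
look roughly like this, with both the read-side critical section and
the grace period using a single (illustrative) srcu_struct named ss.
Just as in the RCU version, the r1 = 1, r2 = 0 outcome is forbidden:

	struct srcu_struct ss;
	int x, y;

	P0()
	{
		int idx;

		idx = srcu_read_lock(&ss);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		srcu_read_unlock(&ss, idx);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		synchronize_srcu(&ss);
		r2 = READ_ONCE(y);
	}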
1777*4882a593Smuzhiyun
1778*4882a593Smuzhiyun
1779*4882a593SmuzhiyunLOCKING
1780*4882a593Smuzhiyun-------
1781*4882a593Smuzhiyun
1782*4882a593SmuzhiyunThe LKMM includes locking.  In fact, there is special code for locking
1783*4882a593Smuzhiyunin the formal model, added in order to make tools run faster.
1784*4882a593SmuzhiyunHowever, this special code is intended to be more or less equivalent
1785*4882a593Smuzhiyunto concepts we have already covered.  A spinlock_t variable is treated
1786*4882a593Smuzhiyunthe same as an int, and spin_lock(&s) is treated almost the same as:
1787*4882a593Smuzhiyun
1788*4882a593Smuzhiyun	while (cmpxchg_acquire(&s, 0, 1) != 0)
1789*4882a593Smuzhiyun		cpu_relax();
1790*4882a593Smuzhiyun
1791*4882a593SmuzhiyunThis waits until s is equal to 0 and then atomically sets it to 1,
1792*4882a593Smuzhiyunand the read part of the cmpxchg operation acts as an acquire fence.
1793*4882a593SmuzhiyunAn alternate way to express the same thing would be:
1794*4882a593Smuzhiyun
1795*4882a593Smuzhiyun	r = xchg_acquire(&s, 1);
1796*4882a593Smuzhiyun
1797*4882a593Smuzhiyunalong with a requirement that at the end, r = 0.  Similarly,
1798*4882a593Smuzhiyunspin_trylock(&s) is treated almost the same as:
1799*4882a593Smuzhiyun
1800*4882a593Smuzhiyun	return !cmpxchg_acquire(&s, 0, 1);
1801*4882a593Smuzhiyun
1802*4882a593Smuzhiyunwhich atomically sets s to 1 if it is currently equal to 0 and returns
1803*4882a593Smuzhiyuntrue if it succeeds (the read part of the cmpxchg operation acts as an
1804*4882a593Smuzhiyunacquire fence only if the operation is successful).  spin_unlock(&s)
1805*4882a593Smuzhiyunis treated almost the same as:
1806*4882a593Smuzhiyun
1807*4882a593Smuzhiyun	smp_store_release(&s, 0);
1808*4882a593Smuzhiyun
1809*4882a593SmuzhiyunThe "almost" qualifiers above need some explanation.  In the LKMM, the
1810*4882a593Smuzhiyunstore-release in a spin_unlock() and the load-acquire which forms the
1811*4882a593Smuzhiyunfirst half of the atomic rmw update in a spin_lock() or a successful
1812*4882a593Smuzhiyunspin_trylock() -- we can call these things lock-releases and
1813*4882a593Smuzhiyunlock-acquires -- have two properties beyond those of ordinary releases
1814*4882a593Smuzhiyunand acquires.
1815*4882a593Smuzhiyun
1816*4882a593SmuzhiyunFirst, when a lock-acquire reads from a lock-release, the LKMM
1817*4882a593Smuzhiyunrequires that every instruction po-before the lock-release must
1818*4882a593Smuzhiyunexecute before any instruction po-after the lock-acquire.  This would
1819*4882a593Smuzhiyunnaturally hold if the release and acquire operations were on different
1820*4882a593SmuzhiyunCPUs, but the LKMM says it holds even when they are on the same CPU.
1821*4882a593SmuzhiyunFor example:
1822*4882a593Smuzhiyun
1823*4882a593Smuzhiyun	int x, y;
1824*4882a593Smuzhiyun	spinlock_t s;
1825*4882a593Smuzhiyun
1826*4882a593Smuzhiyun	P0()
1827*4882a593Smuzhiyun	{
1828*4882a593Smuzhiyun		int r1, r2;
1829*4882a593Smuzhiyun
1830*4882a593Smuzhiyun		spin_lock(&s);
1831*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1832*4882a593Smuzhiyun		spin_unlock(&s);
1833*4882a593Smuzhiyun		spin_lock(&s);
1834*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1835*4882a593Smuzhiyun		spin_unlock(&s);
1836*4882a593Smuzhiyun	}
1837*4882a593Smuzhiyun
1838*4882a593Smuzhiyun	P1()
1839*4882a593Smuzhiyun	{
1840*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1841*4882a593Smuzhiyun		smp_wmb();
1842*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1843*4882a593Smuzhiyun	}
1844*4882a593Smuzhiyun
1845*4882a593SmuzhiyunHere the second spin_lock() reads from the first spin_unlock(), and
1846*4882a593Smuzhiyuntherefore the load of x must execute before the load of y.  Thus we
1847*4882a593Smuzhiyuncannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
1848*4882a593SmuzhiyunMP pattern).
1849*4882a593Smuzhiyun
1850*4882a593SmuzhiyunThis requirement does not apply to ordinary release and acquire
1851*4882a593Smuzhiyunfences, only to lock-related operations.  For instance, suppose P0()
1852*4882a593Smuzhiyunin the example had been written as:
1853*4882a593Smuzhiyun
1854*4882a593Smuzhiyun	P0()
1855*4882a593Smuzhiyun	{
1856*4882a593Smuzhiyun		int r1, r2, r3;
1857*4882a593Smuzhiyun
1858*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1859*4882a593Smuzhiyun		smp_store_release(&s, 1);
1860*4882a593Smuzhiyun		r3 = smp_load_acquire(&s);
1861*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1862*4882a593Smuzhiyun	}
1863*4882a593Smuzhiyun
1864*4882a593SmuzhiyunThen the CPU would be allowed to forward the s = 1 value from the
1865*4882a593Smuzhiyunsmp_store_release() to the smp_load_acquire(), executing the
1866*4882a593Smuzhiyuninstructions in the following order:
1867*4882a593Smuzhiyun
1868*4882a593Smuzhiyun		r3 = smp_load_acquire(&s);	// Obtains r3 = 1
1869*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1870*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1871*4882a593Smuzhiyun		smp_store_release(&s, 1);	// Value is forwarded
1872*4882a593Smuzhiyun
1873*4882a593Smuzhiyunand thus it could load y before x, obtaining r2 = 0 and r1 = 1.
1874*4882a593Smuzhiyun
1875*4882a593SmuzhiyunSecond, when a lock-acquire reads from a lock-release, and some other
1876*4882a593Smuzhiyunstores W and W' occur po-before the lock-release and po-after the
1877*4882a593Smuzhiyunlock-acquire respectively, the LKMM requires that W must propagate to
1878*4882a593Smuzhiyuneach CPU before W' does.  For example, consider:
1879*4882a593Smuzhiyun
1880*4882a593Smuzhiyun	int x, y;
1881*4882a593Smuzhiyun	spinlock_t s;
1882*4882a593Smuzhiyun
1883*4882a593Smuzhiyun	P0()
1884*4882a593Smuzhiyun	{
1885*4882a593Smuzhiyun		spin_lock(&s);
1886*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1887*4882a593Smuzhiyun		spin_unlock(&s);
1888*4882a593Smuzhiyun	}
1889*4882a593Smuzhiyun
1890*4882a593Smuzhiyun	P1()
1891*4882a593Smuzhiyun	{
1892*4882a593Smuzhiyun		int r1;
1893*4882a593Smuzhiyun
1894*4882a593Smuzhiyun		spin_lock(&s);
1895*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1896*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1897*4882a593Smuzhiyun		spin_unlock(&s);
1898*4882a593Smuzhiyun	}
1899*4882a593Smuzhiyun
1900*4882a593Smuzhiyun	P2()
1901*4882a593Smuzhiyun	{
1902*4882a593Smuzhiyun		int r2, r3;
1903*4882a593Smuzhiyun
1904*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1905*4882a593Smuzhiyun		smp_rmb();
1906*4882a593Smuzhiyun		r3 = READ_ONCE(x);
1907*4882a593Smuzhiyun	}
1908*4882a593Smuzhiyun
1909*4882a593SmuzhiyunIf r1 = 1 at the end then the spin_lock() in P1 must have read from
1910*4882a593Smuzhiyunthe spin_unlock() in P0.  Hence the store to x must propagate to P2
1911*4882a593Smuzhiyunbefore the store to y does, so we cannot have r2 = 1 and r3 = 0.
1912*4882a593Smuzhiyun
1913*4882a593SmuzhiyunThese two special requirements for lock-release and lock-acquire do
1914*4882a593Smuzhiyunnot arise from the operational model.  Nevertheless, kernel developers
1915*4882a593Smuzhiyunhave come to expect and rely on them because they do hold on all
1916*4882a593Smuzhiyunarchitectures supported by the Linux kernel, albeit for various
1917*4882a593Smuzhiyundiffering reasons.
1918*4882a593Smuzhiyun
1919*4882a593Smuzhiyun
1920*4882a593SmuzhiyunPLAIN ACCESSES AND DATA RACES
1921*4882a593Smuzhiyun-----------------------------
1922*4882a593Smuzhiyun
1923*4882a593SmuzhiyunIn the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
1924*4882a593Smuzhiyunsmp_load_acquire(&z), and so on are collectively referred to as
1925*4882a593Smuzhiyun"marked" accesses, because they are all annotated with special
1926*4882a593Smuzhiyunoperations of one kind or another.  Ordinary C-language memory
1927*4882a593Smuzhiyunaccesses such as x or y = 0 are simply called "plain" accesses.
1928*4882a593Smuzhiyun
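For instance (a small sketch in the style of this document's other
examples, using the variables x and y), the following fragment
contains two marked accesses followed by two plain accesses:

	int r;

	r = READ_ONCE(x);	/* marked load  */
	WRITE_ONCE(y, 1);	/* marked store */
	r = x;			/* plain load   */
	y = 0;			/* plain store  */
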
1929*4882a593SmuzhiyunEarly versions of the LKMM had nothing to say about plain accesses.
1930*4882a593SmuzhiyunThe C standard allows compilers to assume that the variables affected
1931*4882a593Smuzhiyunby plain accesses are not concurrently read or written by any other
1932*4882a593Smuzhiyunthreads or CPUs.  This leaves compilers free to implement all manner
1933*4882a593Smuzhiyunof transformations or optimizations of code containing plain accesses,
1934*4882a593Smuzhiyunmaking such code very difficult for a memory model to handle.
1935*4882a593Smuzhiyun
1936*4882a593SmuzhiyunHere is just one example of a possible pitfall:
1937*4882a593Smuzhiyun
1938*4882a593Smuzhiyun	int a = 6;
1939*4882a593Smuzhiyun	int *x = &a;
1940*4882a593Smuzhiyun
1941*4882a593Smuzhiyun	P0()
1942*4882a593Smuzhiyun	{
1943*4882a593Smuzhiyun		int *r1;
1944*4882a593Smuzhiyun		int r2 = 0;
1945*4882a593Smuzhiyun
1946*4882a593Smuzhiyun		r1 = x;
1947*4882a593Smuzhiyun		if (r1 != NULL)
1948*4882a593Smuzhiyun			r2 = READ_ONCE(*r1);
1949*4882a593Smuzhiyun	}
1950*4882a593Smuzhiyun
1951*4882a593Smuzhiyun	P1()
1952*4882a593Smuzhiyun	{
1953*4882a593Smuzhiyun		WRITE_ONCE(x, NULL);
1954*4882a593Smuzhiyun	}
1955*4882a593Smuzhiyun
1956*4882a593SmuzhiyunOn the face of it, one would expect that when this code runs, the only
1957*4882a593Smuzhiyunpossible final values for r2 are 6 and 0, depending on whether or not
1958*4882a593SmuzhiyunP1's store to x propagates to P0 before P0's load from x executes.
1959*4882a593SmuzhiyunBut since P0's load from x is a plain access, the compiler may decide
1960*4882a593Smuzhiyunto carry out the load twice (for the comparison against NULL, then again
1961*4882a593Smuzhiyunfor the READ_ONCE()) and eliminate the temporary variable r1.  The
1962*4882a593Smuzhiyunobject code generated for P0 could therefore end up looking rather
1963*4882a593Smuzhiyunlike this:
1964*4882a593Smuzhiyun
1965*4882a593Smuzhiyun	P0()
1966*4882a593Smuzhiyun	{
1967*4882a593Smuzhiyun		int r2 = 0;
1968*4882a593Smuzhiyun
1969*4882a593Smuzhiyun		if (x != NULL)
1970*4882a593Smuzhiyun			r2 = READ_ONCE(*x);
1971*4882a593Smuzhiyun	}
1972*4882a593Smuzhiyun
1973*4882a593SmuzhiyunAnd now it is obvious that this code runs the risk of dereferencing a
1974*4882a593SmuzhiyunNULL pointer, because P1's store to x might propagate to P0 after the
1975*4882a593Smuzhiyuntest against NULL has been made but before the READ_ONCE() executes.
1976*4882a593SmuzhiyunIf the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
1977*4882a593Smuzhiyunthe compiler would not have performed this optimization and there
1978*4882a593Smuzhiyunwould be no possibility of a NULL-pointer dereference.
1979*4882a593Smuzhiyun
1980*4882a593SmuzhiyunGiven the possibility of transformations like this one, the LKMM
1981*4882a593Smuzhiyundoesn't try to predict all possible outcomes of code containing plain
1982*4882a593Smuzhiyunaccesses.  It is instead content to determine whether the code
1983*4882a593Smuzhiyunviolates the compiler's assumptions, which would render the ultimate
1984*4882a593Smuzhiyunoutcome undefined.
1985*4882a593Smuzhiyun
1986*4882a593SmuzhiyunIn technical terms, the compiler is allowed to assume that when the
1987*4882a593Smuzhiyunprogram executes, there will not be any data races.  A "data race"
1988*4882a593Smuzhiyunoccurs when there are two memory accesses such that:
1989*4882a593Smuzhiyun
1990*4882a593Smuzhiyun1.	they access the same location,
1991*4882a593Smuzhiyun
1992*4882a593Smuzhiyun2.	at least one of them is a store,
1993*4882a593Smuzhiyun
1994*4882a593Smuzhiyun3.	at least one of them is plain,
1995*4882a593Smuzhiyun
1996*4882a593Smuzhiyun4.	they occur on different CPUs (or in different threads on the
1997*4882a593Smuzhiyun	same CPU), and
1998*4882a593Smuzhiyun
1999*4882a593Smuzhiyun5.	they execute concurrently.
2000*4882a593Smuzhiyun
2001*4882a593SmuzhiyunIn the literature, two accesses are said to "conflict" if they satisfy
2002*4882a593Smuzhiyun1 and 2 above.  We'll go a little farther and say that two accesses
2003*4882a593Smuzhiyunare "race candidates" if they satisfy 1 - 4.  Thus, whether or not two
2004*4882a593Smuzhiyunrace candidates actually do race in a given execution depends on
2005*4882a593Smuzhiyunwhether they are concurrent.
2006*4882a593Smuzhiyun
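As a small illustration (a hypothetical fragment, not an analysis of
any particular kernel code), consider:

	int x;

	P0()
	{
		x = 1;			/* plain store */
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);	/* marked load */
	}

The two accesses to x satisfy conditions 1 - 4 above: they access the
same location, one of them is a store, one of them is plain, and they
occur on different CPUs.  They are therefore race candidates; whether
they actually race depends on whether they execute concurrently.
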
2007*4882a593SmuzhiyunThe LKMM tries to determine whether a program contains race candidates
2008*4882a593Smuzhiyunwhich may execute concurrently; if it does then the LKMM says there is
2009*4882a593Smuzhiyuna potential data race and makes no predictions about the program's
2010*4882a593Smuzhiyunoutcome.
2011*4882a593Smuzhiyun
2012*4882a593SmuzhiyunDetermining whether two accesses are race candidates is easy; you can
2013*4882a593Smuzhiyunsee that all the concepts involved in the definition above are already
2014*4882a593Smuzhiyunpart of the memory model.  The hard part is telling whether they may
2015*4882a593Smuzhiyunexecute concurrently.  The LKMM takes a conservative attitude,
2016*4882a593Smuzhiyunassuming that accesses may be concurrent unless it can prove they
2017*4882a593Smuzhiyunare not.
2018*4882a593Smuzhiyun
2019*4882a593SmuzhiyunIf two memory accesses aren't concurrent then one must execute before
2020*4882a593Smuzhiyunthe other.  Therefore the LKMM decides two accesses aren't concurrent
2021*4882a593Smuzhiyunif they can be connected by a sequence of hb, pb, and rb links
2022*4882a593Smuzhiyun(together referred to as xb, for "executes before").  However, there
2023*4882a593Smuzhiyunare two complicating factors.
2024*4882a593Smuzhiyun
2025*4882a593SmuzhiyunIf X is a load and X executes before a store Y, then indeed there is
2026*4882a593Smuzhiyunno danger of X and Y being concurrent.  After all, Y can't have any
2027*4882a593Smuzhiyuneffect on the value obtained by X until the memory subsystem has
2028*4882a593Smuzhiyunpropagated Y from its own CPU to X's CPU, which won't happen until
2029*4882a593Smuzhiyunsome time after Y executes and thus after X executes.  But if X is a
2030*4882a593Smuzhiyunstore, then even if X executes before Y it is still possible that X
2031*4882a593Smuzhiyunwill propagate to Y's CPU just as Y is executing.  In such a case X
2032*4882a593Smuzhiyuncould very well interfere somehow with Y, and we would have to
2033*4882a593Smuzhiyunconsider X and Y to be concurrent.
2034*4882a593Smuzhiyun
2035*4882a593SmuzhiyunTherefore when X is a store, for X and Y to be non-concurrent the LKMM
2036*4882a593Smuzhiyunrequires not only that X must execute before Y but also that X must
2037*4882a593Smuzhiyunpropagate to Y's CPU before Y executes.  (Or vice versa, of course, if
2038*4882a593SmuzhiyunY executes before X -- then Y must propagate to X's CPU before X
2039*4882a593Smuzhiyunexecutes if Y is a store.)  This is expressed by the visibility
2040*4882a593Smuzhiyunrelation (vis), where X ->vis Y is defined to hold if there is an
2041*4882a593Smuzhiyunintermediate event Z such that:
2042*4882a593Smuzhiyun
2043*4882a593Smuzhiyun	X is connected to Z by a possibly empty sequence of
2044*4882a593Smuzhiyun	cumul-fence links followed by an optional rfe link (if none of
2045*4882a593Smuzhiyun	these links are present, X and Z are the same event),
2046*4882a593Smuzhiyun
2047*4882a593Smuzhiyunand either:
2048*4882a593Smuzhiyun
2049*4882a593Smuzhiyun	Z is connected to Y by a strong-fence link followed by a
2050*4882a593Smuzhiyun	possibly empty sequence of xb links,
2051*4882a593Smuzhiyun
2052*4882a593Smuzhiyunor:
2053*4882a593Smuzhiyun
2054*4882a593Smuzhiyun	Z is on the same CPU as Y and is connected to Y by a possibly
2055*4882a593Smuzhiyun	empty sequence of xb links (again, if the sequence is empty it
2056*4882a593Smuzhiyun	means Z and Y are the same event).
2057*4882a593Smuzhiyun
2058*4882a593SmuzhiyunThe motivations behind this definition are straightforward:
2059*4882a593Smuzhiyun
2060*4882a593Smuzhiyun	cumul-fence memory barriers force stores that are po-before
2061*4882a593Smuzhiyun	the barrier to propagate to other CPUs before stores that are
2062*4882a593Smuzhiyun	po-after the barrier.
2063*4882a593Smuzhiyun
2064*4882a593Smuzhiyun	An rfe link from an event W to an event R says that R reads
2065*4882a593Smuzhiyun	from W, which certainly means that W must have propagated to
2066*4882a593Smuzhiyun	R's CPU before R executed.
2067*4882a593Smuzhiyun
2068*4882a593Smuzhiyun	strong-fence memory barriers force stores that are po-before
2069*4882a593Smuzhiyun	the barrier, or that propagate to the barrier's CPU before the
2070*4882a593Smuzhiyun	barrier executes, to propagate to all CPUs before any events
2071*4882a593Smuzhiyun	po-after the barrier can execute.
2072*4882a593Smuzhiyun
2073*4882a593SmuzhiyunTo see how this works out in practice, consider our old friend, the MP
2074*4882a593Smuzhiyunpattern (with fences and statement labels, but without the conditional
2075*4882a593Smuzhiyuntest):
2076*4882a593Smuzhiyun
2077*4882a593Smuzhiyun	int buf = 0, flag = 0;
2078*4882a593Smuzhiyun
2079*4882a593Smuzhiyun	P0()
2080*4882a593Smuzhiyun	{
2081*4882a593Smuzhiyun		X: WRITE_ONCE(buf, 1);
2082*4882a593Smuzhiyun		   smp_wmb();
2083*4882a593Smuzhiyun		W: WRITE_ONCE(flag, 1);
2084*4882a593Smuzhiyun	}
2085*4882a593Smuzhiyun
2086*4882a593Smuzhiyun	P1()
2087*4882a593Smuzhiyun	{
2088*4882a593Smuzhiyun		int r1;
2089*4882a593Smuzhiyun		int r2 = 0;
2090*4882a593Smuzhiyun
2091*4882a593Smuzhiyun		Z: r1 = READ_ONCE(flag);
2092*4882a593Smuzhiyun		   smp_rmb();
2093*4882a593Smuzhiyun		Y: r2 = READ_ONCE(buf);
2094*4882a593Smuzhiyun	}
2095*4882a593Smuzhiyun
2096*4882a593SmuzhiyunThe smp_wmb() memory barrier gives a cumul-fence link from X to W, and
2097*4882a593Smuzhiyunassuming r1 = 1 at the end, there is an rfe link from W to Z.  This
2098*4882a593Smuzhiyunmeans that the store to buf must propagate from P0 to P1 before Z
2099*4882a593Smuzhiyunexecutes.  Next, Z and Y are on the same CPU and the smp_rmb() fence
2100*4882a593Smuzhiyunprovides an xb link from Z to Y (i.e., it forces Z to execute before
2101*4882a593SmuzhiyunY).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
2102*4882a593Smuzhiyunexecutes.
2103*4882a593Smuzhiyun
2104*4882a593SmuzhiyunThe second complicating factor mentioned above arises from the fact
2105*4882a593Smuzhiyunthat when we are considering data races, some of the memory accesses
2106*4882a593Smuzhiyunare plain.  Now, although we have not said so explicitly, up to this
2107*4882a593Smuzhiyunpoint most of the relations defined by the LKMM (ppo, hb, prop,
2108*4882a593Smuzhiyuncumul-fence, pb, and so on -- including vis) apply only to marked
2109*4882a593Smuzhiyunaccesses.
2110*4882a593Smuzhiyun
2111*4882a593SmuzhiyunThere are good reasons for this restriction.  The compiler is not
2112*4882a593Smuzhiyunallowed to apply fancy transformations to marked accesses, and
2113*4882a593Smuzhiyunconsequently each such access in the source code corresponds more or
2114*4882a593Smuzhiyunless directly to a single machine instruction in the object code.  But
2115*4882a593Smuzhiyunplain accesses are a different story; the compiler may combine them,
2116*4882a593Smuzhiyunsplit them up, duplicate them, eliminate them, invent new ones, and
2117*4882a593Smuzhiyunwho knows what else.  Seeing a plain access in the source code tells
2118*4882a593Smuzhiyunyou almost nothing about what machine instructions will end up in the
2119*4882a593Smuzhiyunobject code.
2120*4882a593Smuzhiyun
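For example (a hypothetical sketch, not something the LKMM itself
defines), given plain source code like:

	x = 1;
	x = 2;

the compiler may eliminate the first store entirely, and given:

	r1 = x;
	r2 = x;

it may perform only a single load and reuse the value for both r1 and
r2.
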
2121*4882a593SmuzhiyunFortunately, the compiler isn't completely free; it is subject to some
2122*4882a593Smuzhiyunlimitations.  For one, it is not allowed to introduce a data race into
2123*4882a593Smuzhiyunthe object code if the source code does not already contain a data
2124*4882a593Smuzhiyunrace (if it could, memory models would be useless and no multithreaded
2125*4882a593Smuzhiyuncode would be safe!).  For another, it cannot move a plain access past
2126*4882a593Smuzhiyuna compiler barrier.
2127*4882a593Smuzhiyun
2128*4882a593SmuzhiyunA compiler barrier is a kind of fence, but as the name implies, it
2129*4882a593Smuzhiyunonly affects the compiler; it does not necessarily have any effect on
2130*4882a593Smuzhiyunhow instructions are executed by the CPU.  In Linux kernel source
2131*4882a593Smuzhiyuncode, the barrier() function is a compiler barrier.  It doesn't give
2132*4882a593Smuzhiyunrise directly to any machine instructions in the object code; rather,
2133*4882a593Smuzhiyunit affects how the compiler generates the rest of the object code.
2134*4882a593SmuzhiyunGiven source code like this:
2135*4882a593Smuzhiyun
2136*4882a593Smuzhiyun	... some memory accesses ...
2137*4882a593Smuzhiyun	barrier();
2138*4882a593Smuzhiyun	... some other memory accesses ...
2139*4882a593Smuzhiyun
2140*4882a593Smuzhiyunthe barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end up po-before
2142*4882a593Smuzhiyunany machine instructions corresponding to the second group of accesses
2143*4882a593Smuzhiyun-- even if some of the accesses are plain.  (Of course, the CPU may
2144*4882a593Smuzhiyunthen execute some of those accesses out of program order, but we
2145*4882a593Smuzhiyunalready know how to deal with such issues.)  Without the barrier()
2146*4882a593Smuzhiyunthere would be no such guarantee; the two groups of accesses could be
2147*4882a593Smuzhiyunintermingled or even reversed in the object code.
2148*4882a593Smuzhiyun
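For instance, in a sketch with hypothetical plain variables a and b:

	a = 1;
	barrier();
	b = 1;

the machine instructions implementing the store to a must all be
po-before those implementing the store to b; without the barrier()
the compiler would be free to emit the two stores in either order.
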
2149*4882a593SmuzhiyunThe LKMM doesn't say much about the barrier() function, but it does
2150*4882a593Smuzhiyunrequire that all fences are also compiler barriers.  In addition, it
2151*4882a593Smuzhiyunrequires that the ordering properties of memory barriers such as
2152*4882a593Smuzhiyunsmp_rmb() or smp_store_release() apply to plain accesses as well as to
2153*4882a593Smuzhiyunmarked accesses.
2154*4882a593Smuzhiyun
2155*4882a593SmuzhiyunThis is the key to analyzing data races.  Consider the MP pattern
2156*4882a593Smuzhiyunagain, now using plain accesses for buf:
2157*4882a593Smuzhiyun
2158*4882a593Smuzhiyun	int buf = 0, flag = 0;
2159*4882a593Smuzhiyun
2160*4882a593Smuzhiyun	P0()
2161*4882a593Smuzhiyun	{
2162*4882a593Smuzhiyun		U: buf = 1;
2163*4882a593Smuzhiyun		   smp_wmb();
2164*4882a593Smuzhiyun		X: WRITE_ONCE(flag, 1);
2165*4882a593Smuzhiyun	}
2166*4882a593Smuzhiyun
2167*4882a593Smuzhiyun	P1()
2168*4882a593Smuzhiyun	{
2169*4882a593Smuzhiyun		int r1;
2170*4882a593Smuzhiyun		int r2 = 0;
2171*4882a593Smuzhiyun
2172*4882a593Smuzhiyun		Y: r1 = READ_ONCE(flag);
2173*4882a593Smuzhiyun		   if (r1) {
2174*4882a593Smuzhiyun			   smp_rmb();
2175*4882a593Smuzhiyun			V: r2 = buf;
2176*4882a593Smuzhiyun		   }
2177*4882a593Smuzhiyun	}
2178*4882a593Smuzhiyun
2179*4882a593SmuzhiyunThis program does not contain a data race.  Although the U and V
2180*4882a593Smuzhiyunaccesses are race candidates, the LKMM can prove they are not
2181*4882a593Smuzhiyunconcurrent as follows:
2182*4882a593Smuzhiyun
2183*4882a593Smuzhiyun	The smp_wmb() fence in P0 is both a compiler barrier and a
2184*4882a593Smuzhiyun	cumul-fence.  It guarantees that no matter what hash of
2185*4882a593Smuzhiyun	machine instructions the compiler generates for the plain
2186*4882a593Smuzhiyun	access U, all those instructions will be po-before the fence.
2187*4882a593Smuzhiyun	Consequently U's store to buf, no matter how it is carried out
2188*4882a593Smuzhiyun	at the machine level, must propagate to P1 before X's store to
2189*4882a593Smuzhiyun	flag does.
2190*4882a593Smuzhiyun
2191*4882a593Smuzhiyun	X and Y are both marked accesses.  Hence an rfe link from X to
2192*4882a593Smuzhiyun	Y is a valid indicator that X propagated to P1 before Y
2193*4882a593Smuzhiyun	executed, i.e., X ->vis Y.  (And if there is no rfe link then
2194*4882a593Smuzhiyun	r1 will be 0, so V will not be executed and ipso facto won't
2195*4882a593Smuzhiyun	race with U.)
2196*4882a593Smuzhiyun
2197*4882a593Smuzhiyun	The smp_rmb() fence in P1 is a compiler barrier as well as a
2198*4882a593Smuzhiyun	fence.  It guarantees that all the machine-level instructions
2199*4882a593Smuzhiyun	corresponding to the access V will be po-after the fence, and
2200*4882a593Smuzhiyun	therefore any loads among those instructions will execute
2201*4882a593Smuzhiyun	after the fence does and hence after Y does.
2202*4882a593Smuzhiyun
2203*4882a593SmuzhiyunThus U's store to buf is forced to propagate to P1 before V's load
2204*4882a593Smuzhiyunexecutes (assuming V does execute), ruling out the possibility of a
2205*4882a593Smuzhiyundata race between them.
2206*4882a593Smuzhiyun
2207*4882a593SmuzhiyunThis analysis illustrates how the LKMM deals with plain accesses in
2208*4882a593Smuzhiyungeneral.  Suppose R is a plain load and we want to show that R
2209*4882a593Smuzhiyunexecutes before some marked access E.  We can do this by finding a
2210*4882a593Smuzhiyunmarked access X such that R and X are ordered by a suitable fence and
2211*4882a593SmuzhiyunX ->xb* E.  If E was also a plain access, we would also look for a
2212*4882a593Smuzhiyunmarked access Y such that X ->xb* Y, and Y and E are ordered by a
fence.  We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.  (In the plain-access
MP example above, for instance, U is post-bounded by X and V is
pre-bounded by Y.)
2215*4882a593Smuzhiyun
2216*4882a593SmuzhiyunIn fact, we go one step further: Since R is a read, we say that R is
2217*4882a593Smuzhiyun"r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a load or a store.
2219*4882a593SmuzhiyunThis distinction is needed because some fences affect only loads
2220*4882a593Smuzhiyun(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
2221*4882a593Smuzhiyunthe two types of bounds are the same.  And as a degenerate case, we
2222*4882a593Smuzhiyunsay that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself).
2224*4882a593Smuzhiyun
2225*4882a593SmuzhiyunThe need to distinguish between r- and w-bounding raises yet another
2226*4882a593Smuzhiyunissue.  When the source code contains a plain store, the compiler is
2227*4882a593Smuzhiyunallowed to put plain loads of the same location into the object code.
2228*4882a593SmuzhiyunFor example, given the source code:
2229*4882a593Smuzhiyun
2230*4882a593Smuzhiyun	x = 1;
2231*4882a593Smuzhiyun
2232*4882a593Smuzhiyunthe compiler is theoretically allowed to generate object code that
2233*4882a593Smuzhiyunlooks like:
2234*4882a593Smuzhiyun
2235*4882a593Smuzhiyun	if (x != 1)
2236*4882a593Smuzhiyun		x = 1;
2237*4882a593Smuzhiyun
2238*4882a593Smuzhiyunthereby adding a load (and possibly replacing the store entirely).
2239*4882a593SmuzhiyunFor this reason, whenever the LKMM requires a plain store to be
2240*4882a593Smuzhiyunw-pre-bounded or w-post-bounded by a marked access, it also requires
2241*4882a593Smuzhiyunthe store to be r-pre-bounded or r-post-bounded, so as to handle cases
2242*4882a593Smuzhiyunwhere the compiler adds a load.
2243*4882a593Smuzhiyun
2244*4882a593Smuzhiyun(This may be overly cautious.  We don't know of any examples where a
2245*4882a593Smuzhiyuncompiler has augmented a store with a load in this fashion, and the
2246*4882a593SmuzhiyunLinux kernel developers would probably fight pretty hard to change a
2247*4882a593Smuzhiyuncompiler if it ever did this.  Still, better safe than sorry.)
2248*4882a593Smuzhiyun
Incidentally, the other transformation -- augmenting a plain load by
2250*4882a593Smuzhiyunadding in a store to the same location -- is not allowed.  This is
2251*4882a593Smuzhiyunbecause the compiler cannot know whether any other CPUs might perform
2252*4882a593Smuzhiyuna concurrent load from that location.  Two concurrent loads don't
2253*4882a593Smuzhiyunconstitute a race (they can't interfere with each other), but a store
2254*4882a593Smuzhiyundoes race with a concurrent load.  Thus adding a store might create a
2255*4882a593Smuzhiyundata race where one was not already present in the source code,
2256*4882a593Smuzhiyunsomething the compiler is forbidden to do.  Augmenting a store with a
2257*4882a593Smuzhiyunload, on the other hand, is acceptable because doing so won't create a
2258*4882a593Smuzhiyundata race unless one already existed.
2259*4882a593Smuzhiyun
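In other words (a hypothetical sketch), the compiler must not compile:

	r1 = x;

as though it had been written:

	x = x;		/* invented store -- forbidden */
	r1 = x;

because the invented store to x is plain and therefore could race with
a load of x performed concurrently by another CPU.
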
2260*4882a593SmuzhiyunThe LKMM includes a second way to pre-bound plain accesses, in
2261*4882a593Smuzhiyunaddition to fences: an address dependency from a marked load.  That
2262*4882a593Smuzhiyunis, in the sequence:
2263*4882a593Smuzhiyun
2264*4882a593Smuzhiyun	p = READ_ONCE(ptr);
2265*4882a593Smuzhiyun	r = *p;
2266*4882a593Smuzhiyun
2267*4882a593Smuzhiyunthe LKMM says that the marked load of ptr pre-bounds the plain load of
2268*4882a593Smuzhiyun*p; the marked load must execute before any of the machine
2269*4882a593Smuzhiyuninstructions corresponding to the plain load.  This is a reasonable
2270*4882a593Smuzhiyunstipulation, since after all, the CPU can't perform the load of *p
2271*4882a593Smuzhiyununtil it knows what value p will hold.  Furthermore, without some
2272*4882a593Smuzhiyunassumption like this one, some usages typical of RCU would count as
2273*4882a593Smuzhiyundata races.  For example:
2274*4882a593Smuzhiyun
2275*4882a593Smuzhiyun	int a = 1, b;
2276*4882a593Smuzhiyun	int *ptr = &a;
2277*4882a593Smuzhiyun
2278*4882a593Smuzhiyun	P0()
2279*4882a593Smuzhiyun	{
2280*4882a593Smuzhiyun		b = 2;
2281*4882a593Smuzhiyun		rcu_assign_pointer(ptr, &b);
2282*4882a593Smuzhiyun	}
2283*4882a593Smuzhiyun
2284*4882a593Smuzhiyun	P1()
2285*4882a593Smuzhiyun	{
2286*4882a593Smuzhiyun		int *p;
2287*4882a593Smuzhiyun		int r;
2288*4882a593Smuzhiyun
2289*4882a593Smuzhiyun		rcu_read_lock();
2290*4882a593Smuzhiyun		p = rcu_dereference(ptr);
2291*4882a593Smuzhiyun		r = *p;
2292*4882a593Smuzhiyun		rcu_read_unlock();
2293*4882a593Smuzhiyun	}
2294*4882a593Smuzhiyun
2295*4882a593Smuzhiyun(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
2296*4882a593Smuzhiyunreally do anything, because there aren't any grace periods.  They are
2297*4882a593Smuzhiyunincluded merely for the sake of good form; typically P0 would call
2298*4882a593Smuzhiyunsynchronize_rcu() somewhere after the rcu_assign_pointer().)
2299*4882a593Smuzhiyun
2300*4882a593Smuzhiyunrcu_assign_pointer() performs a store-release, so the plain store to b
2301*4882a593Smuzhiyunis definitely w-post-bounded before the store to ptr, and the two
2302*4882a593Smuzhiyunstores will propagate to P1 in that order.  However, rcu_dereference()
2303*4882a593Smuzhiyunis only equivalent to READ_ONCE().  While it is a marked access, it is
2304*4882a593Smuzhiyunnot a fence or compiler barrier.  Hence the only guarantee we have
2305*4882a593Smuzhiyunthat the load of ptr in P1 is r-pre-bounded before the load of *p
2306*4882a593Smuzhiyun(thus avoiding a race) is the assumption about address dependencies.
2307*4882a593Smuzhiyun
2308*4882a593SmuzhiyunThis is a situation where the compiler can undermine the memory model,
2309*4882a593Smuzhiyunand a certain amount of care is required when programming constructs
2310*4882a593Smuzhiyunlike this one.  In particular, comparisons between the pointer and
2311*4882a593Smuzhiyunother known addresses can cause trouble.  If you have something like:
2312*4882a593Smuzhiyun
2313*4882a593Smuzhiyun	p = rcu_dereference(ptr);
2314*4882a593Smuzhiyun	if (p == &x)
2315*4882a593Smuzhiyun		r = *p;
2316*4882a593Smuzhiyun
2317*4882a593Smuzhiyunthen the compiler just might generate object code resembling:
2318*4882a593Smuzhiyun
2319*4882a593Smuzhiyun	p = rcu_dereference(ptr);
2320*4882a593Smuzhiyun	if (p == &x)
2321*4882a593Smuzhiyun		r = x;
2322*4882a593Smuzhiyun
2323*4882a593Smuzhiyunor even:
2324*4882a593Smuzhiyun
2325*4882a593Smuzhiyun	rtemp = x;
2326*4882a593Smuzhiyun	p = rcu_dereference(ptr);
2327*4882a593Smuzhiyun	if (p == &x)
2328*4882a593Smuzhiyun		r = rtemp;
2329*4882a593Smuzhiyun
2330*4882a593Smuzhiyunwhich would invalidate the memory model's assumption, since the CPU
2331*4882a593Smuzhiyuncould now perform the load of x before the load of ptr (there might be
2332*4882a593Smuzhiyuna control dependency but no address dependency at the machine level).
2333*4882a593Smuzhiyun
2334*4882a593SmuzhiyunFinally, it turns out there is a situation in which a plain write does
2335*4882a593Smuzhiyunnot need to be w-post-bounded: when it is separated from the other
2336*4882a593Smuzhiyunrace-candidate access by a fence.  At first glance this may seem
2337*4882a593Smuzhiyunimpossible.  After all, to be race candidates the two accesses must
2338*4882a593Smuzhiyunbe on different CPUs, and fences don't link events on different CPUs.
2339*4882a593SmuzhiyunWell, normal fences don't -- but rcu-fence can!  Here's an example:
2340*4882a593Smuzhiyun
2341*4882a593Smuzhiyun	int x, y;
2342*4882a593Smuzhiyun
2343*4882a593Smuzhiyun	P0()
2344*4882a593Smuzhiyun	{
2345*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
2346*4882a593Smuzhiyun		synchronize_rcu();
2347*4882a593Smuzhiyun		y = 3;
2348*4882a593Smuzhiyun	}
2349*4882a593Smuzhiyun
2350*4882a593Smuzhiyun	P1()
2351*4882a593Smuzhiyun	{
2352*4882a593Smuzhiyun		rcu_read_lock();
2353*4882a593Smuzhiyun		if (READ_ONCE(x) == 0)
2354*4882a593Smuzhiyun			y = 2;
2355*4882a593Smuzhiyun		rcu_read_unlock();
2356*4882a593Smuzhiyun	}
2357*4882a593Smuzhiyun
2358*4882a593SmuzhiyunDo the plain stores to y race?  Clearly not if P1 reads a non-zero
2359*4882a593Smuzhiyunvalue for x, so let's assume the READ_ONCE(x) does obtain 0.  This
2360*4882a593Smuzhiyunmeans that the read-side critical section in P1 must finish executing
2361*4882a593Smuzhiyunbefore the grace period in P0 does, because RCU's Grace-Period
2362*4882a593SmuzhiyunGuarantee says that otherwise P0's store to x would have propagated to
2363*4882a593SmuzhiyunP1 before the critical section started and so would have been visible
2364*4882a593Smuzhiyunto the READ_ONCE().  (Another way of putting it is that the fre link
2365*4882a593Smuzhiyunfrom the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
2366*4882a593Smuzhiyunbetween those two events.)
2367*4882a593Smuzhiyun
2368*4882a593SmuzhiyunThis means there is an rcu-fence link from P1's "y = 2" store to P0's
2369*4882a593Smuzhiyun"y = 3" store, and consequently the first must propagate from P1 to P0
2370*4882a593Smuzhiyunbefore the second can execute.  Therefore the two stores cannot be
2371*4882a593Smuzhiyunconcurrent and there is no race, even though P1's plain store to y
2372*4882a593Smuzhiyunisn't w-post-bounded by any marked accesses.
2373*4882a593Smuzhiyun
2374*4882a593SmuzhiyunPutting all this material together yields the following picture.  For
2375*4882a593Smuzhiyunrace-candidate stores W and W', where W ->co W', the LKMM says the
2376*4882a593Smuzhiyunstores don't race if W can be linked to W' by a
2377*4882a593Smuzhiyun
2378*4882a593Smuzhiyun	w-post-bounded ; vis ; w-pre-bounded
2379*4882a593Smuzhiyun
2380*4882a593Smuzhiyunsequence.  If W is plain then they also have to be linked by an
2381*4882a593Smuzhiyun
2382*4882a593Smuzhiyun	r-post-bounded ; xb* ; w-pre-bounded
2383*4882a593Smuzhiyun
2384*4882a593Smuzhiyunsequence, and if W' is plain then they also have to be linked by a
2385*4882a593Smuzhiyun
2386*4882a593Smuzhiyun	w-post-bounded ; vis ; r-pre-bounded
2387*4882a593Smuzhiyun
2388*4882a593Smuzhiyunsequence.  For race-candidate load R and store W, the LKMM says the
2389*4882a593Smuzhiyuntwo accesses don't race if R can be linked to W by an
2390*4882a593Smuzhiyun
2391*4882a593Smuzhiyun	r-post-bounded ; xb* ; w-pre-bounded
2392*4882a593Smuzhiyun
2393*4882a593Smuzhiyunsequence or if W can be linked to R by a
2394*4882a593Smuzhiyun
2395*4882a593Smuzhiyun	w-post-bounded ; vis ; r-pre-bounded
2396*4882a593Smuzhiyun
2397*4882a593Smuzhiyunsequence.  For the cases involving a vis link, the LKMM also accepts
2398*4882a593Smuzhiyunsequences in which W is linked to W' or R by a
2399*4882a593Smuzhiyun
2400*4882a593Smuzhiyun	strong-fence ; xb* ; {w and/or r}-pre-bounded
2401*4882a593Smuzhiyun
2402*4882a593Smuzhiyunsequence with no post-bounding, and in every case the LKMM also allows
2403*4882a593Smuzhiyunthe link simply to be a fence with no bounding at all.  If no sequence
2404*4882a593Smuzhiyunof the appropriate sort exists, the LKMM says that the accesses race.
2405*4882a593Smuzhiyun
2406*4882a593SmuzhiyunThere is one more part of the LKMM related to plain accesses (although
2407*4882a593Smuzhiyunnot to data races) we should discuss.  Recall that many relations such
2408*4882a593Smuzhiyunas hb are limited to marked accesses only.  As a result, the
2409*4882a593Smuzhiyunhappens-before, propagates-before, and rcu axioms (which state that
various relations must not contain a cycle) don't apply to plain
2411*4882a593Smuzhiyunaccesses.  Nevertheless, we do want to rule out such cycles, because
2412*4882a593Smuzhiyunthey don't make sense even for plain accesses.
2413*4882a593Smuzhiyun
2414*4882a593SmuzhiyunTo this end, the LKMM imposes three extra restrictions, together
2415*4882a593Smuzhiyuncalled the "plain-coherence" axiom because of their resemblance to the
2416*4882a593Smuzhiyunrules used by the operational model to ensure cache coherence (that
2417*4882a593Smuzhiyunis, the rules governing the memory subsystem's choice of a store to
2418*4882a593Smuzhiyunsatisfy a load request and its determination of where a store will
2419*4882a593Smuzhiyunfall in the coherence order):
2420*4882a593Smuzhiyun
2421*4882a593Smuzhiyun	If R and W are race candidates and it is possible to link R to
2422*4882a593Smuzhiyun	W by one of the xb* sequences listed above, then W ->rfe R is
2423*4882a593Smuzhiyun	not allowed (i.e., a load cannot read from a store that it
2424*4882a593Smuzhiyun	executes before, even if one or both is plain).
2425*4882a593Smuzhiyun
2426*4882a593Smuzhiyun	If W and R are race candidates and it is possible to link W to
2427*4882a593Smuzhiyun	R by one of the vis sequences listed above, then R ->fre W is
2428*4882a593Smuzhiyun	not allowed (i.e., if a store is visible to a load then the
2429*4882a593Smuzhiyun	load must read from that store or one coherence-after it).
2430*4882a593Smuzhiyun
2431*4882a593Smuzhiyun	If W and W' are race candidates and it is possible to link W
2432*4882a593Smuzhiyun	to W' by one of the vis sequences listed above, then W' ->co W
2433*4882a593Smuzhiyun	is not allowed (i.e., if one store is visible to a second then
2434*4882a593Smuzhiyun	the second must come after the first in the coherence order).
2435*4882a593Smuzhiyun
2436*4882a593SmuzhiyunThis is the extent to which the LKMM deals with plain accesses.
2437*4882a593SmuzhiyunPerhaps it could say more (for example, plain accesses might
2438*4882a593Smuzhiyuncontribute to the ppo relation), but at the moment it seems that this
2439*4882a593Smuzhiyunminimal, conservative approach is good enough.
2440*4882a593Smuzhiyun
2441*4882a593Smuzhiyun
2442*4882a593SmuzhiyunODDS AND ENDS
2443*4882a593Smuzhiyun-------------
2444*4882a593Smuzhiyun
2445*4882a593SmuzhiyunThis section covers material that didn't quite fit anywhere in the
2446*4882a593Smuzhiyunearlier sections.
2447*4882a593Smuzhiyun
2448*4882a593SmuzhiyunThe descriptions in this document don't always match the formal
2449*4882a593Smuzhiyunversion of the LKMM exactly.  For example, the actual formal
2450*4882a593Smuzhiyundefinition of the prop relation makes the initial coe or fre part
2451*4882a593Smuzhiyunoptional, and it doesn't require the events linked by the relation to
2452*4882a593Smuzhiyunbe on the same CPU.  These differences are very unimportant; indeed,
2453*4882a593Smuzhiyuninstances where the coe/fre part of prop is missing are of no interest
2454*4882a593Smuzhiyunbecause all the other parts (fences and rfe) are already included in
2455*4882a593Smuzhiyunhb anyway, and where the formal model adds prop into hb, it includes
2456*4882a593Smuzhiyunan explicit requirement that the events being linked are on the same
2457*4882a593SmuzhiyunCPU.
2458*4882a593Smuzhiyun
2459*4882a593SmuzhiyunAnother minor difference has to do with events that are both memory
2460*4882a593Smuzhiyunaccesses and fences, such as those corresponding to smp_load_acquire()
2461*4882a593Smuzhiyuncalls.  In the formal model, these events aren't actually both reads
2462*4882a593Smuzhiyunand fences; rather, they are read events with an annotation marking
2463*4882a593Smuzhiyunthem as acquires.  (Or write events annotated as releases, in the case
2464*4882a593Smuzhiyunsmp_store_release().)  The final effect is the same.
2465*4882a593Smuzhiyun
2466*4882a593SmuzhiyunAlthough we didn't mention it above, the instruction execution
2467*4882a593Smuzhiyunordering provided by the smp_rmb() fence doesn't apply to read events
2468*4882a593Smuzhiyunthat are part of a non-value-returning atomic update.  For instance,
2469*4882a593Smuzhiyungiven:
2470*4882a593Smuzhiyun
2471*4882a593Smuzhiyun	atomic_inc(&x);
2472*4882a593Smuzhiyun	smp_rmb();
2473*4882a593Smuzhiyun	r1 = READ_ONCE(y);
2474*4882a593Smuzhiyun
2475*4882a593Smuzhiyunit is not guaranteed that the load from y will execute after the
2476*4882a593Smuzhiyunupdate to x.  This is because the ARMv8 architecture allows
2477*4882a593Smuzhiyunnon-value-returning atomic operations effectively to be executed off
2478*4882a593Smuzhiyunthe CPU.  Basically, the CPU tells the memory subsystem to increment
2479*4882a593Smuzhiyunx, and then the increment is carried out by the memory hardware with
2480*4882a593Smuzhiyunno further involvement from the CPU.  Since the CPU doesn't ever read
2481*4882a593Smuzhiyunthe value of x, there is nothing for the smp_rmb() fence to act on.
2482*4882a593Smuzhiyun
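If ordering is needed in a situation like this, one possibility (a
sketch based on the description of smp_mb__after_atomic() given
below) is to use that fence in place of the smp_rmb():

	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);

Here the fence does order the atomic update of x against the po-later
load of y.
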
2483*4882a593SmuzhiyunThe LKMM defines a few extra synchronization operations in terms of
2484*4882a593Smuzhiyunthings we have already covered.  In particular, rcu_dereference() is
2485*4882a593Smuzhiyuntreated as READ_ONCE() and rcu_assign_pointer() is treated as
2486*4882a593Smuzhiyunsmp_store_release() -- which is basically how the Linux kernel treats
2487*4882a593Smuzhiyunthem.
2488*4882a593Smuzhiyun
2489*4882a593SmuzhiyunAlthough we said that plain accesses are not linked by the ppo
2490*4882a593Smuzhiyunrelation, they do contribute to it indirectly.  Namely, when there is
2491*4882a593Smuzhiyunan address dependency from a marked load R to a plain store W,
2492*4882a593Smuzhiyunfollowed by smp_wmb() and then a marked store W', the LKMM creates a
2493*4882a593Smuzhiyunppo link from R to W'.  The reasoning behind this is perhaps a little
2494*4882a593Smuzhiyunshaky, but essentially it says there is no way to generate object code
2495*4882a593Smuzhiyunfor this source code in which W' could execute before R.  Just as with
2496*4882a593Smuzhiyunpre-bounding by address dependencies, it is possible for the compiler
2497*4882a593Smuzhiyunto undermine this relation if sufficient care is not taken.
2498*4882a593Smuzhiyun
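As a sketch of the pattern just described (ptr, x, and the stored
values are hypothetical):

	int *p;

	p = READ_ONCE(ptr);	/* R: marked load */
	*p = 1;			/* W: plain store, address-dependent on R */
	smp_wmb();
	WRITE_ONCE(x, 2);	/* W': marked store */

Here the LKMM creates a ppo link from R to W'.
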
2499*4882a593SmuzhiyunThere are a few oddball fences which need special treatment:
2500*4882a593Smuzhiyunsmp_mb__before_atomic(), smp_mb__after_atomic(), and
2501*4882a593Smuzhiyunsmp_mb__after_spinlock().  The LKMM uses fence events with special
2502*4882a593Smuzhiyunannotations for them; they act as strong fences just like smp_mb()
2503*4882a593Smuzhiyunexcept for the sets of events that they order.  Instead of ordering
2504*4882a593Smuzhiyunall po-earlier events against all po-later events, as smp_mb() does,
2505*4882a593Smuzhiyunthey behave as follows:
2506*4882a593Smuzhiyun
2507*4882a593Smuzhiyun	smp_mb__before_atomic() orders all po-earlier events against
2508*4882a593Smuzhiyun	po-later atomic updates and the events following them;
2509*4882a593Smuzhiyun
2510*4882a593Smuzhiyun	smp_mb__after_atomic() orders po-earlier atomic updates and
2511*4882a593Smuzhiyun	the events preceding them against all po-later events;
2512*4882a593Smuzhiyun
	smp_mb__after_spinlock() orders po-earlier lock acquisition
2514*4882a593Smuzhiyun	events and the events preceding them against all po-later
2515*4882a593Smuzhiyun	events.
2516*4882a593Smuzhiyun
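For instance (a small sketch with hypothetical variables), the first
of these fences might be used as follows:

	WRITE_ONCE(x, 1);
	smp_mb__before_atomic();
	atomic_inc(&y);
	r2 = READ_ONCE(z);

The smp_mb__before_atomic() orders the store to x against the atomic
update of y and against the load of z that follows it.
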
2517*4882a593SmuzhiyunInterestingly, RCU and locking each introduce the possibility of
2518*4882a593Smuzhiyundeadlock.  When faced with code sequences such as:
2519*4882a593Smuzhiyun
2520*4882a593Smuzhiyun	spin_lock(&s);
2521*4882a593Smuzhiyun	spin_lock(&s);
2522*4882a593Smuzhiyun	spin_unlock(&s);
2523*4882a593Smuzhiyun	spin_unlock(&s);
2524*4882a593Smuzhiyun
2525*4882a593Smuzhiyunor:
2526*4882a593Smuzhiyun
2527*4882a593Smuzhiyun	rcu_read_lock();
2528*4882a593Smuzhiyun	synchronize_rcu();
2529*4882a593Smuzhiyun	rcu_read_unlock();
2530*4882a593Smuzhiyun
2531*4882a593Smuzhiyunwhat does the LKMM have to say?  Answer: It says there are no allowed
2532*4882a593Smuzhiyunexecutions at all, which makes sense.  But this can also lead to
2533*4882a593Smuzhiyunmisleading results, because if a piece of code has multiple possible
2534*4882a593Smuzhiyunexecutions, some of which deadlock, the model will report only on the
2535*4882a593Smuzhiyunnon-deadlocking executions.  For example:
2536*4882a593Smuzhiyun
2537*4882a593Smuzhiyun	int x, y;
2538*4882a593Smuzhiyun
2539*4882a593Smuzhiyun	P0()
2540*4882a593Smuzhiyun	{
2541*4882a593Smuzhiyun		int r0;
2542*4882a593Smuzhiyun
2543*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
2544*4882a593Smuzhiyun		r0 = READ_ONCE(y);
2545*4882a593Smuzhiyun	}
2546*4882a593Smuzhiyun
2547*4882a593Smuzhiyun	P1()
2548*4882a593Smuzhiyun	{
2549*4882a593Smuzhiyun		rcu_read_lock();
2550*4882a593Smuzhiyun		if (READ_ONCE(x) > 0) {
2551*4882a593Smuzhiyun			WRITE_ONCE(y, 36);
2552*4882a593Smuzhiyun			synchronize_rcu();
2553*4882a593Smuzhiyun		}
2554*4882a593Smuzhiyun		rcu_read_unlock();
2555*4882a593Smuzhiyun	}
2556*4882a593Smuzhiyun
2557*4882a593SmuzhiyunIs it possible to end up with r0 = 36 at the end?  The LKMM will tell
2558*4882a593Smuzhiyunyou it is not, but the model won't mention that this is because P1
2559*4882a593Smuzhiyunwill self-deadlock in the executions where it stores 36 in y.
2560