This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or only one memory location is accessed, and the
second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores.
	There are exceptions to this rule, including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
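
Here is a minimal sketch illustrating rule 4 and exception (i), assuming
a hypothetical counter protected by a hypothetical counter_lock; it is
not taken from any particular kernel source file:

	static DEFINE_SPINLOCK(counter_lock);
	static unsigned long counter;

	void counter_add(unsigned long n)	/* update side */
	{
		spin_lock(&counter_lock);
		/*
		 * Exception (i): no other CPU can update counter while
		 * counter_lock is held, so this read need not use
		 * READ_ONCE().  The write still uses WRITE_ONCE()
		 * because lockless readers may run concurrently.
		 */
		WRITE_ONCE(counter, counter + n);
		spin_unlock(&counter_lock);
	}

	unsigned long counter_read(void)	/* lockless read side */
	{
		/* Concurrent updates are possible, so use READ_ONCE(). */
		return READ_ONCE(counter);
	}

Note that exception (ii) would permit even the update to use a plain
store, for example, during early boot before other CPUs can access
counter.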


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}
The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.
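
Below is a minimal sketch of the same publish/consume idea applied to a
flag-protected buffer; the names buf, buf_ready, buf_fill() and
buf_read() are hypothetical and are not taken from lib/stackdepot.c:

	static int buf[16];
	static int buf_ready;

	void buf_fill(void)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(buf); i++)
			buf[i] = i;			/* plain initialization... */
		smp_store_release(&buf_ready, 1);	/* ...ordered before the flag */
	}

	int buf_read(int i)
	{
		if (!smp_load_acquire(&buf_ready))	/* flag ordered before... */
			return -EBUSY;
		return buf[i];				/* ...any reads of the buffer */
	}

Once a reader observes a nonzero buf_ready, the acquire load guarantees
that it also observes the fully initialized contents of buf, just as the
MP litmus test above guarantees that r0 == 1 implies r1 == 1.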


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.
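
The following is a hedged, minimal sketch of that general pattern, not
the actual lib/math/prime_numbers.c code: a hypothetical bit vector is
reallocated at a larger size, fully initialized, and only then published
with rcu_assign_pointer(), while readers pick it up with
rcu_dereference() under rcu_read_lock():

	struct bitvec {
		unsigned long nbits;
		unsigned long bits[];
	};

	static struct bitvec __rcu *cur_vec;
	static DEFINE_MUTEX(expand_mutex);

	static int vec_expand(unsigned long nbits)
	{
		struct bitvec *old, *new;

		mutex_lock(&expand_mutex);
		old = rcu_dereference_protected(cur_vec,
						lockdep_is_held(&expand_mutex));
		if (old && old->nbits >= nbits) {
			mutex_unlock(&expand_mutex);
			return 0;	/* Already big enough. */
		}
		new = kzalloc(sizeof(*new) +
			      BITS_TO_LONGS(nbits) * sizeof(long), GFP_KERNEL);
		if (!new) {
			mutex_unlock(&expand_mutex);
			return -ENOMEM;
		}
		new->nbits = nbits;
		if (old)	/* Copy the old contents before publication. */
			memcpy(new->bits, old->bits,
			       BITS_TO_LONGS(old->nbits) * sizeof(long));
		rcu_assign_pointer(cur_vec, new); /* Publish only when initialized. */
		mutex_unlock(&expand_mutex);
		/* Freeing the old vector is omitted; it would need an
		 * rcu_head and kfree_rcu(). */
		return 0;
	}

	static bool vec_test(unsigned long bit)
	{
		struct bitvec *v;
		bool ret = false;

		rcu_read_lock();
		v = rcu_dereference(cur_vec); /* Orders against dependent reads below. */
		if (v && bit < v->nbits)
			ret = test_bit(bit, v->bits);
		rcu_read_unlock();
		return ret;
	}

A reader that sees the newly published pointer is guaranteed, via the
address dependency headed by rcu_dereference(), to also see the copied
and newly initialized bits.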


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }
	 *
	 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.
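
To make the pairings concrete, here is a hedged, heavily simplified
sketch of a single-producer/single-consumer ring buffer in the style of
the comment above; struct ring, RING_SIZE, produce() and consume() are
hypothetical and this is not the actual perf ring-buffer code.
CIRC_SPACE() and CIRC_CNT() are from include/linux/circ_buf.h, and both
sides are written with kernel barrier names to mirror the comment:

	#define RING_SIZE 256			/* must be a power of two */

	struct ring {
		unsigned long head;		/* written only by the producer */
		unsigned long tail;		/* written only by the consumer */
		int buf[RING_SIZE];
	};

	bool produce(struct ring *r, int item)	/* the "kernel" side */
	{
		unsigned long head = r->head;
		unsigned long tail = READ_ONCE(r->tail);	/* A: LOAD ->tail */

		if (CIRC_SPACE(head, tail, RING_SIZE) < 1)
			return false;	/* control dependency: A before the data store */

		r->buf[head & (RING_SIZE - 1)] = item;		/* STORE $data */
		smp_wmb();					/* B */
		WRITE_ONCE(r->head, head + 1);			/* STORE ->head */
		return true;
	}

	bool consume(struct ring *r, int *item)	/* the "user" side */
	{
		unsigned long tail = r->tail;
		unsigned long head = READ_ONCE(r->head);	/* LOAD ->head */

		smp_rmb();					/* C */
		if (CIRC_CNT(head, tail, RING_SIZE) < 1)
			return false;

		*item = r->buf[tail & (RING_SIZE - 1)];		/* LOAD $data */
		smp_mb();					/* D */
		WRITE_ONCE(r->tail, tail + 1);			/* STORE ->tail */
		return true;
	}

The B/C pairing (MP) keeps the consumer from reading a slot before its
contents are visible, and the A/D pairing (LB) keeps the producer from
overwriting a slot that the consumer has not finished reading.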


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
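
For reference, here is a sketch of that relaxed variant of CPU1();
CPU0() and CPU2() remain exactly as shown above:

	void CPU1(void)
	{
		r0 = READ_ONCE(y);		/* acquire relaxed to READ_ONCE() */
		smp_store_release(&z, 1);
	}

The smp_store_release() to z still orders CPU1()'s prior read of y
before the store to z, which is what keeps the chain intact.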

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.  In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".
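
As a hedged illustration of these three link types, here is an
annotation of the Z6.0 test from the locking section above, for the
outcome in which r0 == 1, the final value of z is 2, and r1 == 0:

	CPU0: WRITE_ONCE(y, 1)   --rf-->  CPU1: r0 = READ_ONCE(y)
	CPU1: WRITE_ONCE(z, 1)   --co-->  CPU2: WRITE_ONCE(z, 2)
	CPU2: r1 = READ_ONCE(x)  --fr-->  CPU0: WRITE_ONCE(x, 1)

Two of these links (co and fr) are not write-to-read, so the third rule
of thumb below applies, which is consistent with the locking version of
this test needing smp_mb__after_spinlock() on CPU1() in addition to the
smp_mb() already present on CPU2().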

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.