This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken
but attractive nuisances.  Many of these recipes include example code
from v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed,
and the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores.
	There are exceptions to this rule, including the following
	(see also the sketch after this list):

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
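
The following sketch illustrates rule 4 together with exception i.
It is a hedged example using made-up names (my_lock, shared_count,
my_update(), and my_read()) rather than actual kernel code: lockless
readers use READ_ONCE(), the holder of the update-side lock may use a
plain read of the variable that only it can change, but its store still
uses WRITE_ONCE() because lockless readers can run concurrently.

	/*
	 * Hedged sketch of rule 4 and exception i; my_lock,
	 * shared_count, my_update(), and my_read() are illustrative
	 * names, not existing kernel code.
	 */
	static DEFINE_SPINLOCK(my_lock);
	static int shared_count;

	void my_update(void)
	{
		spin_lock(&my_lock);
		/*
		 * Exception i: no other CPU can update shared_count
		 * while my_lock is held, so the read may be plain, but
		 * the store still uses WRITE_ONCE() because lockless
		 * readers may run concurrently.
		 */
		WRITE_ONCE(shared_count, shared_count + 1);
		spin_unlock(&my_lock);
	}

	int my_read(void)	/* May be called without holding my_lock. */
	{
		/*
		 * Rule 4: lockless reads use READ_ONCE() to prevent
		 * load tearing, load fusing, and invented loads.
		 */
		return READ_ONCE(shared_count);
	}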


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

 *   kernel                             user
 *
 *   if (LOAD ->data_tail) {            LOAD ->data_head
 *                      (A)             smp_rmb()       (C)
 *      STORE $data                     LOAD $data
 *      smp_wmb()       (B)             smp_mb()        (D)
 *      STORE ->data_head               STORE ->data_tail
 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.
Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

 *   kernel                             user
 *
 *   if (LOAD ->data_tail) {            LOAD ->data_head
 *                      (A)             smp_rmb()       (C)
 *      STORE $data                     LOAD $data
 *      smp_wmb()       (B)             smp_mb()        (D)
 *      STORE ->data_head               STORE ->data_tail
 *   }
 *
 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.  In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
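
For concreteness, here is a minimal sketch of such a litmus test in the
format used by the Linux-kernel memory model, patterned on the SB
example above; the authoritative version, including result annotations,
is tools/memory-model/litmus-tests/SB+fencembonceonces.litmus.

	C SB+fencembonceonces

	{}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		smp_mb();
		r0 = READ_ONCE(*y);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)

The "exists" clause names the counter-intuitive outcome.  Assuming the
herd7 tool and the model in tools/memory-model/ are available, the test
can be checked from that directory with a command along the lines of
"herd7 -conf linux-kernel.cfg litmus-tests/SB+fencembonceonces.litmus",
which reports whether the model allows that outcome; see
tools/memory-model/README for details.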