Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <stern@rowland.harvard.edu>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
 10. THE READS-FROM RELATION: rf, rfi, and rfe
 11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
 12. THE FROM-READS RELATION: fr, fri, and fre
 13. AN OPERATIONAL MODEL
 14. PROPAGATION ORDER RELATION: cumul-fence
 15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
 16. SEQUENTIAL CONSISTENCY PER VARIABLE
 17. ATOMIC UPDATES: rmw
 18. THE PRESERVED PROGRAM ORDER RELATION: ppo
 19. AND THEN THERE WAS ALPHA
 20. THE HAPPENS-BEFORE RELATION: hb
 21. THE PROPAGATES-BEFORE RELATION: pb
 22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
 23. LOCKING
 24. PLAIN ACCESSES AND DATA RACES
 25. ODDS AND ENDS


INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure. This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.

This document describes the ideas underlying the LKMM. It is meant
for people who want to understand how the model was designed. It does
not go into the details of the code in the .bell and .cat files;
rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
the basic notions shared by all such models. People already familiar
with these concepts can skim or skip over them. Sections 6 (EVENTS)
through 12 (THE FROM-READS RELATION) describe the fundamental
relations used in many models. Starting in Section 13 (AN OPERATIONAL
MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the
proper format for litmus tests.
They don't include a header line, the
initializations are not enclosed in braces, the global variables are
not passed by pointers, and they don't have an "exists" clause at the
end. Converting them to the right format is left as an exercise for
the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is
something which predicts, given a piece of computer code running on a
particular kind of system, what values may be obtained by the code's
load instructions. The LKMM makes these predictions for code running
as part of the Linux kernel.

In practice, people tend to use memory models the other way around.
That is, given a piece of code and a collection of values specified
for the loads, the model will predict whether it is possible for the
code to run in such a way that the loads will indeed obtain the
specified values. Of course, this is just another way of expressing
the same idea.

For code running on a uniprocessor system, the predictions are easy:
Each load instruction must obtain the value written by the most recent
store instruction accessing the same location (we ignore complicating
factors such as DMA and mixed-size accesses). But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux
kernel supports a variety of architectures. The LKMM has to be fairly
permissive, in the sense that any behavior allowed by one of these
architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts. Consider
some code running as part of a device driver for an input device. The
driver might contain an interrupt handler which collects data from the
device, stores it in a buffer, and sets a flag to indicate the buffer
is full. Running concurrently on a different CPU might be a part of
the driver code being executed by a process in the midst of a read(2)
system call. This code tests the flag to see whether the buffer is
ready, and if it is, copies the data back to userspace. The buffer
and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
assignment statements is discussed later):

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }

Here the P0() function represents the interrupt handler running on one
CPU and P1() represents the read() routine running on another. The
value 1 stored in buf represents input data collected from the device.
Thus, P0 stores the data in buf and then sets flag. Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
data from buf into a second private variable r2 for copying to
userspace. (Presumably if flag is not set then the driver will wait a
while and try again.)

This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" or MP
pattern. It is typical of memory access patterns in the kernel.

Please note that this example code is a simplified abstraction. Real
buffers are usually larger than a single integer, real device drivers
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
private variables before using them. All that is beside the point;
the idea here is simply to illustrate the overall pattern of memory
accesses by the CPUs.

A memory model will predict what values P1 might obtain for its loads
from flag and buf, or equivalently, what values r1 and r2 might end up
with after the code has finished running.

Some predictions are trivial. For instance, no sane memory model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple. For
instance, P1 might run entirely before P0 begins, in which case r1 and
r2 will both be 0 at the end. Or P0 might run entirely before P1
begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two
routines run concurrently.
One possibility is that P1 runs after P0's
store to buf but before the store to flag. In this case, r1 and r2
will again both be 0. (If P1 had been designed to read buf
unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0.
If this were to occur it would mean the driver contains a bug, because
incorrect data would get sent to the user: 0 instead of 1. As it
happens, the LKMM does predict this outcome can occur, and the example
driver code shown above is indeed buggy.


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand,
is Sequential Consistency. According to this model, systems behave as
if each CPU executed its instructions in order but with unspecified
timing. In other words, the instructions from the various CPUs get
interleaved in a nondeterministic way, always according to some single
global order that agrees with the order of the instructions in the
program source for each CPU. The model says that the value obtained
by each load is simply the value written by the most recently executed
store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts
that the undesired result r1 = 1, r2 = 0 cannot occur. The reasoning
goes like this:

        Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from
        it, as loads can obtain values only from earlier stores.

        P1 loads from flag before loading from buf, since CPUs execute
        their instructions in order.

        P1 must load 0 from buf before P0 stores 1 to it; otherwise r2
        would be 1 since a load obtains its value from the most recent
        store to the same address.

        P0 stores 1 to buf before storing 1 to flag, since it executes
        its instructions in order.

        Since an instruction (in this case, P0's store to flag) cannot
        execute before itself, the specified outcome is impossible.

However, real computer hardware almost never follows the Sequential
Consistency memory model; doing so would rule out too many valuable
performance optimizations. On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
r2 = 0.

x86 and SPARC follow yet a different memory model: TSO (Total Store
Ordering).
This model predicts that the undesired outcome for the MP
pattern cannot occur, but in other respects it differs from Sequential
Consistency. One example is the Store Buffer (SB) pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location:

        int x = 0, y = 0;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
        }

        P1()
        {
                int r1;

                WRITE_ONCE(y, 1);
                r1 = READ_ONCE(x);
        }

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is
impossible. (Exercise: Figure out the reasoning.) But TSO allows
this outcome to occur, and in fact it does sometimes occur on x86 and
SPARC systems.

The LKMM was inspired by the memory models followed by PowerPC, ARM,
x86, Alpha, and other architectures. However, it is different in
detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering. Often this is temporal ordering
(i.e., the order in which certain events occur) but it doesn't have to
be; consider for example the order of instructions in a program's
source code. We saw above that Sequential Consistency makes an
important assumption that CPUs execute instructions in the same order
as those instructions occur in the code, and there are many other
instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle. Ordering rules out cycles:
It's not possible to have X ordered before Y, Y ordered before Z, and
Z ordered before X, because this would mean that X is ordered before
itself. The analysis of the MP example under Sequential Consistency
involved just such an impossible cycle:

        W: P0 stores 1 to flag   executes before
        X: P1 loads 1 from flag  executes before
        Y: P1 loads 0 from buf   executes before
        Z: P0 stores 1 to buf    executes before
        W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered,
and a certain outcome for the loads in a piece of code can happen only
if those accesses would form a cycle, then the memory model predicts
that outcome cannot occur.

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up
kernel source code. Instead it considers the effects of those
statements in a more abstract form, namely, events. The model
includes three types of events:

        Read events correspond to loads from shared memory, such as
        calls to READ_ONCE(), smp_load_acquire(), or
        rcu_dereference().

        Write events correspond to stores to shared memory, such as
        calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

        Fence events correspond to memory barriers (also known as
        fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be
a fence. This happens with functions like smp_load_acquire() or
spin_lock(). However, no single event can be both a read and a write.
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
correspond to a pair of events: a read followed by a write. (The
write event is omitted for executions where it doesn't occur, such as
a cmpxchg() where the comparison fails.)

Other parts of the code, those which do not involve interaction with
shared memory, do not give rise to events. Thus, arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers are not of central interest to the
memory model. They only affect the model's predictions indirectly.
For example, an arithmetic computation might determine the value that
gets stored to a shared memory location (or in the case of an array
index, the address where the value gets stored), but the memory model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will
describe in the following sections. The memory model requires certain
of these relations to be orderings, that is, it requires them not to
have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po). You
can think of it as the order in which statements occur in the source
code after branches are taken into account and loops have been
unrolled. A better description might be the order in which
instructions are presented to a CPU's execution unit.
Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po. Also, it is by definition
an ordering so it cannot have any cycles.

po-loc is a sub-relation of po. It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).

Although this may seem straightforward, there is one subtle aspect to
program order we need to explain. The LKMM was inspired by low-level
architectural memory models which describe the behavior of machine
code, and it retains their outlook to a considerable extent. The
read, write, and fence events used by the model are close in spirit to
individual machine instructions. Nevertheless, the LKMM describes
kernel code written in C, and the mapping from C to machine code can
be extremely complex.

Optimizing compilers have great freedom in the way they translate
source code to object code. They are allowed to apply transformations
that add memory accesses, eliminate accesses, combine them, split them
into pieces, or move them around. The use of READ_ONCE(), WRITE_ONCE(),
or one of the other atomic or synchronization primitives prevents a
large number of compiler optimizations. In particular, it is guaranteed
that the compiler will not remove such accesses from the generated code
(unless it can prove the accesses will never be executed), it will not
change the order in which they occur in the code (within limits imposed
by the C standard), and it will not introduce extraneous accesses.

The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
than ordinary memory accesses. Thanks to this usage, we can be certain
that in the MP example, the compiler won't reorder P0's write event to
buf and P0's write event to flag, and similarly for the other shared
memory accesses in the examples.

Since private variables are not shared between CPUs, they can be
accessed normally without READ_ONCE() or WRITE_ONCE(). In fact, they
need not even be stored in normal memory at all -- in principle a
private variable could be stored in a CPU register (hence the convention
that these variables have names starting with the letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect; under some circumstances it is possible for the
compiler to undermine the memory model. Here is an example. Suppose
both branches of an "if" statement store the same value to the same
location:

        r1 = READ_ONCE(x);
        if (r1) {
                WRITE_ONCE(y, 2);
                ...  /* do something */
        } else {
                WRITE_ONCE(y, 2);
                ...  /* do something else */
        }

For this code, the LKMM predicts that the load from x will always be
executed before either of the stores to y. However, a compiler could
lift the stores out of the conditional, transforming the code into
something resembling:

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, 2);
        if (r1) {
                ...  /* do something */
        } else {
                ...  /* do something else */
        }

Given this version of the code, the LKMM would predict that the load
from x could be executed after the store to y. Thus, the memory
model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many
operators and function calls can be evaluated in any order. For
example:

        r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the
memory model cannot assume there is a fixed program order relation
between them. (In fact, if the function calls are inlined then the
compiler might even interleave their object code.)


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the
execution of the second event depends in some way on a value obtained
from memory by the first. The first event must be a read, and the
value it obtains must somehow affect what the second event does.
There are three kinds of dependencies: data, address (addr), and
control (ctrl).

A read and a write event are linked by a data dependency if the value
obtained by the read affects the value stored by the write.
As a very
simple example:

        int x, y;

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value
loaded by the READ_ONCE. Such dependencies can wind through
arbitrarily complicated computations, and a write can depend on the
values of multiple reads.

A read event and another memory access event are linked by an address
dependency if the value obtained by the read affects the location
accessed by the other event. The second event can be either a read or
a write. Here's another simple example:

        int a[20];
        int i;

        r1 = READ_ONCE(i);
        r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the
index value loaded by the first. Pointer indirection also gives rise
to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.

Finally, a read event and another memory access event are linked by a
control dependency if the value obtained by the read affects whether
the second event is executed at all. Simple example:

        int x, y;

        r1 = READ_ONCE(x);
        if (r1)
                WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression
which depends on the value obtained by the READ_ONCE(); hence there is
a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that
come earlier in program order. Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X. It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value loaded by the read is the value that was stored by the
write. In colloquial terms, the load "reads from" the store. We
write W ->rf R to indicate that the load R reads from the store W. We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).
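
As an illustration, here is a sketch in the same informal style as the
earlier examples (it is not part of the formal model):

        int x = 0;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(x);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);
        }

Here r0 will be 1, and since P0's load reads from P0's own store, those
two events are linked by rfi. If r1 = 1 at the end then P1's load
reads from P0's store on a different CPU, and those two events are
linked by rfe.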

For our purposes, a memory location's initial value is treated as
though it had been written there by an imaginary initial store that
executes on a separate CPU before the main program runs.

Usage of the rf relation implicitly assumes that loads will always
read from a single store. It doesn't apply properly in the presence
of load-tearing, where a load obtains some of its bits from one store
and some of them from another store. Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

        int x = 0;

        P0()
        {
                WRITE_ONCE(x, 0x1234);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);
        }

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size
accesses are used. Consider this example:

        union {
                u32 w;
                u16 h[2];
        } x;

        P0()
        {
                WRITE_ONCE(x.h[0], 0x1234);
                WRITE_ONCE(x.h[1], 0x5678);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x.w);
        }

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores. It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt to do so. It requires all accesses to be properly aligned and
of the location's actual size.


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a
multi-processor system, the CPUs must share a consistent view of the
memory contents. Specifically, it requires that for each location in
shared memory, the stores to that location must form a single global
ordering which all the CPUs agree on (the coherence order), and this
ordering must be consistent with the program order for accesses to
that location.

To put it another way, for any variable x, the coherence order (co) of
the stores to x is simply the order in which the stores overwrite one
another.
The imaginary store which establishes x's initial value
comes first in the coherence order; the store which directly
overwrites the initial value comes second; the store which overwrites
that value comes third, and so on.

You can think of the coherence order as being the order in which the
stores reach x's location in memory (or if you prefer a more
hardware-centric view, the order in which the stores get written to
x's cache line). We write W ->co W' if W comes before W' in the
coherence order, that is, if the value stored by W gets overwritten,
directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order. This
requirement takes the form of four coherency rules:

        Write-write coherence: If W ->po-loc W' (i.e., W comes before
        W' in program order and they access the same location), where W
        and W' are two stores, then W ->co W'.

        Write-read coherence: If W ->po-loc R, where W is a store and R
        is a load, then R must read from W or from some other store
        which comes after W in the coherence order.

        Read-write coherence: If R ->po-loc W, where R is a load and W
        is a store, then the store which R reads from must come before
        W in the coherence order.

        Read-read coherence: If R ->po-loc R', where R and R' are two
        loads, then either they read from the same store or else the
        store read by R comes before the store read by R' in the
        coherence order.

This is sometimes referred to as sequential consistency per variable,
because it means that the accesses to any single memory location obey
the rules of the Sequential Consistency memory model. (According to
Wikipedia, sequential consistency per variable and cache coherence
mean the same thing except that cache coherence includes an extra
requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence. Indeed, our
expectation of cache coherence is so deeply ingrained that violations
of its requirements look more like hardware bugs than programming
errors:

        int x;

        P0()
        {
                WRITE_ONCE(x, 17);
                WRITE_ONCE(x, 23);
        }

If the final value stored in x after this code ran was 17, you would
think your computer was broken.
It would be a violation of the
write-write coherence rule: Since the store of 23 comes later in
program order, it must also come later in x's coherence order and
thus must overwrite the store of 17.

        int x = 0;

        P0()
        {
                int r1;

                r1 = READ_ONCE(x);
                WRITE_ONCE(x, 666);
        }

If r1 = 666 at the end, this would violate the read-write coherence
rule: The READ_ONCE() load comes before the WRITE_ONCE() store in
program order, so it must not read from that store but rather from one
coming earlier in the coherence order (in this case, x's initial
value).

        int x = 0;

        P0()
        {
                WRITE_ONCE(x, 5);
        }

        P1()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                r2 = READ_ONCE(x);
        }

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
imaginary store which establishes x's initial value) at the end, this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so it must not read from a store that
comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of
READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5
and r2 = 0! This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another motivation for using READ_ONCE() when accessing shared memory
locations.)

Just like the po relation, co is inherently an ordering -- it is not
possible for a store to directly or indirectly overwrite itself! And
just like with the rf relation, we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never
related by co, just as instructions on different CPUs are never
related by po. Coherence order is strictly per-location, or if you
prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.
It describes the situation where a load reads a value that gets
overwritten by a store. In other words, we have R ->fr W when the
value that R reads is overwritten (directly or indirectly) by W, or
equivalently, when R reads from a store which comes earlier than W in
the coherence order.

For example:

        int x = 0;

        P0()
        {
                int r1;

                r1 = READ_ONCE(x);
                WRITE_ONCE(x, 2);
        }

The value loaded from x will be 0 (assuming cache coherence!), and it
gets overwritten by the value 2. Thus there is an fr link from the
READ_ONCE() to the WRITE_ONCE(). If the code contained any later
stores to x, there would also be fr links from the READ_ONCE() to
them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that the fr relation is determined entirely by the rf and co
relations; it is not independent. Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W. In symbols,

        (R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that
the models arise from an abstract view of how a computer system
operates. Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem.
The CPUs are responsible for executing instructions (not necessarily
in program order), and they communicate with the memory subsystem.
For the most part, executing an instruction requires a CPU to perform
only internal operations. However, loads, stores, and fences involve
more.

When CPU C executes a store instruction, it tells the memory subsystem
to store a certain value at a certain location. The memory subsystem
propagates the store to all the other CPUs as well as to RAM. (As a
special case, we say that the store propagates to its own CPU at the
time it is executed.) The memory subsystem also determines where the
store falls in the location's coherence order. In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order. If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R. Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory. The memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.

(In fact, the picture needs to be a little more complicated than this.
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache. A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed. On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model. But on Alpha, the local caches
have a partitioned design that results in non-FIFO behavior. We will
discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be
restarted under certain circumstances. The memory model ignores these
premature executions; we simply say that the load executes at the
final time it is forwarded or satisfied.
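
For instance, consider this single-CPU sketch (in the same informal
style as the earlier examples):

        int x = 0;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 1);
                r1 = READ_ONCE(x);
        }

If the store to x has not yet executed when the load executes, the
store's value is forwarded to the load; if the store has already
executed then, by the special case mentioned above, it has propagated
to P0 and the load is satisfied from memory. Either way r1 ends up
equal to 1, since no other CPU stores to x.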

Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence. However, fences do constrain the way CPUs and the
memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in
program order. Exactly which instructions are ordered depends on the
type of fence:

        Strong fences, including smp_mb() and synchronize_rcu(), force
        the CPU to execute all po-earlier instructions before any
        po-later instructions;

        smp_rmb() forces the CPU to execute all po-earlier loads
        before any po-later loads;

        smp_wmb() forces the CPU to execute all po-earlier stores
        before any po-later stores;

        Acquire fences, such as smp_load_acquire(), force the CPU to
        execute the load associated with the fence (e.g., the load
        part of an smp_load_acquire()) before any po-later
        instructions;

        Release fences, such as smp_store_release(), force the CPU to
        execute all po-earlier instructions before the store
        associated with the fence (e.g., the store part of an
        smp_store_release()).

Second, some types of fence affect the way the memory subsystem
propagates stores. When a fence instruction is executed on CPU C:

        For each other CPU C', smp_wmb() forces all po-earlier stores
        on C to propagate to C' before any po-later stores do.

        For each other CPU C', any store which propagates to C before
        a release fence is executed (including all po-earlier
        stores executed on C) is forced to propagate to C' before the
        store associated with the release fence does.

        Any store which propagates to C before a strong fence is
        executed (including all po-earlier stores on C) is forced to
        propagate to all other CPUs before any instructions po-after
        the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence. We describe this property by saying that release fences and
strong fences are A-cumulative. By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).
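
To see what A-cumulativity adds, here is a sketch in the same informal
style as the earlier examples (the claim about the outcome follows
from the execution-ordering and propagation rules just described):

        int x = 0, y = 0;

        P0()
        {
                WRITE_ONCE(x, 1);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);
                smp_store_release(&y, 1);
        }

        P2()
        {
                int r2, r3;

                r2 = smp_load_acquire(&y);
                r3 = READ_ONCE(x);
        }

Suppose r1 = 1 and r2 = 1 at the end. Then P0's store to x propagated
to P1 before P1 executed its release fence, so A-cumulativity forces
that store to propagate to every other CPU before P1's store to y
does. Since the acquire fence makes P2 execute its load of x after
its load of y, and the load of y obtained 1, the load of x must obtain
1 as well; r3 = 0 is impossible. Had P1 used smp_wmb() followed by
WRITE_ONCE(y, 1) instead, the fence would not have covered P0's store
to x, and the outcome r3 = 0 could not be ruled out this way.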

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative. The cumul-fence relation is
defined to link memory access events E and F whenever:

        E and F are both stores on the same CPU and an smp_wmb() fence
        event occurs between them in program order; or

        F is a release fence and some X comes before F in program order,
        where either X = E or else E ->rf X; or

        A strong fence event occurs between some X and F in program
        order, where either X = E or else E ->rf X.

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does. However, for different CPUs C and C', it does not
require W to propagate to C before W' propagates to C'.


DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design
outlined above. These restrictions involve the necessity of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is, among other things.

The formal version of the LKMM is defined by six requirements, or
axioms:

        Sequential consistency per variable: This requires that the
        system obey the four coherency rules.

        Atomicity: This requires that atomic read-modify-write
        operations really are atomic, that is, no other stores can
        sneak into the middle of such an update.

        Happens-before: This requires that certain instructions are
        executed in a specific order.

        Propagation: This requires that certain stores propagate to
        CPUs and to RAM in a specific order.

        Rcu: This requires that RCU read-side critical sections and
        grace periods obey the rules of RCU, in particular, the
        Grace-Period Guarantee.

        Plain-coherence: This requires that plain memory accesses
        (those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
        the operational model's rules regarding cache coherence.

The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11). The "happens-before" and
"propagation" axioms have analogs in other memory models as well. The
"rcu" and "plain-coherence" axioms are specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed
shared location in memory form a global ordering. We can imagine
inserting the loads from that location into this ordering, by placing
each load between the store that it reads from and the following
store. This leaves the relative positions of loads that read from the
same store unspecified; let's say they are inserted in program order,
first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering. The LKMM's
"coherence" axiom expresses this by requiring the union of these
relations not to have any cycles. This means it must not be possible
to find events

        X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc. This has to
hold if the accesses to the fixed memory location can be ordered as
cache coherence demands.

Although it is not obvious, it can be shown that the converse is also
true: This LKMM axiom implies that the four coherency rules are
obeyed.


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic? It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation. In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two. We should never have
the following sequence of events:

        CPU 0 loads x obtaining 13;
        CPU 1 loads x obtaining 13;
        CPU 0 stores 14 to x;
        CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store. To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.
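
Written in the style of the earlier examples, the scenario looks
roughly like this (a sketch; atomic_t and ATOMIC_INIT() are the
kernel's type and initializer for such variables):

        atomic_t x = ATOMIC_INIT(13);

        P0()
        {
                atomic_inc(&x);
        }

        P1()
        {
                atomic_inc(&x);
        }

Because the LKMM requires each atomic_inc() to be atomic, the final
value of x here must be 15; the lost-update sequence shown above,
which would leave x equal to 14, is forbidden.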


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?  It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two.  We should never have
the following sequence of events:

        CPU 0 loads x obtaining 13;
        CPU 1 loads x obtaining 13;
        CPU 0 stores 14 to x;
        CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.  To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.

The same analysis applies to all atomic update operations.  Therefore,
to enforce atomicity the LKMM requires that atomic updates follow this
rule: Whenever R and W are the read and write events composing an
atomic read-modify-write and W' is the write event which R reads from,
there must not be any stores coming between W' and W in the coherence
order.  Equivalently,

        (R ->rmw W) implies (there is no X with R ->fr X and X ->co W),

where the rmw relation links the read and write events making up each
atomic update.  This is what the LKMM's "atomic" axiom says.
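In litmus-test style (again only a sketch), the lost-increment
scenario corresponds to the following test:

        atomic_t x = ATOMIC_INIT(13);

        P0()
        {
                atomic_inc(&x);
        }

        P1()
        {
                atomic_inc(&x);
        }

The "atomic" axiom allows only the final value x = 15.  The sequence
of events shown earlier, which would leave x equal to 14, requires
CPU 0's store to fall between the store CPU 1 reads from and CPU 1's
own store in x's coherence order, which is exactly what the rule
forbids.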


THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------

There are many situations where a CPU is obliged to execute two
instructions in program order.  We amalgamate them into the ppo (for
"preserved program order") relation, which links the po-earlier
instruction to the po-later instruction and is thus a sub-relation of
po.

The operational model already includes a description of one such
situation: Fences are a source of ppo links.  Suppose X and Y are
memory accesses with X ->po Y; then the CPU must execute X before Y if
any of the following hold:

        A strong (smp_mb() or synchronize_rcu()) fence occurs between
        X and Y;

        X and Y are both stores and an smp_wmb() fence occurs between
        them;

        X and Y are both loads and an smp_rmb() fence occurs between
        them;

        X is also an acquire fence, such as smp_load_acquire();

        Y is also a release fence, such as smp_store_release().

Another possibility, not mentioned earlier but discussed in the next
section, is:

        X and Y are both loads, X ->addr Y (i.e., there is an address
        dependency from X to Y), and X is a READ_ONCE() or an atomic
        access.

Dependencies can also cause instructions to be executed in program
order.  This is uncontroversial when the second instruction is a
store; either a data, address, or control dependency from a load R to
a store W will force the CPU to execute R before W.  This is very
simply because the CPU cannot tell the memory subsystem about W's
store before it knows what value should be stored (in the case of a
data dependency), what location it should be stored into (in the case
of an address dependency), or whether the store should actually take
place (in the case of a control dependency).

Dependencies to load instructions are more problematic.  To begin
with, there is no such thing as a data dependency to a load.  Next, a
CPU has no reason to respect a control dependency to a load, because
it can always satisfy the second load speculatively before the first,
and then ignore the result if it turns out that the second load
shouldn't be executed after all.  And lastly, the real difficulties
begin when we consider address dependencies to loads.

To be fair about it, all Linux-supported architectures do execute
loads in program order if there is an address dependency between them.
After all, a CPU cannot ask the memory subsystem to load a value from
a particular location before it knows what that location is.  However,
the split-cache design used by Alpha can cause it to behave in a way
that looks as if the loads were executed out of order (see the next
section for more details).  The kernel includes a workaround for this
problem when the loads come from READ_ONCE(), and therefore the LKMM
includes address dependencies to loads in the ppo relation.

On the other hand, dependencies can indirectly affect the ordering of
two loads.  This happens when there is a dependency from a load to a
store and a second, po-later load reads from that store:

        R ->dep W ->rfi R',

where the dep link can be either an address or a data dependency.  In
this situation we know it is possible for the CPU to execute R' before
W, because it can forward the value that W will store to R'.  But it
cannot execute R' before R, because it cannot forward the value before
it knows what that value is, or that W and R' do access the same
location.  However, if there is merely a control dependency between R
and W then the CPU can speculatively forward W to R' before executing
R; if the speculation turns out to be wrong then the CPU merely has to
restart or abandon R'.
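Here is a sketch of the R ->dep W ->rfi R' pattern (the variable names
are invented for the illustration):

        int x, y;

        P0()
        {
                int r1, r2;

                r1 = READ_ONCE(x);      /* R */
                WRITE_ONCE(y, r1);      /* W: data dependency from R */
                r2 = READ_ONCE(y);      /* R': may be satisfied by forwarding from W */
        }

The CPU may satisfy R' early by forwarding the value it is about to
store in W, but it cannot do so before it has executed R, since until
then it does not know what that value is (nor, in the
address-dependency variant of the pattern, whether W and R' really
access the same location).  The net effect is that R is guaranteed to
execute before R', even though no fence or direct dependency links the
two loads.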
(In theory, a CPU might forward a store to a load when it runs across
an address dependency like this:

        r1 = READ_ONCE(ptr);
        WRITE_ONCE(*r1, 17);
        r2 = READ_ONCE(*r1);

because it could tell that the store and the second load access the
same location even before it knows what the location's address is.
However, none of the architectures supported by the Linux kernel do
this.)

Two memory accesses of the same location must always be executed in
program order if the second access is a store.  Thus, if we have

        R ->po-loc W

(the po-loc link says that R comes before W in program order and they
access the same location), the CPU is obliged to execute W after R.
If it executed W first then the memory subsystem would respond to R's
read request with the value stored by W (or an even later store), in
violation of the read-write coherence rule.  Similarly, if we had

        W ->po-loc W'

and the CPU executed W' before W, then the memory subsystem would put
W' before W in the coherence order.  It would effectively cause W to
overwrite W', in violation of the write-write coherence rule.
(Interestingly, an early ARMv8 memory model, now obsolete, proposed
allowing out-of-order writes like this to occur.  The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)


AND THEN THERE WAS ALPHA
------------------------

As mentioned above, the Alpha architecture is unique in that it does
not appear to respect address dependencies to loads.  This means that
code such as the following:

        int x = 0;
        int y = -1;
        int *ptr = &y;

        P0()
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(ptr, &x);
        }

        P1()
        {
                int *r1;
                int r2;

                r1 = ptr;
                r2 = READ_ONCE(*r1);
        }

can malfunction on Alpha systems (notice that P1 uses an ordinary load
to read ptr instead of READ_ONCE()).  It is quite possible that r1 = &x
and r2 = 0 at the end, in spite of the address dependency.

At first glance this doesn't seem to make sense.  We know that the
smp_wmb() forces P0's store to x to propagate to P1 before the store
to ptr does.  And since P1 can't execute its second load until it
knows what location to load from, i.e., after executing its first
load, the value x = 1 must have propagated to P1 before the second
load executed.  So why doesn't r2 end up equal to 1?

The answer lies in the Alpha's split local caches.  Although the two
stores do reach P1's local cache in the proper order, it can happen
that the first store is processed by a busy part of the cache while
the second store is processed by an idle part.
As a result, the x = 1
value may not become available for P1's CPU to read until after the
ptr = &x value does, leading to the undesirable result above.  The
final effect is that even though the two loads really are executed in
program order, it appears that they aren't.

This could not have happened if the local cache had processed the
incoming stores in FIFO order.  By contrast, other architectures
maintain at least the appearance of FIFO order.

In practice, this difficulty is solved by inserting a special fence
between P1's two loads when the kernel is compiled for the Alpha
architecture.  In fact, as of version 4.15, the kernel automatically
adds this fence after every READ_ONCE() and atomic load on Alpha.  The
effect of the fence is to cause the CPU not to execute any po-later
instructions until after the local cache has finished processing all
the stores it has already received.  Thus, if the code was changed to:

        P1()
        {
                int *r1;
                int r2;

                r1 = READ_ONCE(ptr);
                r2 = READ_ONCE(*r1);
        }

then we would never get r1 = &x and r2 = 0.  By the time P1 executed
its second load, the x = 1 store would already be fully processed by
the local cache and available for satisfying the read request.  Thus
we have yet another reason why shared data should always be read with
READ_ONCE() or another synchronization primitive rather than accessed
directly.

The LKMM requires that smp_rmb(), acquire fences, and strong fences
share this property: They do not allow the CPU to execute any po-later
instructions (or po-later loads in the case of smp_rmb()) until all
outstanding stores have been processed by the local cache.  In the
case of a strong fence, the CPU first has to wait for all of its
po-earlier stores to propagate to every other CPU in the system; then
it has to wait for the local cache to process all the stores received
as of that time -- not just the stores received when the strong fence
began.

And of course, none of this matters for any architecture other than
Alpha.


THE HAPPENS-BEFORE RELATION: hb
-------------------------------

The happens-before relation (hb) links memory accesses that have to
execute in a certain order.  hb includes the ppo relation and two
others, one of which is rfe.

W ->rfe R implies that W and R are on different CPUs.
It also means
that W's store must have propagated to R's CPU before R executed;
otherwise R could not have read the value stored by W.  Therefore W
must have executed before R, and so we have W ->hb R.

The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
the same CPU).  As we have already seen, the operational model allows
W's value to be forwarded to R in such cases, meaning that R may well
execute before W does.

It's important to understand that neither coe nor fre is included in
hb, despite their similarities to rfe.  For example, suppose we have
W ->coe W'.  This means that W and W' are stores to the same location,
they execute on different CPUs, and W comes before W' in the coherence
order (i.e., W' overwrites W).  Nevertheless, it is possible for W' to
execute before W, because the decision as to which store overwrites
the other is made later by the memory subsystem.  When the stores are
nearly simultaneous, either one can come out on top.  Similarly,
R ->fre W means that W overwrites the value which R reads, but it
doesn't mean that W has to execute after R.  All that's necessary is
for the memory subsystem not to propagate W to R's CPU until after R
has executed, which is possible if W executes shortly before R.

The third relation included in hb is like ppo, in that it only links
events that are on the same CPU.  However it is more difficult to
explain, because it arises only indirectly from the requirement of
cache coherence.  The relation is called prop, and it links two events
on CPU C in situations where a store from some other CPU comes after
the first event in the coherence order and propagates to C before the
second event executes.

This is best explained with some examples.  The simplest case looks
like this:

        int x;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 1);
                r1 = READ_ONCE(x);
        }

        P1()
        {
                WRITE_ONCE(x, 8);
        }

If r1 = 8 at the end then P0's accesses must have executed in program
order.  We can deduce this from the operational model; if P0's load
had executed before its store then the value of the store would have
been forwarded to the load, so r1 would have ended up equal to 1, not
8.  In this case there is a prop link from P0's write event to its
read event, because P1's store came after P0's store in x's coherence
order, and P1's store propagated to P0 before P0's load executed.

An equally simple case involves two loads of the same location that
read from different stores:

        int x = 0;

        P0()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                r2 = READ_ONCE(x);
        }

        P1()
        {
                WRITE_ONCE(x, 9);
        }

If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
in program order.  If the second load had executed before the first
then the x = 9 store must have been propagated to P0 before the first
load executed, and so r1 would have been 9 rather than 0.  In this
case there is a prop link from P0's first read event to its second,
because P1's store overwrote the value read by P0's first load, and
P1's store propagated to P0 before P0's second load executed.

Less trivial examples of prop all involve fences.  Unlike the simple
examples above, they can require that some instructions are executed
out of program order.  This next one should look familiar:

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                smp_wmb();
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2;

                r1 = READ_ONCE(flag);
                r2 = READ_ONCE(buf);
        }

This is the MP pattern again, with an smp_wmb() fence between the two
stores.  If r1 = 1 and r2 = 0 at the end then there is a prop link
from P1's second load to its first (backwards!).  The reason is
similar to the previous examples: The value P1 loads from buf gets
overwritten by P0's store to buf, the fence guarantees that the store
to buf will propagate to P1 before the store to flag does, and the
store to flag propagates to P1 before P1 reads flag.

The prop link says that in order to obtain the r1 = 1, r2 = 0 result,
P1 must execute its second load before the first.  Indeed, if the load
from flag were executed first, then the buf = 1 store would already
have propagated to P1 by the time P1's load from buf executed, so r2
would have been 1 at the end, not 0.  (The reasoning holds even for
Alpha, although the details are more complicated and we will not go
into them.)

But what if we put an smp_rmb() fence between P1's loads?
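Concretely, P1 would become (a sketch, with buf and flag declared as
above):

        P1()
        {
                int r1;
                int r2;

                r1 = READ_ONCE(flag);
                smp_rmb();
                r2 = READ_ONCE(buf);
        }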
The fence
would force the two loads to be executed in program order, and it
would generate a cycle in the hb relation: The fence would create a
ppo link (hence an hb link) from the first load to the second, and the
prop relation would give an hb link from the second load to the first.
Since an instruction can't execute before itself, we are forced to
conclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0
outcome is impossible -- as it should be.

The formal definition of the prop relation involves a coe or fre link,
followed by an arbitrary number of cumul-fence links, ending with an
rfe link.  You can concoct more exotic examples, containing more than
one fence, although this quickly leads to diminishing returns in terms
of complexity.  For instance, here's an example containing a coe link
followed by two cumul-fences and an rfe link, utilizing the fact that
release fences are A-cumulative:

        int x, y, z;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(z);
        }

        P1()
        {
                WRITE_ONCE(x, 2);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        P2()
        {
                int r2;

                r2 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }

If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
link from P0's store to its load.  This is because P0's store gets
overwritten by P1's store since x = 2 at the end (a coe link), the
smp_wmb() ensures that P1's store to x propagates to P2 before the
store to y does (the first cumul-fence), the store to y propagates to
P2 before P2's load and store execute, P2's smp_store_release()
guarantees that the stores to x and y both propagate to P0 before the
store to z does (the second cumul-fence), and P0's load executes after
the store to z has propagated to P0 (an rfe link).

In summary, the fact that the hb relation links memory access events
in the order they execute means that it must not have cycles.  This
requirement is the content of the LKMM's "happens-before" axiom.

The LKMM defines yet another relation connected to times of
instruction execution, but it is not included in hb.  It relies on the
particular properties of strong fences, which we cover in the next
section.


THE PROPAGATES-BEFORE RELATION: pb
----------------------------------

The propagates-before (pb) relation capitalizes on the special
features of strong fences.  It links two events E and F whenever some
store is coherence-later than E and propagates to every CPU and to RAM
before F executes.  The formal definition requires that E be linked to
F via a coe or fre link, an arbitrary number of cumul-fences, an
optional rfe link, a strong fence, and an arbitrary number of hb
links.  Let's see how this definition works out.

Consider first the case where E is a store (implying that the sequence
of links begins with coe).  Then there are events W, X, Y, and Z such
that:

        E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,

where the * suffix indicates an arbitrary number of links of the
specified type, and the ? suffix indicates the link is optional (Y may
be equal to X).  Because of the cumul-fence links, we know that W will
propagate to Y's CPU before X does, hence before Y executes and hence
before the strong fence executes.  Because this fence is strong, we
know that W will propagate to every CPU and to RAM before Z executes.
And because of the hb links, we know that Z will execute before F.
Thus W, which comes later than E in the coherence order, will
propagate to every CPU and to RAM before F executes.

The case where E is a load is exactly the same, except that the first
link in the sequence is fre instead of coe.

The existence of a pb link from E to F implies that E must execute
before F.  To see why, suppose that F executed first.  Then W would
have propagated to E's CPU before E executed.  If E was a store, the
memory subsystem would then be forced to make E come after W in the
coherence order, contradicting the fact that E ->coe W.  If E was a
load, the memory subsystem would then be forced to satisfy E's read
request with the value stored by W or an even later store,
contradicting the fact that E ->fre W.

A good example illustrating how pb works is the SB pattern with strong
fences:

        int x = 0, y = 0;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        P1()
        {
                int r1;

                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

If r0 = 0 at the end then there is a pb link from P0's load to P1's
load: an fre link from P0's load to P1's store (which overwrites the
value read by P0), and a strong fence between P1's store and its load.
In this example, the sequences of cumul-fence and hb links are empty.
Note that this pb link is not included in hb as an instance of prop,
because it does not start and end on the same CPU.

Similarly, if r1 = 0 at the end then there is a pb link from P1's load
to P0's.  This means that if both r0 and r1 were 0 there would be a
cycle in pb, which is not possible since an instruction cannot execute
before itself.  Thus, adding smp_mb() fences to the SB pattern
prevents the r0 = 0, r1 = 0 outcome.

In summary, the fact that the pb relation links events in the order
they execute means that it cannot have cycles.  This requirement is
the content of the LKMM's "propagation" axiom.


RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
------------------------------------------------------------------------

RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
rests on two concepts: grace periods and read-side critical sections.

A grace period is the span of time occupied by a call to
synchronize_rcu().  A read-side critical section (or just critical
section, for short) is a region of code delimited by rcu_read_lock()
at the start and rcu_read_unlock() at the end.  Critical sections can
be nested, although we won't make use of this fact.

As far as memory models are concerned, RCU's main feature is its
Grace-Period Guarantee, which states that a critical section can never
span a full grace period.
In more detail, the Guarantee says:

        For any critical section C and any grace period G, at least
        one of the following statements must hold:

(1)     C ends before G does, and in addition, every store that
        propagates to C's CPU before the end of C must propagate to
        every CPU before G ends.

(2)     G starts before C does, and in addition, every store that
        propagates to G's CPU before the start of G must propagate
        to every CPU before C starts.

In particular, it is not possible for a critical section to both start
before and end after a grace period.

Here is a simple example of RCU in action:

        int x, y;

        P0()
        {
                rcu_read_lock();
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                rcu_read_unlock();
        }

        P1()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                synchronize_rcu();
                r2 = READ_ONCE(y);
        }

The Grace Period Guarantee tells us that when this code runs, it will
never end with r1 = 1 and r2 = 0.  The reasoning is as follows.  r1 = 1
means that P0's store to x propagated to P1 before P1 called
synchronize_rcu(), so P0's critical section must have started before
P1's grace period, contrary to part (2) of the Guarantee.  On the
other hand, r2 = 0 means that P0's store to y, which occurs before the
end of the critical section, did not propagate to P1 before the end of
the grace period, contrary to part (1).  Together the results violate
the Guarantee.

In the kernel's implementations of RCU, the requirements for stores
to propagate to every CPU are fulfilled by placing strong fences at
suitable places in the RCU-related code.  Thus, if a critical section
starts before a grace period does then the critical section's CPU will
execute an smp_mb() fence after the end of the critical section and
some time before the grace period's synchronize_rcu() call returns.
And if a critical section ends after a grace period does then the
synchronize_rcu() routine will execute an smp_mb() fence at its start
and some time before the critical section's opening rcu_read_lock()
executes.

What exactly do we mean by saying that a critical section "starts
before" or "ends after" a grace period?  Some aspects of the meaning
are pretty obvious, as in the example above, but the details aren't
entirely clear.
The LKMM formalizes this notion by means of the
rcu-link relation.  rcu-link encompasses a very general notion of
"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
rcu_read_unlock(), or synchronize_rcu()) then among other things,
E ->rcu-link F includes cases where E is po-before some memory-access
event X, F is po-after some memory-access event Y, and we have any of
X ->rfe Y, X ->co Y, or X ->fr Y.

The formal definition of the rcu-link relation is more than a little
obscure, and we won't give it here.  It is closely related to the pb
relation, and the details don't matter unless you want to comb through
a somewhat lengthy formal proof.  Pretty much all you need to know
about rcu-link is the information in the preceding paragraph.

The LKMM also defines the rcu-gp and rcu-rscsi relations.  They bring
grace periods and read-side critical sections into the picture, in the
following way:

        E ->rcu-gp F means that E and F are in fact the same event,
        and that event is a synchronize_rcu() fence (i.e., a grace
        period).

        E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
        and rcu_read_lock() fence events delimiting some read-side
        critical section.  (The 'i' at the end of the name emphasizes
        that this relation is "inverted": It links the end of the
        critical section to the start.)

If we think of the rcu-link relation as standing for an extended
"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
grace period which ends before Z begins.  (In fact it covers more than
this, because it also includes cases where some store propagates to
Z's CPU before Z begins but doesn't propagate to some other CPU until
after X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
the end of a critical section which starts before Z begins.

The LKMM goes on to define the rcu-order relation as a sequence of
rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
number of rcu-gp links is >= the number of rcu-rscsi links.  For
example:

        X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V

would imply that X ->rcu-order V, because this sequence contains two
rcu-gp links and one rcu-rscsi link.  (It also implies that
X ->rcu-order T and Z ->rcu-order V.)  On the other hand:

        X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V

does not imply X ->rcu-order V, because the sequence contains only
one rcu-gp link but two rcu-rscsi links.

The rcu-order relation is important because the Grace Period Guarantee
means that rcu-order links act kind of like strong fences.  In
particular, E ->rcu-order F implies not only that E begins before F
ends, but also that any write po-before E will propagate to every CPU
before any instruction po-after F can execute.  (However, it does not
imply that E must execute before F; in fact, each synchronize_rcu()
fence event is linked to itself by rcu-order as a degenerate case.)

To prove this in full generality requires some intellectual effort.
We'll consider just a very simple case:

        G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.

This formula means that G and W are the same event (a grace period),
and there are events X, Y and a read-side critical section C such that:

        1. G = W is po-before or equal to X;

        2. X comes "before" Y in some sense (including rfe, co and fr);

        3. Y is po-before Z;

        4. Z is the rcu_read_unlock() event marking the end of C;

        5. F is the rcu_read_lock() event marking the start of C.

From 1 - 4 we deduce that the grace period G ends before the critical
section C.  Then part (2) of the Grace Period Guarantee says not only
that G starts before C does, but also that any write which executes on
G's CPU before G starts must propagate to every CPU before C starts.
In particular, the write propagates to every CPU before F finishes
executing and hence before any instruction po-after F can execute.
This sort of reasoning can be extended to handle all the situations
covered by rcu-order.

The rcu-fence relation is a simple extension of rcu-order.  While
rcu-order only links certain fence events (calls to synchronize_rcu(),
rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
that are separated by an rcu-order link.  This is analogous to the way
the strong-fence relation links events that are separated by an
smp_mb() fence event (as mentioned above, rcu-order links act kind of
like strong fences).  Written symbolically, X ->rcu-fence Y means
there are fence events E and F such that:

        X ->po E ->rcu-order F ->po Y.

From the discussion above, we see this implies not only that X
executes before Y, but also (if X is a store) that X propagates to
every CPU before Y executes.
Thus rcu-fence is sort of a
"super-strong" fence: Unlike the original strong fences (smp_mb() and
synchronize_rcu()), rcu-fence is able to link events on different
CPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
really a fence at all!)

Finally, the LKMM defines the RCU-before (rb) relation in terms of
rcu-fence.  This is done in essentially the same way as the pb
relation was defined in terms of strong-fence.  We will omit the
details; the end result is that E ->rb F implies E must execute
before F, just as E ->pb F does (and for much the same reasons).

Putting this all together, the LKMM expresses the Grace Period
Guarantee by requiring that the rb relation does not contain a cycle.
Equivalently, this "rcu" axiom requires that there are no events E
and F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
the axiom requires that there are no cycles consisting of rcu-gp and
rcu-rscsi alternating with rcu-link, where the number of rcu-gp links
is >= the number of rcu-rscsi links.

Justifying the axiom isn't easy, but it is in fact a valid
formalization of the Grace Period Guarantee.  We won't attempt to go
through the detailed argument, but the following analysis gives a
taste of what is involved.  Suppose both parts of the Guarantee are
violated: A critical section starts before a grace period, and some
store propagates to the critical section's CPU before the end of the
critical section but doesn't propagate to some other CPU until after
the end of the grace period.

Putting symbols to these ideas, let L and U be the rcu_read_lock() and
rcu_read_unlock() fence events delimiting the critical section in
question, and let S be the synchronize_rcu() fence event for the grace
period.  Saying that the critical section starts before S means there
are events Q and R where Q is po-after L (which marks the start of the
critical section), Q is "before" R in the sense used by the rcu-link
relation, and R is po-before the grace period S.  Thus we have:

        L ->rcu-link S.

Let W be the store mentioned above, let Y come before the end of the
critical section and witness that W propagates to the critical
section's CPU by reading from W, and let Z on some arbitrary CPU be a
witness that W has not propagated to that CPU, where Z happens after
some event X which is po-after S.  Symbolically, this amounts to:

        S ->po X ->hb* Z ->fr W ->rf Y ->po U.

The fr link from Z to W indicates that W has not propagated to Z's CPU
at the time that Z executes.
From this, it can be shown (see the
discussion of the rcu-link relation earlier) that S and U are related
by rcu-link:

        S ->rcu-link U.

Since S is a grace period we have S ->rcu-gp S, and since L and U are
the start and end of the critical section C we have U ->rcu-rscsi L.
From this we obtain:

        S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,

a forbidden cycle.  Thus the "rcu" axiom rules out this violation of
the Grace Period Guarantee.

For something a little more down-to-earth, let's see how the axiom
works out in practice.  Consider the RCU code example from above, this
time with statement labels added:

        int x, y;

        P0()
        {
        L:      rcu_read_lock();
        X:      WRITE_ONCE(x, 1);
        Y:      WRITE_ONCE(y, 1);
        U:      rcu_read_unlock();
        }

        P1()
        {
                int r1, r2;

        Z:      r1 = READ_ONCE(x);
        S:      synchronize_rcu();
        W:      r2 = READ_ONCE(y);
        }

If r2 = 0 at the end then P0's store at Y overwrites the value that
P1's load at W reads from, so we have W ->fre Y.  Since S ->po W and
also Y ->po U, we get S ->rcu-link U.  In addition, S ->rcu-gp S
because S is a grace period.

If r1 = 1 at the end then P1's load at Z reads from P0's store at X,
so we have X ->rfe Z.  Together with L ->po X and Z ->po S, this
yields L ->rcu-link S.  And since L and U are the start and end of a
critical section, we have U ->rcu-rscsi L.

Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
forbidden cycle, violating the "rcu" axiom.  Hence the outcome is not
allowed by the LKMM, as we would expect.

For contrast, let's see what can happen in a more complicated example:

        int x, y, z;

        P0()
        {
                int r0;

        L0:     rcu_read_lock();
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
        U0:     rcu_read_unlock();
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(y);
        S1:     synchronize_rcu();
                WRITE_ONCE(z, 1);
        }

        P2()
        {
                int r2;

        L2:     rcu_read_lock();
                r2 = READ_ONCE(z);
                WRITE_ONCE(x, 1);
        U2:     rcu_read_unlock();
        }

If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi
L2 ->rcu-link U0.  However this cycle is not forbidden, because the
sequence of relations contains fewer instances of rcu-gp (one) than of
rcu-rscsi (two).  Consequently the outcome is allowed by the LKMM.
The following instruction timing diagram shows how it might actually
occur:

P0                      P1                      P2
--------------------    --------------------    --------------------
rcu_read_lock()
WRITE_ONCE(y, 1)
                        r1 = READ_ONCE(y)
                        synchronize_rcu() starts
                        .                       rcu_read_lock()
                        .                       WRITE_ONCE(x, 1)
r0 = READ_ONCE(x)       .
rcu_read_unlock()       .
                        synchronize_rcu() ends
                        WRITE_ONCE(z, 1)
                                                r2 = READ_ONCE(z)
                                                rcu_read_unlock()

This requires P0 and P2 to execute their loads and stores out of
program order, but of course they are allowed to do so.  And as you
can see, the Grace Period Guarantee is not violated: The critical
section in P0 both starts before P1's grace period does and ends
before it does, and the critical section in P2 both starts after P1's
grace period does and ends after it does.

Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in
addition to normal RCU.  The ideas involved are much the same as
above, with new relations srcu-gp and srcu-rscsi added to represent
SRCU grace periods and read-side critical sections.
There is a
restriction on the srcu-gp and srcu-rscsi links that can appear in an
rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp
links having the same SRCU domain with proper nesting); the details
are relatively unimportant.


LOCKING
-------

The LKMM includes locking.  In fact, there is special code for locking
in the formal model, added in order to make tools run faster.
However, this special code is intended to be more or less equivalent
to concepts we have already covered.  A spinlock_t variable is treated
the same as an int, and spin_lock(&s) is treated almost the same as:

        while (cmpxchg_acquire(&s, 0, 1) != 0)
                cpu_relax();

This waits until s is equal to 0 and then atomically sets it to 1,
and the read part of the cmpxchg operation acts as an acquire fence.
An alternate way to express the same thing would be:

        r = xchg_acquire(&s, 1);

along with a requirement that at the end, r = 0.  Similarly,
spin_trylock(&s) is treated almost the same as:

        return !cmpxchg_acquire(&s, 0, 1);

which atomically sets s to 1 if it is currently equal to 0 and returns
true if it succeeds (the read part of the cmpxchg operation acts as an
acquire fence only if the operation is successful).  spin_unlock(&s)
is treated almost the same as:

        smp_store_release(&s, 0);

The "almost" qualifiers above need some explanation.  In the LKMM, the
store-release in a spin_unlock() and the load-acquire which forms the
first half of the atomic rmw update in a spin_lock() or a successful
spin_trylock() -- we can call these things lock-releases and
lock-acquires -- have two properties beyond those of ordinary releases
and acquires.

First, when a lock-acquire reads from a lock-release, the LKMM
requires that every instruction po-before the lock-release must
execute before any instruction po-after the lock-acquire.  This would
naturally hold if the release and acquire operations were on different
CPUs, but the LKMM says it holds even when they are on the same CPU.
For example:

        int x, y;
        spinlock_t s;

        P0()
        {
                int r1, r2;

                spin_lock(&s);
                r1 = READ_ONCE(x);
                spin_unlock(&s);
                spin_lock(&s);
                r2 = READ_ONCE(y);
                spin_unlock(&s);
        }

        P1()
        {
                WRITE_ONCE(y, 1);
                smp_wmb();
                WRITE_ONCE(x, 1);
        }

Here the second spin_lock() reads from the first spin_unlock(), and
therefore the load of x must execute before the load of y.  Thus we
cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
MP pattern).

This requirement does not apply to ordinary release and acquire
fences, only to lock-related operations.  For instance, suppose P0()
in the example had been written as:

        P0()
        {
                int r1, r2, r3;

                r1 = READ_ONCE(x);
                smp_store_release(&s, 1);
                r3 = smp_load_acquire(&s);
                r2 = READ_ONCE(y);
        }

Then the CPU would be allowed to forward the s = 1 value from the
smp_store_release() to the smp_load_acquire(), executing the
instructions in the following order:

        r3 = smp_load_acquire(&s);              // Obtains r3 = 1
        r2 = READ_ONCE(y);
        r1 = READ_ONCE(x);
        smp_store_release(&s, 1);               // Value is forwarded

and thus it could load y before x, obtaining r2 = 0 and r1 = 1.

Second, when a lock-acquire reads from a lock-release, and some other
stores W and W' occur po-before the lock-release and po-after the
lock-acquire respectively, the LKMM requires that W must propagate to
each CPU before W' does.
For example, consider:

        int x, y;
        spinlock_t s;

        P0()
        {
                spin_lock(&s);
                WRITE_ONCE(x, 1);
                spin_unlock(&s);
        }

        P1()
        {
                int r1;

                spin_lock(&s);
                r1 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
                spin_unlock(&s);
        }

        P2()
        {
                int r2, r3;

                r2 = READ_ONCE(y);
                smp_rmb();
                r3 = READ_ONCE(x);
        }

If r1 = 1 at the end then the spin_lock() in P1 must have read from
the spin_unlock() in P0.  Hence the store to x must propagate to P2
before the store to y does, so we cannot have r2 = 1 and r3 = 0.

These two special requirements for lock-release and lock-acquire do
not arise from the operational model.  Nevertheless, kernel developers
have come to expect and rely on them because they do hold on all
architectures supported by the Linux kernel, albeit for various
differing reasons.


PLAIN ACCESSES AND DATA RACES
-----------------------------

In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
smp_load_acquire(&z), and so on are collectively referred to as
"marked" accesses, because they are all annotated with special
operations of one kind or another.  Ordinary C-language memory
accesses such as x or y = 0 are simply called "plain" accesses.

Early versions of the LKMM had nothing to say about plain accesses.
The C standard allows compilers to assume that the variables affected
by plain accesses are not concurrently read or written by any other
threads or CPUs.  This leaves compilers free to implement all manner
of transformations or optimizations of code containing plain accesses,
making such code very difficult for a memory model to handle.

Here is just one example of a possible pitfall:

	int a = 6;
	int *x = &a;

	P0()
	{
		int *r1;
		int r2 = 0;

		r1 = x;
		if (r1 != NULL)
			r2 = READ_ONCE(*r1);
	}

	P1()
	{
		WRITE_ONCE(x, NULL);
	}

On the face of it, one would expect that when this code runs, the only
possible final values for r2 are 6 and 0, depending on whether or not
P1's store to x propagates to P0 before P0's load from x executes.
But since P0's load from x is a plain access, the compiler may decide
to carry out the load twice (for the comparison against NULL, then
again for the READ_ONCE()) and eliminate the temporary variable r1.
The object code generated for P0 could therefore end up looking rather
like this:

	P0()
	{
		int r2 = 0;

		if (x != NULL)
			r2 = READ_ONCE(*x);
	}

And now it is obvious that this code runs the risk of dereferencing a
NULL pointer, because P1's store to x might propagate to P0 after the
test against NULL has been made but before the READ_ONCE() executes.
If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
the compiler would not have performed this optimization and there
would be no possibility of a NULL-pointer dereference.

Given the possibility of transformations like this one, the LKMM
doesn't try to predict all possible outcomes of code containing plain
accesses.  It is instead content to determine whether the code
violates the compiler's assumptions, which would render the ultimate
outcome undefined.

In technical terms, the compiler is allowed to assume that when the
program executes, there will not be any data races.  A "data race"
occurs when there are two memory accesses such that:

1.	they access the same location,

2.	at least one of them is a store,

3.	at least one of them is plain,

4.	they occur on different CPUs (or in different threads on the
	same CPU), and

5.	they execute concurrently.

In the literature, two accesses are said to "conflict" if they satisfy
1 and 2 above.  We'll go a little farther and say that two accesses
are "race candidates" if they satisfy 1 - 4.
Thus, whether or not two
race candidates actually do race in a given execution depends on
whether they are concurrent.

The LKMM tries to determine whether a program contains race candidates
which may execute concurrently; if it does then the LKMM says there is
a potential data race and makes no predictions about the program's
outcome.

Determining whether two accesses are race candidates is easy; you can
see that all the concepts involved in the definition above are already
part of the memory model.  The hard part is telling whether they may
execute concurrently.  The LKMM takes a conservative attitude,
assuming that accesses may be concurrent unless it can prove they
are not.

If two memory accesses aren't concurrent then one must execute before
the other.  Therefore the LKMM decides two accesses aren't concurrent
if they can be connected by a sequence of hb, pb, and rb links
(together referred to as xb, for "executes before").  However, there
are two complicating factors.

If X is a load and X executes before a store Y, then indeed there is
no danger of X and Y being concurrent.  After all, Y can't have any
effect on the value obtained by X until the memory subsystem has
propagated Y from its own CPU to X's CPU, which won't happen until
some time after Y executes and thus after X executes.  But if X is a
store, then even if X executes before Y it is still possible that X
will propagate to Y's CPU just as Y is executing.  In such a case X
could very well interfere somehow with Y, and we would have to
consider X and Y to be concurrent.

Therefore when X is a store, for X and Y to be non-concurrent the LKMM
requires not only that X must execute before Y but also that X must
propagate to Y's CPU before Y executes.  (Or vice versa, of course, if
Y executes before X -- then Y must propagate to X's CPU before X
executes if Y is a store.)
This is expressed by the visibility
relation (vis), where X ->vis Y is defined to hold if there is an
intermediate event Z such that:

	X is connected to Z by a possibly empty sequence of
	cumul-fence links followed by an optional rfe link (if none of
	these links are present, X and Z are the same event),

and either:

	Z is connected to Y by a strong-fence link followed by a
	possibly empty sequence of xb links,

or:

	Z is on the same CPU as Y and is connected to Y by a possibly
	empty sequence of xb links (again, if the sequence is empty it
	means Z and Y are the same event).

The motivations behind this definition are straightforward:

	cumul-fence memory barriers force stores that are po-before
	the barrier to propagate to other CPUs before stores that are
	po-after the barrier.

	An rfe link from an event W to an event R says that R reads
	from W, which certainly means that W must have propagated to
	R's CPU before R executed.

	strong-fence memory barriers force stores that are po-before
	the barrier, or that propagate to the barrier's CPU before the
	barrier executes, to propagate to all CPUs before any events
	po-after the barrier can execute.

To see how this works out in practice, consider our old friend, the MP
pattern (with fences and statement labels, but without the conditional
test):

	int buf = 0, flag = 0;

	P0()
	{
		X: WRITE_ONCE(buf, 1);
		   smp_wmb();
		W: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Z: r1 = READ_ONCE(flag);
		   smp_rmb();
		Y: r2 = READ_ONCE(buf);
	}

The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
assuming r1 = 1 at the end, there is an rfe link from W to Z.  This
means that the store to buf must propagate from P0 to P1 before Z
executes.  Next, Z and Y are on the same CPU and the smp_rmb() fence
provides an xb link from Z to Y (i.e., it forces Z to execute before
Y).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
executes.
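
The same conclusion can be reached through the strong-fence branch of
the definition.  Here is a sketch, not one of the formal model's
standard examples, in which P1's smp_rmb() is simply replaced by the
stronger smp_mb():

	int buf = 0, flag = 0;

	P0()
	{
		X: WRITE_ONCE(buf, 1);
		   smp_wmb();
		W: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Z: r1 = READ_ONCE(flag);
		   smp_mb();
		Y: r2 = READ_ONCE(buf);
	}

Assuming r1 = 1 once again, X is connected to Z by a cumul-fence link
followed by an rfe link, and the smp_mb() provides a strong-fence link
from Z to Y (followed by an empty sequence of xb links).  Hence
X ->vis Y holds here as well.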

The second complicating factor mentioned above arises from the fact
that when we are considering data races, some of the memory accesses
are plain.  Now, although we have not said so explicitly, up to this
point most of the relations defined by the LKMM (ppo, hb, prop,
cumul-fence, pb, and so on -- including vis) apply only to marked
accesses.

There are good reasons for this restriction.  The compiler is not
allowed to apply fancy transformations to marked accesses, and
consequently each such access in the source code corresponds more or
less directly to a single machine instruction in the object code.  But
plain accesses are a different story; the compiler may combine them,
split them up, duplicate them, eliminate them, invent new ones, and
who knows what else.  Seeing a plain access in the source code tells
you almost nothing about what machine instructions will end up in the
object code.

Fortunately, the compiler isn't completely free; it is subject to some
limitations.  For one, it is not allowed to introduce a data race into
the object code if the source code does not already contain a data
race (if it could, memory models would be useless and no multithreaded
code would be safe!).  For another, it cannot move a plain access past
a compiler barrier.

A compiler barrier is a kind of fence, but as the name implies, it
only affects the compiler; it does not necessarily have any effect on
how instructions are executed by the CPU.  In Linux kernel source
code, the barrier() function is a compiler barrier.  It doesn't give
rise directly to any machine instructions in the object code; rather,
it affects how the compiler generates the rest of the object code.
Given source code like this:

	... some memory accesses ...
	barrier();
	... some other memory accesses ...

the barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end up po-before
any machine instructions corresponding to the second group of accesses
-- even if some of the accesses are plain.  (Of course, the CPU may
then execute some of those accesses out of program order, but we
already know how to deal with such issues.)  Without the barrier()
there would be no such guarantee; the two groups of accesses could be
intermingled or even reversed in the object code.

The LKMM doesn't say much about the barrier() function, but it does
require that all fences are also compiler barriers.
In addition, it
requires that the ordering properties of memory barriers such as
smp_rmb() or smp_store_release() apply to plain accesses as well as to
marked accesses.

This is the key to analyzing data races.  Consider the MP pattern
again, now using plain accesses for buf:

	int buf = 0, flag = 0;

	P0()
	{
		U: buf = 1;
		   smp_wmb();
		X: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Y: r1 = READ_ONCE(flag);
		   if (r1) {
			smp_rmb();
			V: r2 = buf;
		   }
	}

This program does not contain a data race.  Although the U and V
accesses are race candidates, the LKMM can prove they are not
concurrent as follows:

	The smp_wmb() fence in P0 is both a compiler barrier and a
	cumul-fence.  It guarantees that no matter what hash of
	machine instructions the compiler generates for the plain
	access U, all those instructions will be po-before the fence.
	Consequently U's store to buf, no matter how it is carried out
	at the machine level, must propagate to P1 before X's store to
	flag does.

	X and Y are both marked accesses.  Hence an rfe link from X to
	Y is a valid indicator that X propagated to P1 before Y
	executed, i.e., X ->vis Y.  (And if there is no rfe link then
	r1 will be 0, so V will not be executed and ipso facto won't
	race with U.)

	The smp_rmb() fence in P1 is a compiler barrier as well as a
	fence.  It guarantees that all the machine-level instructions
	corresponding to the access V will be po-after the fence, and
	therefore any loads among those instructions will execute
	after the fence does and hence after Y does.

Thus U's store to buf is forced to propagate to P1 before V's load
executes (assuming V does execute), ruling out the possibility of a
data race between them.

This analysis illustrates how the LKMM deals with plain accesses in
general.  Suppose R is a plain load and we want to show that R
executes before some marked access E.  We can do this by finding a
marked access X such that R and X are ordered by a suitable fence and
X ->xb* E.  If E was also a plain access, we would also look for a
marked access Y such that X ->xb* Y, and Y and E are ordered by a
fence.
We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.

In fact, we go one step further: Since R is a read, we say that R is
"r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a load or a store.
This distinction is needed because some fences affect only loads
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
the two types of bounds are the same.  And as a degenerate case, we
say that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself.)

The need to distinguish between r- and w-bounding raises yet another
issue.  When the source code contains a plain store, the compiler is
allowed to put plain loads of the same location into the object code.
For example, given the source code:

	x = 1;

the compiler is theoretically allowed to generate object code that
looks like:

	if (x != 1)
		x = 1;

thereby adding a load (and possibly replacing the store entirely).
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

(This may be overly cautious.  We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this.  Still, better safe than sorry.)

Incidentally, the other transformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed.  This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location.  Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load.  Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do.  Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.

The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load.
That
is, in the sequence:

	p = READ_ONCE(ptr);
	r = *p;

the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load.  This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold.  Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races.  For example:

	int a = 1, b;
	int *ptr = &a;

	P0()
	{
		b = 2;
		rcu_assign_pointer(ptr, &b);
	}

	P1()
	{
		int *p;
		int r;

		rcu_read_lock();
		p = rcu_dereference(ptr);
		r = *p;
		rcu_read_unlock();
	}

(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods.  They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_pointer().)

rcu_assign_pointer() performs a store-release, so the plain store to b
is definitely w-post-bounded before the store to ptr, and the two
stores will propagate to P1 in that order.  However, rcu_dereference()
is only equivalent to READ_ONCE().  While it is a marked access, it is
not a fence or compiler barrier.  Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.

This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one.  In particular, comparisons between the pointer and
other known addresses can cause trouble.
If you have something like:

	p = rcu_dereference(ptr);
	if (p == &x)
		r = *p;

then the compiler just might generate object code resembling:

	p = rcu_dereference(ptr);
	if (p == &x)
		r = x;

or even:

	rtemp = x;
	p = rcu_dereference(ptr);
	if (p == &x)
		r = rtemp;

which would invalidate the memory model's assumption, since the CPU
could now perform the load of x before the load of ptr (there might be
a control dependency but no address dependency at the machine level).

Finally, it turns out there is a situation in which a plain write does
not need to be w-post-bounded: when it is separated from the other
race-candidate access by a fence.  At first glance this may seem
impossible.  After all, to be race candidates the two accesses must
be on different CPUs, and fences don't link events on different CPUs.
Well, normal fences don't -- but rcu-fence can!  Here's an example:

	int x, y;

	P0()
	{
		WRITE_ONCE(x, 1);
		synchronize_rcu();
		y = 3;
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) == 0)
			y = 2;
		rcu_read_unlock();
	}

Do the plain stores to y race?  Clearly not if P1 reads a non-zero
value for x, so let's assume the READ_ONCE(x) does obtain 0.  This
means that the read-side critical section in P1 must finish executing
before the grace period in P0 does, because RCU's Grace-Period
Guarantee says that otherwise P0's store to x would have propagated to
P1 before the critical section started and so would have been visible
to the READ_ONCE().  (Another way of putting it is that the fre link
from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
between those two events.)

This means there is an rcu-fence link from P1's "y = 2" store to P0's
"y = 3" store, and consequently the first must propagate from P1 to P0
before the second can execute.  Therefore the two stores cannot be
concurrent and there is no race, even though P1's plain store to y
isn't w-post-bounded by any marked accesses.

Putting all this material together yields the following picture.
For
race-candidate stores W and W', where W ->co W', the LKMM says the
stores don't race if W can be linked to W' by a

	w-post-bounded ; vis ; w-pre-bounded

sequence.  If W is plain then they also have to be linked by an

	r-post-bounded ; xb* ; w-pre-bounded

sequence, and if W' is plain then they also have to be linked by a

	w-post-bounded ; vis ; r-pre-bounded

sequence.  For race-candidate load R and store W, the LKMM says the
two accesses don't race if R can be linked to W by an

	r-post-bounded ; xb* ; w-pre-bounded

sequence or if W can be linked to R by a

	w-post-bounded ; vis ; r-pre-bounded

sequence.  For the cases involving a vis link, the LKMM also accepts
sequences in which W is linked to W' or R by a

	strong-fence ; xb* ; {w and/or r}-pre-bounded

sequence with no post-bounding, and in every case the LKMM also allows
the link simply to be a fence with no bounding at all.  If no sequence
of the appropriate sort exists, the LKMM says that the accesses race.

There is one more part of the LKMM related to plain accesses (although
not to data races) we should discuss.  Recall that many relations such
as hb are limited to marked accesses only.  As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relations must not contain cycles) don't apply to plain
accesses.  Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.

To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):

	If R and W are race candidates and it is possible to link R to
	W by one of the xb* sequences listed above, then W ->rfe R is
	not allowed (i.e., a load cannot read from a store that it
	executes before, even if one or both is plain).

	If W and R are race candidates and it is possible to link W to
	R by one of the vis sequences listed above, then R ->fre W is
	not allowed (i.e., if a store is visible to a load then the
	load must read from that store or one coherence-after it).

	If W and W' are race candidates and it is possible to link W
	to W' by one of the vis sequences listed above, then W' ->co W
	is not allowed (i.e., if one store is visible to a second then
	the second must come after the first in the coherence order).

This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.


ODDS AND ENDS
-------------

This section covers material that didn't quite fit anywhere in the
earlier sections.

The descriptions in this document don't always match the formal
version of the LKMM exactly.  For example, the actual formal
definition of the prop relation makes the initial coe or fre part
optional, and it doesn't require the events linked by the relation to
be on the same CPU.  These differences are very unimportant; indeed,
instances where the coe/fre part of prop is missing are of no interest
because all the other parts (fences and rfe) are already included in
hb anyway, and where the formal model adds prop into hb, it includes
an explicit requirement that the events being linked are on the same
CPU.

Another minor difference has to do with events that are both memory
accesses and fences, such as those corresponding to smp_load_acquire()
calls.  In the formal model, these events aren't actually both reads
and fences; rather, they are read events with an annotation marking
them as acquires.  (Or write events annotated as releases, in the case
of smp_store_release().)  The final effect is the same.

Although we didn't mention it above, the instruction execution
ordering provided by the smp_rmb() fence doesn't apply to read events
that are part of a non-value-returning atomic update.  For instance,
given:

	atomic_inc(&x);
	smp_rmb();
	r1 = READ_ONCE(y);

it is not guaranteed that the load from y will execute after the
update to x.  This is because the ARMv8 architecture allows
non-value-returning atomic operations effectively to be executed off
the CPU.  Basically, the CPU tells the memory subsystem to increment
x, and then the increment is carried out by the memory hardware with
no further involvement from the CPU.  Since the CPU doesn't ever read
the value of x, there is nothing for the smp_rmb() fence to act on.
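
If the load from y really does need to be ordered after the atomic
update, a stronger barrier is required.  A minimal sketch (assuming
that ordering is the only goal):

	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);

As described below, smp_mb__after_atomic() orders po-earlier atomic
updates (and the events preceding them) against all po-later events,
so here the load from y is guaranteed to execute after the update
to x.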

The LKMM defines a few extra synchronization operations in terms of
things we have already covered.  In particular, rcu_dereference() is
treated as READ_ONCE() and rcu_assign_pointer() is treated as
smp_store_release() -- which is basically how the Linux kernel treats
them.

Although we said that plain accesses are not linked by the ppo
relation, they do contribute to it indirectly.  Namely, when there is
an address dependency from a marked load R to a plain store W,
followed by smp_wmb() and then a marked store W', the LKMM creates a
ppo link from R to W'.  The reasoning behind this is perhaps a little
shaky, but essentially it says there is no way to generate object code
for this source code in which W' could execute before R.  Just as with
pre-bounding by address dependencies, it is possible for the compiler
to undermine this relation if sufficient care is not taken.

There are a few oddball fences which need special treatment:
smp_mb__before_atomic(), smp_mb__after_atomic(), and
smp_mb__after_spinlock().  The LKMM uses fence events with special
annotations for them; they act as strong fences just like smp_mb()
except for the sets of events that they order.  Instead of ordering
all po-earlier events against all po-later events, as smp_mb() does,
they behave as follows:

	smp_mb__before_atomic() orders all po-earlier events against
	po-later atomic updates and the events following them;

	smp_mb__after_atomic() orders po-earlier atomic updates and
	the events preceding them against all po-later events;

	smp_mb__after_spinlock() orders po-earlier lock acquisition
	events and the events preceding them against all po-later
	events.

Interestingly, RCU and locking each introduce the possibility of
deadlock.  When faced with code sequences such as:

	spin_lock(&s);
	spin_lock(&s);
	spin_unlock(&s);
	spin_unlock(&s);

or:

	rcu_read_lock();
	synchronize_rcu();
	rcu_read_unlock();

what does the LKMM have to say?  Answer: It says there are no allowed
executions at all, which makes sense.  But this can also lead to
misleading results, because if a piece of code has multiple possible
executions, some of which deadlock, the model will report only on the
non-deadlocking executions.
For example:

	int x, y;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) > 0) {
			WRITE_ONCE(y, 36);
			synchronize_rcu();
		}
		rcu_read_unlock();
	}

Is it possible to end up with r0 = 36 at the end?  The LKMM will tell
you it is not, but the model won't mention that this is because P1
will self-deadlock in the executions where it stores 36 in y.