                         ============================
                         LINUX KERNEL MEMORY BARRIERS
                         ============================

By: David Howells <dhowells@redhat.com>
    Paul E. McKenney <paulmck@linux.ibm.com>
    Will Deacon <will.deacon@arm.com>
    Peter Zijlstra <peterz@infradead.org>

==========
DISCLAIMER
==========

This document is not a specification; it is intentionally (for the sake of
brevity) and unintentionally (due to being human) incomplete.  This document is
meant as a guide to using the various memory barriers provided by Linux, but
in case of any doubt (and there are many) please ask.  Some doubts may be
resolved by referring to the formal memory consistency model and related
documentation at tools/memory-model/.  Nevertheless, even this memory
model should be viewed as the collective opinion of its maintainers rather
than as an infallible oracle.

To repeat, this document is not a specification of what Linux expects from
hardware.

The purpose of this document is twofold:

 (1) to specify the minimum functionality that one can rely on for any
     particular barrier, and

 (2) to provide a guide as to how to use the barriers that are available.

Note that an architecture can provide more than the minimum requirement
for any particular barrier, but if the architecture provides less than
that, that architecture is incorrect.

Note also that it is possible that a barrier may be a no-op for an
architecture because the way that arch works renders an explicit barrier
unnecessary in that case.


========
CONTENTS
========

 (*) Abstract memory access model.

     - Device operations.
     - Guarantees.

 (*) What are memory barriers?

     - Varieties of memory barrier.
     - What may not be assumed about memory barriers?
     - Data dependency barriers (historical).
     - Control dependencies.
     - SMP barrier pairing.
     - Examples of memory barrier sequences.
     - Read memory barriers vs load speculation.
     - Multicopy atomicity.

 (*) Explicit kernel barriers.

     - Compiler barrier.
     - CPU memory barriers.

 (*) Implicit kernel memory barriers.

     - Lock acquisition functions.
     - Interrupt disabling functions.
     - Sleep and wake-up functions.
     - Miscellaneous functions.

 (*) Inter-CPU acquiring barrier effects.

     - Acquires vs memory accesses.

 (*) Where are memory barriers needed?

     - Interprocessor interaction.
     - Atomic operations.
     - Accessing devices.
     - Interrupts.

 (*) Kernel I/O barrier effects.

 (*) Assumed minimum execution ordering model.

 (*) The effects of the cpu cache.

     - Cache coherency.
     - Cache coherency vs DMA.
     - Cache coherency vs MMIO.

 (*) The things CPUs get up to.

     - And then there's the Alpha.
     - Virtual Machine Guests.

 (*) Example uses.

     - Circular buffers.

 (*) References.


============================
ABSTRACT MEMORY ACCESS MODEL
============================

Consider the following abstract model of the system:

                            :                :
                            :                :
                            :                :
                +-------+   :   +--------+   :   +-------+
                |       |   :   |        |   :   |       |
                |       |   :   |        |   :   |       |
                | CPU 1 |<----->| Memory |<----->| CPU 2 |
                |       |   :   |        |   :   |       |
                |       |   :   |        |   :   |       |
                +-------+   :   +--------+   :   +-------+
                    ^       :       ^        :       ^
                    |       :       |        :       |
                    |       :       |        :       |
                    |       :       v        :       |
                    |       :   +--------+   :       |
                    |       :   |        |   :       |
                    |       :   |        |   :       |
                    +---------->| Device |<----------+
                            :   |        |   :
                            :   |        |   :
                            :   +--------+   :
                            :                :

Each CPU executes a program that generates memory access operations.  In the
abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
perform the memory operations in any order it likes, provided program causality
appears to be maintained.  Similarly, the compiler may also arrange the
instructions it emits in any order it likes, provided it doesn't affect the
apparent operation of the program.

So in the above diagram, the effects of the memory operations performed by a
CPU are perceived by the rest of the system as the operations cross the
interface between the CPU and rest of the system (the dotted lines).


For example, consider the following sequence of events:

        CPU 1                   CPU 2
        ===============         ===============
        { A == 1; B == 2 }
        A = 3;                  x = B;
        B = 4;                  y = A;

The set of accesses as seen by the memory system in the middle can be arranged
in 24 different combinations:

        STORE A=3,      STORE B=4,      y=LOAD A->3,    x=LOAD B->4
        STORE A=3,      STORE B=4,      x=LOAD B->4,    y=LOAD A->3
        STORE A=3,      y=LOAD A->3,    STORE B=4,      x=LOAD B->4
        STORE A=3,      y=LOAD A->3,    x=LOAD B->2,    STORE B=4
        STORE A=3,      x=LOAD B->2,    STORE B=4,      y=LOAD A->3
        STORE A=3,      x=LOAD B->2,    y=LOAD A->3,    STORE B=4
        STORE B=4,      STORE A=3,      y=LOAD A->3,    x=LOAD B->4
        STORE B=4, ...
        ...

and can thus result in four different combinations of values:

        x == 2, y == 1
        x == 2, y == 3
        x == 4, y == 1
        x == 4, y == 3


Furthermore, the stores committed by a CPU to the memory system may not be
perceived by the loads made by another CPU in the same order as the stores were
committed.


As a further example, consider this sequence of events:

        CPU 1                   CPU 2
        ===============         ===============
        { A == 1, B == 2, C == 3, P == &A, Q == &C }
        B = 4;                  Q = P;
        P = &B;                 D = *Q;

There is an obvious data dependency here, as the value loaded into D depends on
the address retrieved from P by CPU 2.  At the end of the sequence, any of the
following results are possible:

        (Q == &A) and (D == 1)
        (Q == &B) and (D == 2)
        (Q == &B) and (D == 4)

Note that CPU 2 will never try to load C into D because the CPU will load P
into Q before issuing the load of *Q.


DEVICE OPERATIONS
-----------------

Some devices present their control interfaces as collections of memory
locations, but the order in which the control registers are accessed is very
important.  For instance, imagine an ethernet card with a set of internal
registers that are accessed through an address port register (A) and a data
port register (D).  To read internal register 5, the following code might then
be used:

        *A = 5;
        x = *D;

but this might show up as either of the following two sequences:

        STORE *A = 5, x = LOAD *D
        x = LOAD *D, STORE *A = 5

the second of which will almost certainly result in a malfunction, since it
sets the address _after_ attempting to read the register.
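
In Linux driver code such accesses would normally be made through the MMIO
accessors rather than through plain pointer dereferences.  The following is
only a minimal sketch, in which the physical base address, mapping size and
register offsets are hypothetical and error handling is omitted; the ordering
guarantees of readl() and writel() themselves are discussed in the
"KERNEL I/O BARRIER EFFECTS" section:

        /* Hypothetical MMIO version of the indirect register read above. */
        void __iomem *regs = ioremap(DEV_PHYS_BASE, 0x10);     /* hypothetical base/size */
        u32 x;

        writel(5, regs + ADDR_PORT_OFFSET);     /* select internal register 5 */
        x = readl(regs + DATA_PORT_OFFSET);     /* then read its contents */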


GUARANTEES
----------

There are some minimal guarantees that may be expected of a CPU:

 (*) On any given CPU, dependent memory accesses will be issued in order, with
     respect to itself.  This means that for:

        Q = READ_ONCE(P); D = READ_ONCE(*Q);

     the CPU will issue the following memory operations:

        Q = LOAD P, D = LOAD *Q

     and always in that order.  However, on DEC Alpha, READ_ONCE() also
     emits a memory-barrier instruction, so that a DEC Alpha CPU will
     instead issue the following memory operations:

        Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER

     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
     mischief.

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

        a = READ_ONCE(*X); WRITE_ONCE(*X, b);

     the CPU will only issue the following sequence of memory operations:

        a = LOAD *X, STORE *X = b

     And for:

        WRITE_ONCE(*X, c); d = READ_ONCE(*X);

     the CPU will only issue:

        STORE *X = c, d = LOAD *X

     (Loads and stores overlap if they are targeted at overlapping pieces of
     memory).

And there are a number of things that _must_ or _must_not_ be assumed:

 (*) It _must_not_ be assumed that the compiler will do what you want
     with memory references that are not protected by READ_ONCE() and
     WRITE_ONCE().  Without them, the compiler is within its rights to
     do all sorts of "creative" transformations, which are covered in
     the COMPILER BARRIER section.

 (*) It _must_not_ be assumed that independent loads and stores will be issued
     in the order given.
     This means that for:

        X = *A; Y = *B; *D = Z;

     we may get any of the following sequences:

        X = LOAD *A, Y = LOAD *B, STORE *D = Z
        X = LOAD *A, STORE *D = Z, Y = LOAD *B
        Y = LOAD *B, X = LOAD *A, STORE *D = Z
        Y = LOAD *B, STORE *D = Z, X = LOAD *A
        STORE *D = Z, X = LOAD *A, Y = LOAD *B
        STORE *D = Z, Y = LOAD *B, X = LOAD *A

 (*) It _must_ be assumed that overlapping memory accesses may be merged or
     discarded.  This means that for:

        X = *A; Y = *(A + 4);

     we may get any one of the following sequences:

        X = LOAD *A; Y = LOAD *(A + 4);
        Y = LOAD *(A + 4); X = LOAD *A;
        {X, Y} = LOAD {*A, *(A + 4) };

     And for:

        *A = X; *(A + 4) = Y;

     we may get any of:

        STORE *A = X; STORE *(A + 4) = Y;
        STORE *(A + 4) = Y; STORE *A = X;
        STORE {*A, *(A + 4) } = {X, Y};

And there are anti-guarantees:

 (*) These guarantees do not apply to bitfields, because compilers often
     generate code to modify these using non-atomic read-modify-write
     sequences.  Do not attempt to use bitfields to synchronize parallel
     algorithms.

 (*) Even in cases where bitfields are protected by locks, all fields
     in a given bitfield must be protected by one lock.  If two fields
     in a given bitfield are protected by different locks, the compiler's
     non-atomic read-modify-write sequences can cause an update to one
     field to corrupt the value of an adjacent field (see the sketch
     following this list).

 (*) These guarantees apply only to properly aligned and sized scalar
     variables.  "Properly sized" currently means variables that are
     the same size as "char", "short", "int" and "long".  "Properly
     aligned" means the natural alignment, thus no constraints for
     "char", two-byte alignment for "short", four-byte alignment for
     "int", and either four-byte or eight-byte alignment for "long",
     on 32-bit and 64-bit systems, respectively.  Note that these
     guarantees were introduced into the C11 standard, so beware when
     using older pre-C11 compilers (for example, gcc 4.6).
     The portion of the standard containing this guarantee is Section 3.14,
     which defines "memory location" as follows:

        memory location
                either an object of scalar type, or a maximal sequence
                of adjacent bit-fields all having nonzero width

                NOTE 1: Two threads of execution can update and access
                separate memory locations without interfering with
                each other.

                NOTE 2: A bit-field and an adjacent non-bit-field member
                are in separate memory locations.  The same applies
                to two bit-fields, if one is declared inside a nested
                structure declaration and the other is not, or if the two
                are separated by a zero-length bit-field declaration,
                or if they are separated by a non-bit-field member
                declaration.  It is not safe to concurrently update two
                bit-fields in the same structure if all members declared
                between them are also bit-fields, no matter what the
                sizes of those intervening bit-fields happen to be.
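
As an illustration of the bitfield anti-guarantee, here is a minimal sketch of
the broken arrangement described in the second item above; the structure, the
two locks and the two functions are hypothetical.  Because the compiler may
implement each bitfield update as a non-atomic read-modify-write of the whole
containing word, taking only one of the two locks does not protect the other
field:

        struct flags {
                unsigned int a : 1;     /* supposedly protected by lock_a */
                unsigned int b : 1;     /* supposedly protected by lock_b */
        };

        struct flags f;
        static DEFINE_SPINLOCK(lock_a);
        static DEFINE_SPINLOCK(lock_b);

        void set_a(void)
        {
                spin_lock(&lock_a);
                f.a = 1;                /* non-atomic RMW of the word that */
                spin_unlock(&lock_a);   /* also contains f.b */
        }

        void set_b(void)
        {
                spin_lock(&lock_b);
                f.b = 1;                /* BUG: may overwrite a concurrent f.a update */
                spin_unlock(&lock_b);
        }

Using a single lock for the whole bitfield, or separate full-sized fields,
avoids the problem.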


=========================
WHAT ARE MEMORY BARRIERS?
=========================

As can be seen above, independent memory operations are effectively performed
in random order, but this can be a problem for CPU-CPU interaction and for I/O.
What is required is some way of intervening to instruct the compiler and the
CPU to restrict the order.

Memory barriers are such interventions.  They impose a perceived partial
ordering over the memory operations on either side of the barrier.

Such enforcement is important because the CPUs and other devices in a system
can use a variety of tricks to improve performance, including reordering,
deferral and combination of memory operations; speculative loads; speculative
branch prediction and various types of caching.  Memory barriers are used to
override or suppress these tricks, allowing the code to sanely control the
interaction of multiple CPUs and/or devices.


VARIETIES OF MEMORY BARRIER
---------------------------

Memory barriers come in four basic varieties:

 (1) Write (or store) memory barriers.

     A write memory barrier gives a guarantee that all the STORE operations
     specified before the barrier will appear to happen before all the STORE
     operations specified after the barrier with respect to the other
     components of the system.

     A write barrier is a partial ordering on stores only; it is not required
     to have any effect on loads.

     A CPU can be viewed as committing a sequence of store operations to the
     memory system as time progresses.  All stores _before_ a write barrier
     will occur _before_ all the stores after the write barrier.

     [!] Note that write barriers should normally be paired with read or data
     dependency barriers; see the "SMP barrier pairing" subsection.


 (2) Data dependency barriers.

     A data dependency barrier is a weaker form of read barrier.  In the case
     where two loads are performed such that the second depends on the result
     of the first (eg: the first load retrieves the address to which the second
     load will be directed), a data dependency barrier would be required to
     make sure that the target of the second load is updated before the address
     obtained by the first load is accessed.

     A data dependency barrier is a partial ordering on interdependent loads
     only; it is not required to have any effect on stores, independent loads
     or overlapping loads.

     As mentioned in (1), the other CPUs in the system can be viewed as
     committing sequences of stores to the memory system that the CPU being
     considered can then perceive.  A data dependency barrier issued by the CPU
     under consideration guarantees that for any load preceding it, if that
     load touches one of a sequence of stores from another CPU, then by the
     time the barrier completes, the effects of all the stores prior to that
     touched by the load will be perceptible to any loads issued after the data
     dependency barrier.

     See the "Examples of memory barrier sequences" subsection for diagrams
     showing the ordering constraints.

     [!] Note that the first load really has to have a _data_ dependency and
     not a control dependency.  If the address for the second load is dependent
     on the first load, but the dependency is through a conditional rather than
     actually loading the address itself, then it's a _control_ dependency and
     a full read barrier or better is required.  See the "Control dependencies"
     subsection for more information.

     [!] Note that data dependency barriers should normally be paired with
     write barriers; see the "SMP barrier pairing" subsection.


 (3) Read (or load) memory barriers.

     A read barrier is a data dependency barrier plus a guarantee that all the
     LOAD operations specified before the barrier will appear to happen before
     all the LOAD operations specified after the barrier with respect to the
     other components of the system.

     A read barrier is a partial ordering on loads only; it is not required to
     have any effect on stores.

     Read memory barriers imply data dependency barriers, and so can substitute
     for them.

     [!] Note that read barriers should normally be paired with write barriers;
     see the "SMP barrier pairing" subsection.


 (4) General memory barriers.

     A general memory barrier gives a guarantee that all the LOAD and STORE
     operations specified before the barrier will appear to happen before all
     the LOAD and STORE operations specified after the barrier with respect to
     the other components of the system.

     A general memory barrier is a partial ordering over both loads and stores.

     General memory barriers imply both read and write memory barriers, and so
     can substitute for either.
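
     A general barrier is the only variety above that orders a prior store
     against a later load on the same CPU.  The store-buffering pattern below
     is a minimal sketch of that use, assuming hypothetical shared variables
     x and y that are both initially zero and two hypothetical functions
     running on different CPUs:

        int x, y;
        int r1, r2;

        void cpu1(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();               /* order the store to x before the load from y */
                r1 = READ_ONCE(y);
        }

        void cpu2(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();               /* order the store to y before the load from x */
                r2 = READ_ONCE(x);
        }

     With both smp_mb() calls in place the outcome (r1 == 0 && r2 == 0) is
     forbidden; remove either barrier and it becomes possible.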

And a couple of implicit varieties:

 (5) ACQUIRE operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the ACQUIRE operation will appear to happen after the
     ACQUIRE operation with respect to the other components of the system.
     ACQUIRE operations include LOCK operations and both smp_load_acquire()
     and smp_cond_load_acquire() operations.

     Memory operations that occur before an ACQUIRE operation may appear to
     happen after it completes.

     An ACQUIRE operation should almost always be paired with a RELEASE
     operation.


 (6) RELEASE operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the RELEASE operation will appear to happen
     before the RELEASE operation with respect to the other components of the
     system.  RELEASE operations include UNLOCK operations and
     smp_store_release() operations.

     Memory operations that occur after a RELEASE operation may appear to
     happen before it completes.

     The use of ACQUIRE and RELEASE operations generally precludes the need
     for other sorts of memory barrier.  In addition, a RELEASE+ACQUIRE pair is
     -not- guaranteed to act as a full memory barrier.  However, after an
     ACQUIRE on a given variable, all memory accesses preceding any prior
     RELEASE on that same variable are guaranteed to be visible.  In other
     words, within a given variable's critical section, all accesses of all
     previous critical sections for that variable are guaranteed to have
     completed.

     This means that ACQUIRE acts as a minimal "acquire" operation and
     RELEASE acts as a minimal "release" operation.

A subset of the atomic operations described in atomic_t.txt have ACQUIRE and
RELEASE variants in addition to fully-ordered and relaxed (no barrier
semantics) definitions.  For compound atomics performing both a load and a
store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
only to the store portion of the operation.
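
As an illustration of how ACQUIRE and RELEASE pair in practice, here is a
minimal sketch of publishing an initialised object through a shared pointer;
the structure, the pointer and the two functions are hypothetical:

        struct foo {
                int a;
        };

        struct foo *published;          /* shared; NULL until publication */
        struct foo obj;

        void publish(void)              /* runs on CPU 1 */
        {
                obj.a = 42;                             /* initialise first... */
                smp_store_release(&published, &obj);    /* ...then publish (RELEASE) */
        }

        void consume(void)              /* runs on CPU 2 */
        {
                struct foo *p = smp_load_acquire(&published);   /* ACQUIRE */

                if (p)
                        BUG_ON(p->a != 42);     /* the initialisation is visible */
        }

As noted above, this pairing guarantees that everything before the RELEASE is
visible to everything after the matching ACQUIRE, but it is not a substitute
for a full memory barrier.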

Memory barriers are only required where there's a possibility of interaction
between two CPUs or between a CPU and a device.  If it can be guaranteed that
there won't be any such interaction in any particular piece of code, then
memory barriers are unnecessary in that piece of code.


Note that these are the _minimum_ guarantees.  Different architectures may give
more substantial guarantees, but they may _not_ be relied upon outside of arch
specific code.


WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
----------------------------------------------

There are certain things that the Linux kernel memory barriers do not guarantee:

 (*) There is no guarantee that any of the memory accesses specified before a
     memory barrier will be _complete_ by the completion of a memory barrier
     instruction; the barrier can be considered to draw a line in that CPU's
     access queue that accesses of the appropriate type may not cross.

 (*) There is no guarantee that issuing a memory barrier on one CPU will have
     any direct effect on another CPU or any other hardware in the system.  The
     indirect effect will be the order in which the second CPU sees the effects
     of the first CPU's accesses occur, but see the next point:

 (*) There is no guarantee that a CPU will see the correct order of effects
     from a second CPU's accesses, even _if_ the second CPU uses a memory
     barrier, unless the first CPU _also_ uses a matching memory barrier (see
     the subsection on "SMP Barrier Pairing").

 (*) There is no guarantee that some intervening piece of off-the-CPU
     hardware[*] will not reorder the memory accesses.  CPU cache coherency
     mechanisms should propagate the indirect effects of a memory barrier
     between CPUs, but might not do so in order.

        [*] For information on bus mastering DMA and coherency please read:

            Documentation/driver-api/pci/pci.rst
            Documentation/core-api/dma-api-howto.rst
            Documentation/core-api/dma-api.rst


DATA DEPENDENCY BARRIERS (HISTORICAL)
-------------------------------------

As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
DEC Alpha, which means that about the only people who need to pay attention
to this section are those working on DEC Alpha architecture-specific code
and those working on READ_ONCE() itself.  For those who need it, and for
those who are interested in the history, here is the story of
data-dependency barriers.

The usage requirements of data dependency barriers are a little subtle, and
it's not always obvious that they're needed.  To illustrate, consider the
following sequence of events:

        CPU 1                   CPU 2
        ===============         ===============
        { A == 1, B == 2, C == 3, P == &A, Q == &C }
        B = 4;
        <write barrier>
        WRITE_ONCE(P, &B);
                                Q = READ_ONCE(P);
                                D = *Q;

There's a clear data dependency here, and it would seem that by the end of the
sequence, Q must be either &A or &B, and that:

        (Q == &A) implies (D == 1)
        (Q == &B) implies (D == 4)

But!  CPU 2's perception of P may be updated _before_ its perception of B, thus
leading to the following situation:

        (Q == &B) and (D == 2) ????

While this may seem like a failure of coherency or causality maintenance, it
isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
Alpha).

To deal with this, a data dependency barrier or better must be inserted
between the address load and the data load:

        CPU 1                   CPU 2
        ===============         ===============
        { A == 1, B == 2, C == 3, P == &A, Q == &C }
        B = 4;
        <write barrier>
        WRITE_ONCE(P, &B);
                                Q = READ_ONCE(P);
                                <data dependency barrier>
                                D = *Q;

This enforces the occurrence of one of the two implications, and prevents the
third possibility from arising.
[!] Note that this extremely counterintuitive situation arises most easily on
machines with split caches, so that, for example, one cache bank processes
even-numbered cache lines and the other bank processes odd-numbered cache
lines.  The pointer P might be stored in an odd-numbered cache line, and the
variable B might be stored in an even-numbered cache line.  Then, if the
even-numbered bank of the reading CPU's cache is extremely busy while the
odd-numbered bank is idle, one can see the new value of the pointer P (&B),
but the old value of the variable B (2).


A data-dependency barrier is not required to order dependent writes
because the CPUs that the Linux kernel supports don't do writes
until they are certain (1) that the write will actually happen, (2)
of the location of the write, and (3) of the value to be written.
But please carefully read the "CONTROL DEPENDENCIES" section and the
Documentation/RCU/rcu_dereference.rst file:  The compiler can and does
break dependencies in a great many highly creative ways.

        CPU 1                   CPU 2
        ===============         ===============
        { A == 1, B == 2, C == 3, P == &A, Q == &C }
        B = 4;
        <write barrier>
        WRITE_ONCE(P, &B);
                                Q = READ_ONCE(P);
                                WRITE_ONCE(*Q, 5);

Therefore, no data-dependency barrier is required to order the read into
Q with the store into *Q.  In other words, this outcome is prohibited,
even without a data-dependency barrier:

        (Q == &B) && (B == 4)

Please note that this pattern should be rare.  After all, the whole point
of dependency ordering is to -prevent- writes to the data structure, along
with the expensive cache misses associated with those writes.  This pattern
can be used to record rare error conditions and the like, and the CPUs'
naturally occurring ordering prevents such records from being lost.


Note well that the ordering provided by a data dependency is local to
the CPU containing it.  See the section on "Multicopy atomicity" for
more information.


The data dependency barrier is very important to the RCU system,
for example.  See rcu_assign_pointer() and rcu_dereference() in
include/linux/rcupdate.h.  This permits the current target of an RCU'd
pointer to be replaced with a new modified target, without the replacement
target appearing to be incompletely initialised.

See also the subsection on "Cache Coherency" for a more thorough example.
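
For example, a minimal sketch of such a pointer publication might look as
follows; the structure, the global pointer and both functions are
hypothetical, and the reclamation of replaced targets is omitted:

        struct foo {
                int a;
        };

        struct foo __rcu *gp;           /* RCU-protected pointer */

        void publisher(void)
        {
                struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

                if (!p)
                        return;
                p->a = 1;                       /* initialise before publication */
                rcu_assign_pointer(gp, p);      /* provides the write-side ordering */
        }

        void reader(void)
        {
                struct foo *p;

                rcu_read_lock();
                p = rcu_dereference(gp);        /* subsumes the data dependency barrier */
                if (p)
                        BUG_ON(p->a != 1);      /* never sees the uninitialised value */
                rcu_read_unlock();
        }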


CONTROL DEPENDENCIES
--------------------

Control dependencies can be a bit tricky because current compilers do
not understand them.  The purpose of this section is to help you prevent
the compiler's ignorance from breaking your code.

A load-load control dependency requires a full read memory barrier, not
simply a data dependency barrier to make it work correctly.  Consider the
following bit of code:

        q = READ_ONCE(a);
        if (q) {
                <data dependency barrier>  /* BUG: No data dependency!!! */
                p = READ_ONCE(b);
        }

This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from b as having happened before the load from a.  In such a
case what's actually required is:

        q = READ_ONCE(a);
        if (q) {
                <read barrier>
                p = READ_ONCE(b);
        }

However, stores are not speculated.  This means that ordering -is- provided
for load-store control dependencies, as in the following example:

        q = READ_ONCE(a);
        if (q) {
                WRITE_ONCE(b, 1);
        }

Control dependencies pair normally with other types of barriers.
That said, please note that neither READ_ONCE() nor WRITE_ONCE()
are optional!  Without the READ_ONCE(), the compiler might combine the
load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
the compiler might combine the store to 'b' with other stores to 'b'.
Either can result in highly counterintuitive effects on ordering.

Worse yet, if the compiler is able to prove (say) that the value of
variable 'a' is always non-zero, it would be well within its rights
to optimize the original example by eliminating the "if" statement
as follows:

        q = a;
        b = 1;  /* BUG: Compiler and CPU can both reorder!!! */

So don't leave out the READ_ONCE().

It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:

        q = READ_ONCE(a);
        if (q) {
                barrier();
                WRITE_ONCE(b, 1);
                do_something();
        } else {
                barrier();
                WRITE_ONCE(b, 1);
                do_something_else();
        }

Unfortunately, current compilers will transform this as follows at high
optimization levels:

        q = READ_ONCE(a);
        barrier();
        WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
        if (q) {
                /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
                do_something();
        } else {
                /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
                do_something_else();
        }

Now there is no conditional between the load from 'a' and the store to
'b', which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory barriers, for example, smp_store_release():

        q = READ_ONCE(a);
        if (q) {
                smp_store_release(&b, 1);
                do_something();
        } else {
                smp_store_release(&b, 1);
                do_something_else();
        }

In contrast, without explicit memory barriers, two-legged-if control
ordering is guaranteed only when the stores differ, for example:

        q = READ_ONCE(a);
        if (q) {
                WRITE_ONCE(b, 1);
                do_something();
        } else {
                WRITE_ONCE(b, 2);
                do_something_else();
        }

The initial READ_ONCE() is still required to prevent the compiler from
proving the value of 'a'.

In addition, you need to be careful what you do with the local variable 'q',
otherwise the compiler might be able to guess the value and again remove
the needed conditional.
For example:

        q = READ_ONCE(a);
        if (q % MAX) {
                WRITE_ONCE(b, 1);
                do_something();
        } else {
                WRITE_ONCE(b, 2);
                do_something_else();
        }

If MAX is defined to be 1, then the compiler knows that (q % MAX) is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:

        q = READ_ONCE(a);
        WRITE_ONCE(b, 2);
        do_something_else();

Given this transformation, the CPU is not required to respect the ordering
between the load from variable 'a' and the store to variable 'b'.  It is
tempting to add a barrier(), but this does not help.  The conditional
is gone, and the barrier won't bring it back.  Therefore, if you are
relying on this ordering, you should make sure that MAX is greater than
one, perhaps as follows:

        q = READ_ONCE(a);
        BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
        if (q % MAX) {
                WRITE_ONCE(b, 1);
                do_something();
        } else {
                WRITE_ONCE(b, 2);
                do_something_else();
        }

Please note once again that the stores to 'b' differ.  If they were
identical, as noted earlier, the compiler could pull this store outside
of the 'if' statement.

You must also be careful not to rely too much on boolean short-circuit
evaluation.  Consider this example:

        q = READ_ONCE(a);
        if (q || 1 > 0)
                WRITE_ONCE(b, 1);

Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:

        q = READ_ONCE(a);
        WRITE_ONCE(b, 1);

This example underscores the need to ensure that the compiler cannot
out-guess your code.  More generally, although READ_ONCE() does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the results.

In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.  In particular, they do not
necessarily apply to code following the if-statement:

        q = READ_ONCE(a);
        if (q) {
                WRITE_ONCE(b, 1);
        } else {
                WRITE_ONCE(b, 2);
        }
        WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */

It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
the writes to 'b' with the condition.  Unfortunately for this line
of reasoning, the compiler might compile the two writes to 'b' as
conditional-move instructions, as in this fanciful pseudo-assembly
language:

        ld r1,a
        cmp r1,$0
        cmov,ne r4,$1
        cmov,eq r4,$2
        st r4,b
        st $1,c

A weakly ordered CPU would have no dependency of any sort between the load
from 'a' and the store to 'c'.  The control dependencies would extend
only to the pair of cmov instructions and the store depending on them.
In short, control dependencies apply only to the stores in the then-clause
and else-clause of the if-statement in question (including functions
invoked by those two clauses), not to code following that if-statement.


Note well that the ordering provided by a control dependency is local
to the CPU containing it.  See the section on "Multicopy atomicity"
for more information.


In summary:

 (*) Control dependencies can order prior loads against later stores.
     However, they do -not- guarantee any other sort of ordering:
     Not prior loads against later loads, nor prior stores against
     later anything.  If you need these other forms of ordering,
     use smp_rmb(), smp_wmb(), or, in the case of prior stores and
     later loads, smp_mb().

 (*) If both legs of the "if" statement begin with identical stores to
     the same variable, then those stores must be ordered, either by
     preceding both of them with smp_mb() or by using smp_store_release()
     to carry out the stores.  Please note that it is -not- sufficient
     to use barrier() at beginning of each leg of the "if" statement
     because, as shown by the example above, optimizing compilers can
     destroy the control dependency while respecting the letter of the
     barrier() law.

 (*) Control dependencies require at least one run-time conditional
     between the prior load and the subsequent store, and this
     conditional must involve the prior load.  If the compiler is able
     to optimize the conditional away, it will have also optimized
     away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
     can help to preserve the needed conditional.

 (*) Control dependencies require that the compiler avoid reordering the
     dependency into nonexistence.
     Careful use of READ_ONCE() or
     atomic{,64}_read() can help to preserve your control dependency.
     Please see the COMPILER BARRIER section for more information.

 (*) Control dependencies apply only to the then-clause and else-clause
     of the if-statement containing the control dependency, including
     any functions that these two clauses call.  Control dependencies
     do -not- apply to code following the if-statement containing the
     control dependency.

 (*) Control dependencies pair normally with other types of barriers.

 (*) Control dependencies do -not- provide multicopy atomicity.  If you
     need all the CPUs to see a given store at the same time, use smp_mb().

 (*) Compilers do not understand control dependencies.  It is therefore
     your job to ensure that they do not break your code.


SMP BARRIER PAIRING
-------------------

When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired.  A lack of appropriate pairing is almost certainly an error.

General barriers pair with each other, though they also pair with most
other types of barriers, albeit without multicopy atomicity.  An acquire
barrier pairs with a release barrier, but both may also pair with other
barriers, including of course general barriers.  A write barrier pairs
with a data dependency barrier, a control dependency, an acquire barrier,
a release barrier, a read barrier, or a general barrier.
Similarly a
read barrier, control dependency, or a data dependency barrier pairs
with a write barrier, an acquire barrier, a release barrier, or a
general barrier:

        CPU 1                   CPU 2
        ===============         ===============
        WRITE_ONCE(a, 1);
        <write barrier>
        WRITE_ONCE(b, 2);       x = READ_ONCE(b);
                                <read barrier>
                                y = READ_ONCE(a);

Or:

        CPU 1                   CPU 2
        ===============         ===============================
        a = 1;
        <write barrier>
        WRITE_ONCE(b, &a);      x = READ_ONCE(b);
                                <data dependency barrier>
                                y = *x;

Or even:

        CPU 1                   CPU 2
        ===============         ===============================
        r1 = READ_ONCE(y);
        <general barrier>
        WRITE_ONCE(x, 1);       if (r2 = READ_ONCE(x)) {
                                   <implicit control dependency>
                                   WRITE_ONCE(y, 1);
                                }

        assert(r1 == 0 || r2 == 0);

Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.

[!] Note that the stores before the write barrier would normally be expected to
match the loads after the read barrier or the data dependency barrier, and vice
versa:

        CPU 1                               CPU 2
        ===================                 ===================
        WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
        WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
        <write barrier>            \        <read barrier>
        WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
        WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
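
Written as kernel C rather than as an abstract per-CPU listing, the first of
the pairings above is the familiar message-passing pattern.  The following is
only a minimal sketch, assuming hypothetical shared variables and two
hypothetical functions running on different CPUs:

        int data;                       /* the payload */
        int ready;                      /* the flag, initially zero */

        void producer(void)             /* runs on CPU 1 */
        {
                data = 42;              /* write the payload... */
                smp_wmb();              /* ...and order it before the flag */
                WRITE_ONCE(ready, 1);
        }

        void consumer(void)             /* runs on CPU 2 */
        {
                while (!READ_ONCE(ready))       /* wait for the flag... */
                        cpu_relax();
                smp_rmb();              /* ...and order the payload read after it */
                BUG_ON(data != 42);     /* guaranteed by the barrier pairing */
        }

The same pairing can be expressed more succinctly with smp_store_release()
and smp_load_acquire().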


EXAMPLES OF MEMORY BARRIER SEQUENCES
------------------------------------

Firstly, write barriers act as partial orderings on store operations.
Consider the following sequence of events:

        CPU 1
        =======================
        STORE A = 1
        STORE B = 2
        STORE C = 3
        <write barrier>
        STORE D = 4
        STORE E = 5

This sequence of events is committed to the memory coherence system in an order
that the rest of the system might perceive as the unordered set of { STORE A,
STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
}:

        +-------+       :      :
        |       |       +------+
        |       |------>| C=3  |     }     /\
        |       |  :    +------+     }-----  \  -----> Events perceptible to
        |       |  :    | A=1  |     }        \/       the rest of the system
        |       |  :    +------+     }
        | CPU 1 |  :    | B=2  |     }
        |       |       +------+     }
        |       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
        |       |       +------+     }        requires all stores prior to the
        |       |  :    | E=5  |     }        barrier to be committed before
        |       |  :    +------+     }        further stores may take place
        |       |------>| D=4  |     }
        |       |       +------+
        +-------+       :      :
                           |
                           | Sequence in which stores are committed to the
                           | memory system by CPU 1
                           V


Secondly, data dependency barriers act as partial orderings on data-dependent
loads.  Consider the following sequence of events:

        CPU 1                           CPU 2
        =======================         =======================
                { B = 7; X = 9; Y = 8; C = &Y }
        STORE A = 1
        STORE B = 2
        <write barrier>
        STORE C = &B                    LOAD X
        STORE D = 4                     LOAD C (gets &B)
                                        LOAD *C (reads B)

Without intervention, CPU 2 may perceive the events on CPU 1 in some
effectively random order, despite the write barrier issued by CPU 1:

        +-------+       :      :                :       :
        |       |       +------+                +-------+  | Sequence of update
        |       |------>| B=2  |-----       --->| Y->8  |  | of perception on
        |       |  :    +------+     \          +-------+  | CPU 2
        | CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
        |       |       +------+       |        +-------+
        |       |   wwwwwwwwwwwwwwww   |        :       :
        |       |       +------+       |        :       :
        |       |  :    | C=&B |---    |        :       :       +-------+
        |       |  :    +------+   \   |        +-------+       |       |
        |       |------>| D=4  |    ----------->| C->&B |------>|       |
        |       |       +------+       |        +-------+       |       |
        +-------+       :      :       |        :       :       |       |
                                       |        :       :       |       |
                                       |        :       :       | CPU 2 |
                                       |        +-------+       |       |
            Apparently incorrect --->  |        | B->7  |------>|       |
            perception of B (!)        |        +-------+       |       |
| +-------+ | | 1058*4882a593Smuzhiyun | : : | | 1059*4882a593Smuzhiyun | +-------+ | | 1060*4882a593Smuzhiyun The load of X holds ---> \ | X->9 |------>| | 1061*4882a593Smuzhiyun up the maintenance \ +-------+ | | 1062*4882a593Smuzhiyun of coherence of B ----->| B->2 | +-------+ 1063*4882a593Smuzhiyun +-------+ 1064*4882a593Smuzhiyun : : 1065*4882a593Smuzhiyun 1066*4882a593Smuzhiyun 1067*4882a593SmuzhiyunIn the above example, CPU 2 perceives that B is 7, despite the load of *C 1068*4882a593Smuzhiyun(which would be B) coming after the LOAD of C. 1069*4882a593Smuzhiyun 1070*4882a593SmuzhiyunIf, however, a data dependency barrier were to be placed between the load of C 1071*4882a593Smuzhiyunand the load of *C (ie: B) on CPU 2: 1072*4882a593Smuzhiyun 1073*4882a593Smuzhiyun CPU 1 CPU 2 1074*4882a593Smuzhiyun ======================= ======================= 1075*4882a593Smuzhiyun { B = 7; X = 9; Y = 8; C = &Y } 1076*4882a593Smuzhiyun STORE A = 1 1077*4882a593Smuzhiyun STORE B = 2 1078*4882a593Smuzhiyun <write barrier> 1079*4882a593Smuzhiyun STORE C = &B LOAD X 1080*4882a593Smuzhiyun STORE D = 4 LOAD C (gets &B) 1081*4882a593Smuzhiyun <data dependency barrier> 1082*4882a593Smuzhiyun LOAD *C (reads B) 1083*4882a593Smuzhiyun 1084*4882a593Smuzhiyunthen the following will occur: 1085*4882a593Smuzhiyun 1086*4882a593Smuzhiyun +-------+ : : : : 1087*4882a593Smuzhiyun | | +------+ +-------+ 1088*4882a593Smuzhiyun | |------>| B=2 |----- --->| Y->8 | 1089*4882a593Smuzhiyun | | : +------+ \ +-------+ 1090*4882a593Smuzhiyun | CPU 1 | : | A=1 | \ --->| C->&Y | 1091*4882a593Smuzhiyun | | +------+ | +-------+ 1092*4882a593Smuzhiyun | | wwwwwwwwwwwwwwww | : : 1093*4882a593Smuzhiyun | | +------+ | : : 1094*4882a593Smuzhiyun | | : | C=&B |--- | : : +-------+ 1095*4882a593Smuzhiyun | | : +------+ \ | +-------+ | | 1096*4882a593Smuzhiyun | |------>| D=4 | ----------->| C->&B |------>| | 1097*4882a593Smuzhiyun | | +------+ | +-------+ | | 1098*4882a593Smuzhiyun +-------+ : : | : : | | 1099*4882a593Smuzhiyun | : : | | 1100*4882a593Smuzhiyun | : : | CPU 2 | 1101*4882a593Smuzhiyun | +-------+ | | 1102*4882a593Smuzhiyun | | X->9 |------>| | 1103*4882a593Smuzhiyun | +-------+ | | 1104*4882a593Smuzhiyun Makes sure all effects ---> \ ddddddddddddddddd | | 1105*4882a593Smuzhiyun prior to the store of C \ +-------+ | | 1106*4882a593Smuzhiyun are perceptible to ----->| B->2 |------>| | 1107*4882a593Smuzhiyun subsequent loads +-------+ | | 1108*4882a593Smuzhiyun : : +-------+ 1109*4882a593Smuzhiyun 1110*4882a593Smuzhiyun 1111*4882a593SmuzhiyunAnd thirdly, a read barrier acts as a partial order on loads. 
Consider the 1112*4882a593Smuzhiyunfollowing sequence of events: 1113*4882a593Smuzhiyun 1114*4882a593Smuzhiyun CPU 1 CPU 2 1115*4882a593Smuzhiyun ======================= ======================= 1116*4882a593Smuzhiyun { A = 0, B = 9 } 1117*4882a593Smuzhiyun STORE A=1 1118*4882a593Smuzhiyun <write barrier> 1119*4882a593Smuzhiyun STORE B=2 1120*4882a593Smuzhiyun LOAD B 1121*4882a593Smuzhiyun LOAD A 1122*4882a593Smuzhiyun 1123*4882a593SmuzhiyunWithout intervention, CPU 2 may then choose to perceive the events on CPU 1 in 1124*4882a593Smuzhiyunsome effectively random order, despite the write barrier issued by CPU 1: 1125*4882a593Smuzhiyun 1126*4882a593Smuzhiyun +-------+ : : : : 1127*4882a593Smuzhiyun | | +------+ +-------+ 1128*4882a593Smuzhiyun | |------>| A=1 |------ --->| A->0 | 1129*4882a593Smuzhiyun | | +------+ \ +-------+ 1130*4882a593Smuzhiyun | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1131*4882a593Smuzhiyun | | +------+ | +-------+ 1132*4882a593Smuzhiyun | |------>| B=2 |--- | : : 1133*4882a593Smuzhiyun | | +------+ \ | : : +-------+ 1134*4882a593Smuzhiyun +-------+ : : \ | +-------+ | | 1135*4882a593Smuzhiyun ---------->| B->2 |------>| | 1136*4882a593Smuzhiyun | +-------+ | CPU 2 | 1137*4882a593Smuzhiyun | | A->0 |------>| | 1138*4882a593Smuzhiyun | +-------+ | | 1139*4882a593Smuzhiyun | : : +-------+ 1140*4882a593Smuzhiyun \ : : 1141*4882a593Smuzhiyun \ +-------+ 1142*4882a593Smuzhiyun ---->| A->1 | 1143*4882a593Smuzhiyun +-------+ 1144*4882a593Smuzhiyun : : 1145*4882a593Smuzhiyun 1146*4882a593Smuzhiyun 1147*4882a593SmuzhiyunIf, however, a read barrier were to be placed between the load of B and the 1148*4882a593Smuzhiyunload of A on CPU 2: 1149*4882a593Smuzhiyun 1150*4882a593Smuzhiyun CPU 1 CPU 2 1151*4882a593Smuzhiyun ======================= ======================= 1152*4882a593Smuzhiyun { A = 0, B = 9 } 1153*4882a593Smuzhiyun STORE A=1 1154*4882a593Smuzhiyun <write barrier> 1155*4882a593Smuzhiyun STORE B=2 1156*4882a593Smuzhiyun LOAD B 1157*4882a593Smuzhiyun <read barrier> 1158*4882a593Smuzhiyun LOAD A 1159*4882a593Smuzhiyun 1160*4882a593Smuzhiyunthen the partial ordering imposed by CPU 1 will be perceived correctly by CPU 1161*4882a593Smuzhiyun2: 1162*4882a593Smuzhiyun 1163*4882a593Smuzhiyun +-------+ : : : : 1164*4882a593Smuzhiyun | | +------+ +-------+ 1165*4882a593Smuzhiyun | |------>| A=1 |------ --->| A->0 | 1166*4882a593Smuzhiyun | | +------+ \ +-------+ 1167*4882a593Smuzhiyun | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1168*4882a593Smuzhiyun | | +------+ | +-------+ 1169*4882a593Smuzhiyun | |------>| B=2 |--- | : : 1170*4882a593Smuzhiyun | | +------+ \ | : : +-------+ 1171*4882a593Smuzhiyun +-------+ : : \ | +-------+ | | 1172*4882a593Smuzhiyun ---------->| B->2 |------>| | 1173*4882a593Smuzhiyun | +-------+ | CPU 2 | 1174*4882a593Smuzhiyun | : : | | 1175*4882a593Smuzhiyun | : : | | 1176*4882a593Smuzhiyun At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 1177*4882a593Smuzhiyun barrier causes all effects \ +-------+ | | 1178*4882a593Smuzhiyun prior to the storage of B ---->| A->1 |------>| | 1179*4882a593Smuzhiyun to be perceptible to CPU 2 +-------+ | | 1180*4882a593Smuzhiyun : : +-------+ 1181*4882a593Smuzhiyun 1182*4882a593Smuzhiyun 1183*4882a593SmuzhiyunTo illustrate this more completely, consider what could happen if the code 1184*4882a593Smuzhiyuncontained a load of A either side of the read barrier: 1185*4882a593Smuzhiyun 1186*4882a593Smuzhiyun CPU 1 CPU 2 1187*4882a593Smuzhiyun ======================= ======================= 1188*4882a593Smuzhiyun { A = 0, B = 9 } 
1189*4882a593Smuzhiyun STORE A=1 1190*4882a593Smuzhiyun <write barrier> 1191*4882a593Smuzhiyun STORE B=2 1192*4882a593Smuzhiyun LOAD B 1193*4882a593Smuzhiyun LOAD A [first load of A] 1194*4882a593Smuzhiyun <read barrier> 1195*4882a593Smuzhiyun LOAD A [second load of A] 1196*4882a593Smuzhiyun 1197*4882a593SmuzhiyunEven though the two loads of A both occur after the load of B, they may both 1198*4882a593Smuzhiyuncome up with different values: 1199*4882a593Smuzhiyun 1200*4882a593Smuzhiyun +-------+ : : : : 1201*4882a593Smuzhiyun | | +------+ +-------+ 1202*4882a593Smuzhiyun | |------>| A=1 |------ --->| A->0 | 1203*4882a593Smuzhiyun | | +------+ \ +-------+ 1204*4882a593Smuzhiyun | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1205*4882a593Smuzhiyun | | +------+ | +-------+ 1206*4882a593Smuzhiyun | |------>| B=2 |--- | : : 1207*4882a593Smuzhiyun | | +------+ \ | : : +-------+ 1208*4882a593Smuzhiyun +-------+ : : \ | +-------+ | | 1209*4882a593Smuzhiyun ---------->| B->2 |------>| | 1210*4882a593Smuzhiyun | +-------+ | CPU 2 | 1211*4882a593Smuzhiyun | : : | | 1212*4882a593Smuzhiyun | : : | | 1213*4882a593Smuzhiyun | +-------+ | | 1214*4882a593Smuzhiyun | | A->0 |------>| 1st | 1215*4882a593Smuzhiyun | +-------+ | | 1216*4882a593Smuzhiyun At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 1217*4882a593Smuzhiyun barrier causes all effects \ +-------+ | | 1218*4882a593Smuzhiyun prior to the storage of B ---->| A->1 |------>| 2nd | 1219*4882a593Smuzhiyun to be perceptible to CPU 2 +-------+ | | 1220*4882a593Smuzhiyun : : +-------+ 1221*4882a593Smuzhiyun 1222*4882a593Smuzhiyun 1223*4882a593SmuzhiyunBut it may be that the update to A from CPU 1 becomes perceptible to CPU 2 1224*4882a593Smuzhiyunbefore the read barrier completes anyway: 1225*4882a593Smuzhiyun 1226*4882a593Smuzhiyun +-------+ : : : : 1227*4882a593Smuzhiyun | | +------+ +-------+ 1228*4882a593Smuzhiyun | |------>| A=1 |------ --->| A->0 | 1229*4882a593Smuzhiyun | | +------+ \ +-------+ 1230*4882a593Smuzhiyun | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 1231*4882a593Smuzhiyun | | +------+ | +-------+ 1232*4882a593Smuzhiyun | |------>| B=2 |--- | : : 1233*4882a593Smuzhiyun | | +------+ \ | : : +-------+ 1234*4882a593Smuzhiyun +-------+ : : \ | +-------+ | | 1235*4882a593Smuzhiyun ---------->| B->2 |------>| | 1236*4882a593Smuzhiyun | +-------+ | CPU 2 | 1237*4882a593Smuzhiyun | : : | | 1238*4882a593Smuzhiyun \ : : | | 1239*4882a593Smuzhiyun \ +-------+ | | 1240*4882a593Smuzhiyun ---->| A->1 |------>| 1st | 1241*4882a593Smuzhiyun +-------+ | | 1242*4882a593Smuzhiyun rrrrrrrrrrrrrrrrr | | 1243*4882a593Smuzhiyun +-------+ | | 1244*4882a593Smuzhiyun | A->1 |------>| 2nd | 1245*4882a593Smuzhiyun +-------+ | | 1246*4882a593Smuzhiyun : : +-------+ 1247*4882a593Smuzhiyun 1248*4882a593Smuzhiyun 1249*4882a593SmuzhiyunThe guarantee is that the second load will always come up with A == 1 if the 1250*4882a593Smuzhiyunload of B came up with B == 2. No such guarantee exists for the first load of 1251*4882a593SmuzhiyunA; that may come up with either A == 0 or A == 1. 
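
Expressed in kernel C rather than in the abstract STORE/LOAD notation, the
read-barrier example above corresponds to something like the following
sketch (the variable and function names are illustrative only and are not
taken from any real kernel code):

	int A = 0, B = 9;

	void cpu1_writer(void)
	{
		WRITE_ONCE(A, 1);
		smp_wmb();		/* commit the store to A before the store to B */
		WRITE_ONCE(B, 2);
	}

	void cpu2_reader(void)
	{
		int b, a;

		b = READ_ONCE(B);
		smp_rmb();		/* pairs with the smp_wmb() in cpu1_writer() */
		a = READ_ONCE(A);

		/* If b == 2, then a must be 1; if b == 9, a may be either 0 or 1. */
	}

This is simply the SMP barrier pairing described earlier: the smp_wmb() on
CPU 1 pairs with the smp_rmb() on CPU 2.
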
1252*4882a593Smuzhiyun 1253*4882a593Smuzhiyun 1254*4882a593SmuzhiyunREAD MEMORY BARRIERS VS LOAD SPECULATION 1255*4882a593Smuzhiyun---------------------------------------- 1256*4882a593Smuzhiyun 1257*4882a593SmuzhiyunMany CPUs speculate with loads: that is they see that they will need to load an 1258*4882a593Smuzhiyunitem from memory, and they find a time where they're not using the bus for any 1259*4882a593Smuzhiyunother loads, and so do the load in advance - even though they haven't actually 1260*4882a593Smuzhiyungot to that point in the instruction execution flow yet. This permits the 1261*4882a593Smuzhiyunactual load instruction to potentially complete immediately because the CPU 1262*4882a593Smuzhiyunalready has the value to hand. 1263*4882a593Smuzhiyun 1264*4882a593SmuzhiyunIt may turn out that the CPU didn't actually need the value - perhaps because a 1265*4882a593Smuzhiyunbranch circumvented the load - in which case it can discard the value or just 1266*4882a593Smuzhiyuncache it for later use. 1267*4882a593Smuzhiyun 1268*4882a593SmuzhiyunConsider: 1269*4882a593Smuzhiyun 1270*4882a593Smuzhiyun CPU 1 CPU 2 1271*4882a593Smuzhiyun ======================= ======================= 1272*4882a593Smuzhiyun LOAD B 1273*4882a593Smuzhiyun DIVIDE } Divide instructions generally 1274*4882a593Smuzhiyun DIVIDE } take a long time to perform 1275*4882a593Smuzhiyun LOAD A 1276*4882a593Smuzhiyun 1277*4882a593SmuzhiyunWhich might appear as this: 1278*4882a593Smuzhiyun 1279*4882a593Smuzhiyun : : +-------+ 1280*4882a593Smuzhiyun +-------+ | | 1281*4882a593Smuzhiyun --->| B->2 |------>| | 1282*4882a593Smuzhiyun +-------+ | CPU 2 | 1283*4882a593Smuzhiyun : :DIVIDE | | 1284*4882a593Smuzhiyun +-------+ | | 1285*4882a593Smuzhiyun The CPU being busy doing a ---> --->| A->0 |~~~~ | | 1286*4882a593Smuzhiyun division speculates on the +-------+ ~ | | 1287*4882a593Smuzhiyun LOAD of A : : ~ | | 1288*4882a593Smuzhiyun : :DIVIDE | | 1289*4882a593Smuzhiyun : : ~ | | 1290*4882a593Smuzhiyun Once the divisions are complete --> : : ~-->| | 1291*4882a593Smuzhiyun the CPU can then perform the : : | | 1292*4882a593Smuzhiyun LOAD with immediate effect : : +-------+ 1293*4882a593Smuzhiyun 1294*4882a593Smuzhiyun 1295*4882a593SmuzhiyunPlacing a read barrier or a data dependency barrier just before the second 1296*4882a593Smuzhiyunload: 1297*4882a593Smuzhiyun 1298*4882a593Smuzhiyun CPU 1 CPU 2 1299*4882a593Smuzhiyun ======================= ======================= 1300*4882a593Smuzhiyun LOAD B 1301*4882a593Smuzhiyun DIVIDE 1302*4882a593Smuzhiyun DIVIDE 1303*4882a593Smuzhiyun <read barrier> 1304*4882a593Smuzhiyun LOAD A 1305*4882a593Smuzhiyun 1306*4882a593Smuzhiyunwill force any value speculatively obtained to be reconsidered to an extent 1307*4882a593Smuzhiyundependent on the type of barrier used. 
If there was no change made to the 1308*4882a593Smuzhiyunspeculated memory location, then the speculated value will just be used: 1309*4882a593Smuzhiyun 1310*4882a593Smuzhiyun : : +-------+ 1311*4882a593Smuzhiyun +-------+ | | 1312*4882a593Smuzhiyun --->| B->2 |------>| | 1313*4882a593Smuzhiyun +-------+ | CPU 2 | 1314*4882a593Smuzhiyun : :DIVIDE | | 1315*4882a593Smuzhiyun +-------+ | | 1316*4882a593Smuzhiyun The CPU being busy doing a ---> --->| A->0 |~~~~ | | 1317*4882a593Smuzhiyun division speculates on the +-------+ ~ | | 1318*4882a593Smuzhiyun LOAD of A : : ~ | | 1319*4882a593Smuzhiyun : :DIVIDE | | 1320*4882a593Smuzhiyun : : ~ | | 1321*4882a593Smuzhiyun : : ~ | | 1322*4882a593Smuzhiyun rrrrrrrrrrrrrrrr~ | | 1323*4882a593Smuzhiyun : : ~ | | 1324*4882a593Smuzhiyun : : ~-->| | 1325*4882a593Smuzhiyun : : | | 1326*4882a593Smuzhiyun : : +-------+ 1327*4882a593Smuzhiyun 1328*4882a593Smuzhiyun 1329*4882a593Smuzhiyunbut if there was an update or an invalidation from another CPU pending, then 1330*4882a593Smuzhiyunthe speculation will be cancelled and the value reloaded: 1331*4882a593Smuzhiyun 1332*4882a593Smuzhiyun : : +-------+ 1333*4882a593Smuzhiyun +-------+ | | 1334*4882a593Smuzhiyun --->| B->2 |------>| | 1335*4882a593Smuzhiyun +-------+ | CPU 2 | 1336*4882a593Smuzhiyun : :DIVIDE | | 1337*4882a593Smuzhiyun +-------+ | | 1338*4882a593Smuzhiyun The CPU being busy doing a ---> --->| A->0 |~~~~ | | 1339*4882a593Smuzhiyun division speculates on the +-------+ ~ | | 1340*4882a593Smuzhiyun LOAD of A : : ~ | | 1341*4882a593Smuzhiyun : :DIVIDE | | 1342*4882a593Smuzhiyun : : ~ | | 1343*4882a593Smuzhiyun : : ~ | | 1344*4882a593Smuzhiyun rrrrrrrrrrrrrrrrr | | 1345*4882a593Smuzhiyun +-------+ | | 1346*4882a593Smuzhiyun The speculation is discarded ---> --->| A->1 |------>| | 1347*4882a593Smuzhiyun and an updated value is +-------+ | | 1348*4882a593Smuzhiyun retrieved : : +-------+ 1349*4882a593Smuzhiyun 1350*4882a593Smuzhiyun 1351*4882a593SmuzhiyunMULTICOPY ATOMICITY 1352*4882a593Smuzhiyun-------------------- 1353*4882a593Smuzhiyun 1354*4882a593SmuzhiyunMulticopy atomicity is a deeply intuitive notion about ordering that is 1355*4882a593Smuzhiyunnot always provided by real computer systems, namely that a given store 1356*4882a593Smuzhiyunbecomes visible at the same time to all CPUs, or, alternatively, that all 1357*4882a593SmuzhiyunCPUs agree on the order in which all stores become visible. However, 1358*4882a593Smuzhiyunsupport of full multicopy atomicity would rule out valuable hardware 1359*4882a593Smuzhiyunoptimizations, so a weaker form called ``other multicopy atomicity'' 1360*4882a593Smuzhiyuninstead guarantees only that a given store becomes visible at the same 1361*4882a593Smuzhiyuntime to all -other- CPUs. The remainder of this document discusses this 1362*4882a593Smuzhiyunweaker form, but for brevity will call it simply ``multicopy atomicity''. 1363*4882a593Smuzhiyun 1364*4882a593SmuzhiyunThe following example demonstrates multicopy atomicity: 1365*4882a593Smuzhiyun 1366*4882a593Smuzhiyun CPU 1 CPU 2 CPU 3 1367*4882a593Smuzhiyun ======================= ======================= ======================= 1368*4882a593Smuzhiyun { X = 0, Y = 0 } 1369*4882a593Smuzhiyun STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) 1370*4882a593Smuzhiyun <general barrier> <read barrier> 1371*4882a593Smuzhiyun STORE Y=r1 LOAD X 1372*4882a593Smuzhiyun 1373*4882a593SmuzhiyunSuppose that CPU 2's load from X returns 1, which it then stores to Y, 1374*4882a593Smuzhiyunand CPU 3's load from Y returns 1. 
This indicates that CPU 1's store 1375*4882a593Smuzhiyunto X precedes CPU 2's load from X and that CPU 2's store to Y precedes 1376*4882a593SmuzhiyunCPU 3's load from Y. In addition, the memory barriers guarantee that 1377*4882a593SmuzhiyunCPU 2 executes its load before its store, and CPU 3 loads from Y before 1378*4882a593Smuzhiyunit loads from X. The question is then "Can CPU 3's load from X return 0?" 1379*4882a593Smuzhiyun 1380*4882a593SmuzhiyunBecause CPU 3's load from X in some sense comes after CPU 2's load, it 1381*4882a593Smuzhiyunis natural to expect that CPU 3's load from X must therefore return 1. 1382*4882a593SmuzhiyunThis expectation follows from multicopy atomicity: if a load executing 1383*4882a593Smuzhiyunon CPU B follows a load from the same variable executing on CPU A (and 1384*4882a593SmuzhiyunCPU A did not originally store the value which it read), then on 1385*4882a593Smuzhiyunmulticopy-atomic systems, CPU B's load must return either the same value 1386*4882a593Smuzhiyunthat CPU A's load did or some later value. However, the Linux kernel 1387*4882a593Smuzhiyundoes not require systems to be multicopy atomic. 1388*4882a593Smuzhiyun 1389*4882a593SmuzhiyunThe use of a general memory barrier in the example above compensates 1390*4882a593Smuzhiyunfor any lack of multicopy atomicity. In the example, if CPU 2's load 1391*4882a593Smuzhiyunfrom X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load 1392*4882a593Smuzhiyunfrom X must indeed also return 1. 1393*4882a593Smuzhiyun 1394*4882a593SmuzhiyunHowever, dependencies, read barriers, and write barriers are not always 1395*4882a593Smuzhiyunable to compensate for non-multicopy atomicity. For example, suppose 1396*4882a593Smuzhiyunthat CPU 2's general barrier is removed from the above example, leaving 1397*4882a593Smuzhiyunonly the data dependency shown below: 1398*4882a593Smuzhiyun 1399*4882a593Smuzhiyun CPU 1 CPU 2 CPU 3 1400*4882a593Smuzhiyun ======================= ======================= ======================= 1401*4882a593Smuzhiyun { X = 0, Y = 0 } 1402*4882a593Smuzhiyun STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) 1403*4882a593Smuzhiyun <data dependency> <read barrier> 1404*4882a593Smuzhiyun STORE Y=r1 LOAD X (reads 0) 1405*4882a593Smuzhiyun 1406*4882a593SmuzhiyunThis substitution allows non-multicopy atomicity to run rampant: in 1407*4882a593Smuzhiyunthis example, it is perfectly legal for CPU 2's load from X to return 1, 1408*4882a593SmuzhiyunCPU 3's load from Y to return 1, and its load from X to return 0. 1409*4882a593Smuzhiyun 1410*4882a593SmuzhiyunThe key point is that although CPU 2's data dependency orders its load 1411*4882a593Smuzhiyunand store, it does not guarantee to order CPU 1's store. Thus, if this 1412*4882a593Smuzhiyunexample runs on a non-multicopy-atomic system where CPUs 1 and 2 share a 1413*4882a593Smuzhiyunstore buffer or a level of cache, CPU 2 might have early access to CPU 1's 1414*4882a593Smuzhiyunwrites. General barriers are therefore required to ensure that all CPUs 1415*4882a593Smuzhiyunagree on the combined order of multiple accesses. 1416*4882a593Smuzhiyun 1417*4882a593SmuzhiyunGeneral barriers can compensate not only for non-multicopy atomicity, 1418*4882a593Smuzhiyunbut can also generate additional ordering that can ensure that -all- 1419*4882a593SmuzhiyunCPUs will perceive the same order of -all- operations. 
In contrast, a 1420*4882a593Smuzhiyunchain of release-acquire pairs do not provide this additional ordering, 1421*4882a593Smuzhiyunwhich means that only those CPUs on the chain are guaranteed to agree 1422*4882a593Smuzhiyunon the combined order of the accesses. For example, switching to C code 1423*4882a593Smuzhiyunin deference to the ghost of Herman Hollerith: 1424*4882a593Smuzhiyun 1425*4882a593Smuzhiyun int u, v, x, y, z; 1426*4882a593Smuzhiyun 1427*4882a593Smuzhiyun void cpu0(void) 1428*4882a593Smuzhiyun { 1429*4882a593Smuzhiyun r0 = smp_load_acquire(&x); 1430*4882a593Smuzhiyun WRITE_ONCE(u, 1); 1431*4882a593Smuzhiyun smp_store_release(&y, 1); 1432*4882a593Smuzhiyun } 1433*4882a593Smuzhiyun 1434*4882a593Smuzhiyun void cpu1(void) 1435*4882a593Smuzhiyun { 1436*4882a593Smuzhiyun r1 = smp_load_acquire(&y); 1437*4882a593Smuzhiyun r4 = READ_ONCE(v); 1438*4882a593Smuzhiyun r5 = READ_ONCE(u); 1439*4882a593Smuzhiyun smp_store_release(&z, 1); 1440*4882a593Smuzhiyun } 1441*4882a593Smuzhiyun 1442*4882a593Smuzhiyun void cpu2(void) 1443*4882a593Smuzhiyun { 1444*4882a593Smuzhiyun r2 = smp_load_acquire(&z); 1445*4882a593Smuzhiyun smp_store_release(&x, 1); 1446*4882a593Smuzhiyun } 1447*4882a593Smuzhiyun 1448*4882a593Smuzhiyun void cpu3(void) 1449*4882a593Smuzhiyun { 1450*4882a593Smuzhiyun WRITE_ONCE(v, 1); 1451*4882a593Smuzhiyun smp_mb(); 1452*4882a593Smuzhiyun r3 = READ_ONCE(u); 1453*4882a593Smuzhiyun } 1454*4882a593Smuzhiyun 1455*4882a593SmuzhiyunBecause cpu0(), cpu1(), and cpu2() participate in a chain of 1456*4882a593Smuzhiyunsmp_store_release()/smp_load_acquire() pairs, the following outcome 1457*4882a593Smuzhiyunis prohibited: 1458*4882a593Smuzhiyun 1459*4882a593Smuzhiyun r0 == 1 && r1 == 1 && r2 == 1 1460*4882a593Smuzhiyun 1461*4882a593SmuzhiyunFurthermore, because of the release-acquire relationship between cpu0() 1462*4882a593Smuzhiyunand cpu1(), cpu1() must see cpu0()'s writes, so that the following 1463*4882a593Smuzhiyunoutcome is prohibited: 1464*4882a593Smuzhiyun 1465*4882a593Smuzhiyun r1 == 1 && r5 == 0 1466*4882a593Smuzhiyun 1467*4882a593SmuzhiyunHowever, the ordering provided by a release-acquire chain is local 1468*4882a593Smuzhiyunto the CPUs participating in that chain and does not apply to cpu3(), 1469*4882a593Smuzhiyunat least aside from stores. Therefore, the following outcome is possible: 1470*4882a593Smuzhiyun 1471*4882a593Smuzhiyun r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 1472*4882a593Smuzhiyun 1473*4882a593SmuzhiyunAs an aside, the following outcome is also possible: 1474*4882a593Smuzhiyun 1475*4882a593Smuzhiyun r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1 1476*4882a593Smuzhiyun 1477*4882a593SmuzhiyunAlthough cpu0(), cpu1(), and cpu2() will see their respective reads and 1478*4882a593Smuzhiyunwrites in order, CPUs not involved in the release-acquire chain might 1479*4882a593Smuzhiyunwell disagree on the order. This disagreement stems from the fact that 1480*4882a593Smuzhiyunthe weak memory-barrier instructions used to implement smp_load_acquire() 1481*4882a593Smuzhiyunand smp_store_release() are not required to order prior stores against 1482*4882a593Smuzhiyunsubsequent loads in all cases. This means that cpu3() can see cpu0()'s 1483*4882a593Smuzhiyunstore to u as happening -after- cpu1()'s load from v, even though 1484*4882a593Smuzhiyunboth cpu0() and cpu1() agree that these two operations occurred in the 1485*4882a593Smuzhiyunintended order. 
1486*4882a593Smuzhiyun 1487*4882a593SmuzhiyunHowever, please keep in mind that smp_load_acquire() is not magic. 1488*4882a593SmuzhiyunIn particular, it simply reads from its argument with ordering. It does 1489*4882a593Smuzhiyun-not- ensure that any particular value will be read. Therefore, the 1490*4882a593Smuzhiyunfollowing outcome is possible: 1491*4882a593Smuzhiyun 1492*4882a593Smuzhiyun r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 1493*4882a593Smuzhiyun 1494*4882a593SmuzhiyunNote that this outcome can happen even on a mythical sequentially 1495*4882a593Smuzhiyunconsistent system where nothing is ever reordered. 1496*4882a593Smuzhiyun 1497*4882a593SmuzhiyunTo reiterate, if your code requires full ordering of all operations, 1498*4882a593Smuzhiyunuse general barriers throughout. 1499*4882a593Smuzhiyun 1500*4882a593Smuzhiyun 1501*4882a593Smuzhiyun======================== 1502*4882a593SmuzhiyunEXPLICIT KERNEL BARRIERS 1503*4882a593Smuzhiyun======================== 1504*4882a593Smuzhiyun 1505*4882a593SmuzhiyunThe Linux kernel has a variety of different barriers that act at different 1506*4882a593Smuzhiyunlevels: 1507*4882a593Smuzhiyun 1508*4882a593Smuzhiyun (*) Compiler barrier. 1509*4882a593Smuzhiyun 1510*4882a593Smuzhiyun (*) CPU memory barriers. 1511*4882a593Smuzhiyun 1512*4882a593Smuzhiyun 1513*4882a593SmuzhiyunCOMPILER BARRIER 1514*4882a593Smuzhiyun---------------- 1515*4882a593Smuzhiyun 1516*4882a593SmuzhiyunThe Linux kernel has an explicit compiler barrier function that prevents the 1517*4882a593Smuzhiyuncompiler from moving the memory accesses either side of it to the other side: 1518*4882a593Smuzhiyun 1519*4882a593Smuzhiyun barrier(); 1520*4882a593Smuzhiyun 1521*4882a593SmuzhiyunThis is a general barrier -- there are no read-read or write-write 1522*4882a593Smuzhiyunvariants of barrier(). However, READ_ONCE() and WRITE_ONCE() can be 1523*4882a593Smuzhiyunthought of as weak forms of barrier() that affect only the specific 1524*4882a593Smuzhiyunaccesses flagged by the READ_ONCE() or WRITE_ONCE(). 1525*4882a593Smuzhiyun 1526*4882a593SmuzhiyunThe barrier() function has the following effects: 1527*4882a593Smuzhiyun 1528*4882a593Smuzhiyun (*) Prevents the compiler from reordering accesses following the 1529*4882a593Smuzhiyun barrier() to precede any accesses preceding the barrier(). 1530*4882a593Smuzhiyun One example use for this property is to ease communication between 1531*4882a593Smuzhiyun interrupt-handler code and the code that was interrupted. 1532*4882a593Smuzhiyun 1533*4882a593Smuzhiyun (*) Within a loop, forces the compiler to load the variables used 1534*4882a593Smuzhiyun in that loop's conditional on each pass through that loop. 1535*4882a593Smuzhiyun 1536*4882a593SmuzhiyunThe READ_ONCE() and WRITE_ONCE() functions can prevent any number of 1537*4882a593Smuzhiyunoptimizations that, while perfectly safe in single-threaded code, can 1538*4882a593Smuzhiyunbe fatal in concurrent code. Here are some examples of these sorts 1539*4882a593Smuzhiyunof optimizations: 1540*4882a593Smuzhiyun 1541*4882a593Smuzhiyun (*) The compiler is within its rights to reorder loads and stores 1542*4882a593Smuzhiyun to the same variable, and in some cases, the CPU is within its 1543*4882a593Smuzhiyun rights to reorder loads to the same variable. This means that 1544*4882a593Smuzhiyun the following code: 1545*4882a593Smuzhiyun 1546*4882a593Smuzhiyun a[0] = x; 1547*4882a593Smuzhiyun a[1] = x; 1548*4882a593Smuzhiyun 1549*4882a593Smuzhiyun Might result in an older value of x stored in a[1] than in a[0]. 
1550*4882a593Smuzhiyun Prevent both the compiler and the CPU from doing this as follows: 1551*4882a593Smuzhiyun 1552*4882a593Smuzhiyun a[0] = READ_ONCE(x); 1553*4882a593Smuzhiyun a[1] = READ_ONCE(x); 1554*4882a593Smuzhiyun 1555*4882a593Smuzhiyun In short, READ_ONCE() and WRITE_ONCE() provide cache coherence for 1556*4882a593Smuzhiyun accesses from multiple CPUs to a single variable. 1557*4882a593Smuzhiyun 1558*4882a593Smuzhiyun (*) The compiler is within its rights to merge successive loads from 1559*4882a593Smuzhiyun the same variable. Such merging can cause the compiler to "optimize" 1560*4882a593Smuzhiyun the following code: 1561*4882a593Smuzhiyun 1562*4882a593Smuzhiyun while (tmp = a) 1563*4882a593Smuzhiyun do_something_with(tmp); 1564*4882a593Smuzhiyun 1565*4882a593Smuzhiyun into the following code, which, although in some sense legitimate 1566*4882a593Smuzhiyun for single-threaded code, is almost certainly not what the developer 1567*4882a593Smuzhiyun intended: 1568*4882a593Smuzhiyun 1569*4882a593Smuzhiyun if (tmp = a) 1570*4882a593Smuzhiyun for (;;) 1571*4882a593Smuzhiyun do_something_with(tmp); 1572*4882a593Smuzhiyun 1573*4882a593Smuzhiyun Use READ_ONCE() to prevent the compiler from doing this to you: 1574*4882a593Smuzhiyun 1575*4882a593Smuzhiyun while (tmp = READ_ONCE(a)) 1576*4882a593Smuzhiyun do_something_with(tmp); 1577*4882a593Smuzhiyun 1578*4882a593Smuzhiyun (*) The compiler is within its rights to reload a variable, for example, 1579*4882a593Smuzhiyun in cases where high register pressure prevents the compiler from 1580*4882a593Smuzhiyun keeping all data of interest in registers. The compiler might 1581*4882a593Smuzhiyun therefore optimize the variable 'tmp' out of our previous example: 1582*4882a593Smuzhiyun 1583*4882a593Smuzhiyun while (tmp = a) 1584*4882a593Smuzhiyun do_something_with(tmp); 1585*4882a593Smuzhiyun 1586*4882a593Smuzhiyun This could result in the following code, which is perfectly safe in 1587*4882a593Smuzhiyun single-threaded code, but can be fatal in concurrent code: 1588*4882a593Smuzhiyun 1589*4882a593Smuzhiyun while (a) 1590*4882a593Smuzhiyun do_something_with(a); 1591*4882a593Smuzhiyun 1592*4882a593Smuzhiyun For example, the optimized version of this code could result in 1593*4882a593Smuzhiyun passing a zero to do_something_with() in the case where the variable 1594*4882a593Smuzhiyun a was modified by some other CPU between the "while" statement and 1595*4882a593Smuzhiyun the call to do_something_with(). 1596*4882a593Smuzhiyun 1597*4882a593Smuzhiyun Again, use READ_ONCE() to prevent the compiler from doing this: 1598*4882a593Smuzhiyun 1599*4882a593Smuzhiyun while (tmp = READ_ONCE(a)) 1600*4882a593Smuzhiyun do_something_with(tmp); 1601*4882a593Smuzhiyun 1602*4882a593Smuzhiyun Note that if the compiler runs short of registers, it might save 1603*4882a593Smuzhiyun tmp onto the stack. The overhead of this saving and later restoring 1604*4882a593Smuzhiyun is why compilers reload variables. Doing so is perfectly safe for 1605*4882a593Smuzhiyun single-threaded code, so you need to tell the compiler about cases 1606*4882a593Smuzhiyun where it is not safe. 1607*4882a593Smuzhiyun 1608*4882a593Smuzhiyun (*) The compiler is within its rights to omit a load entirely if it knows 1609*4882a593Smuzhiyun what the value will be. 
For example, if the compiler can prove that 1610*4882a593Smuzhiyun the value of variable 'a' is always zero, it can optimize this code: 1611*4882a593Smuzhiyun 1612*4882a593Smuzhiyun while (tmp = a) 1613*4882a593Smuzhiyun do_something_with(tmp); 1614*4882a593Smuzhiyun 1615*4882a593Smuzhiyun Into this: 1616*4882a593Smuzhiyun 1617*4882a593Smuzhiyun do { } while (0); 1618*4882a593Smuzhiyun 1619*4882a593Smuzhiyun This transformation is a win for single-threaded code because it 1620*4882a593Smuzhiyun gets rid of a load and a branch. The problem is that the compiler 1621*4882a593Smuzhiyun will carry out its proof assuming that the current CPU is the only 1622*4882a593Smuzhiyun one updating variable 'a'. If variable 'a' is shared, then the 1623*4882a593Smuzhiyun compiler's proof will be erroneous. Use READ_ONCE() to tell the 1624*4882a593Smuzhiyun compiler that it doesn't know as much as it thinks it does: 1625*4882a593Smuzhiyun 1626*4882a593Smuzhiyun while (tmp = READ_ONCE(a)) 1627*4882a593Smuzhiyun do_something_with(tmp); 1628*4882a593Smuzhiyun 1629*4882a593Smuzhiyun But please note that the compiler is also closely watching what you 1630*4882a593Smuzhiyun do with the value after the READ_ONCE(). For example, suppose you 1631*4882a593Smuzhiyun do the following and MAX is a preprocessor macro with the value 1: 1632*4882a593Smuzhiyun 1633*4882a593Smuzhiyun while ((tmp = READ_ONCE(a)) % MAX) 1634*4882a593Smuzhiyun do_something_with(tmp); 1635*4882a593Smuzhiyun 1636*4882a593Smuzhiyun Then the compiler knows that the result of the "%" operator applied 1637*4882a593Smuzhiyun to MAX will always be zero, again allowing the compiler to optimize 1638*4882a593Smuzhiyun the code into near-nonexistence. (It will still load from the 1639*4882a593Smuzhiyun variable 'a'.) 1640*4882a593Smuzhiyun 1641*4882a593Smuzhiyun (*) Similarly, the compiler is within its rights to omit a store entirely 1642*4882a593Smuzhiyun if it knows that the variable already has the value being stored. 1643*4882a593Smuzhiyun Again, the compiler assumes that the current CPU is the only one 1644*4882a593Smuzhiyun storing into the variable, which can cause the compiler to do the 1645*4882a593Smuzhiyun wrong thing for shared variables. For example, suppose you have 1646*4882a593Smuzhiyun the following: 1647*4882a593Smuzhiyun 1648*4882a593Smuzhiyun a = 0; 1649*4882a593Smuzhiyun ... Code that does not store to variable a ... 1650*4882a593Smuzhiyun a = 0; 1651*4882a593Smuzhiyun 1652*4882a593Smuzhiyun The compiler sees that the value of variable 'a' is already zero, so 1653*4882a593Smuzhiyun it might well omit the second store. This would come as a fatal 1654*4882a593Smuzhiyun surprise if some other CPU might have stored to variable 'a' in the 1655*4882a593Smuzhiyun meantime. 1656*4882a593Smuzhiyun 1657*4882a593Smuzhiyun Use WRITE_ONCE() to prevent the compiler from making this sort of 1658*4882a593Smuzhiyun wrong guess: 1659*4882a593Smuzhiyun 1660*4882a593Smuzhiyun WRITE_ONCE(a, 0); 1661*4882a593Smuzhiyun ... Code that does not store to variable a ... 1662*4882a593Smuzhiyun WRITE_ONCE(a, 0); 1663*4882a593Smuzhiyun 1664*4882a593Smuzhiyun (*) The compiler is within its rights to reorder memory accesses unless 1665*4882a593Smuzhiyun you tell it not to. 
For example, consider the following interaction 1666*4882a593Smuzhiyun between process-level code and an interrupt handler: 1667*4882a593Smuzhiyun 1668*4882a593Smuzhiyun void process_level(void) 1669*4882a593Smuzhiyun { 1670*4882a593Smuzhiyun msg = get_message(); 1671*4882a593Smuzhiyun flag = true; 1672*4882a593Smuzhiyun } 1673*4882a593Smuzhiyun 1674*4882a593Smuzhiyun void interrupt_handler(void) 1675*4882a593Smuzhiyun { 1676*4882a593Smuzhiyun if (flag) 1677*4882a593Smuzhiyun process_message(msg); 1678*4882a593Smuzhiyun } 1679*4882a593Smuzhiyun 1680*4882a593Smuzhiyun There is nothing to prevent the compiler from transforming 1681*4882a593Smuzhiyun process_level() to the following, in fact, this might well be a 1682*4882a593Smuzhiyun win for single-threaded code: 1683*4882a593Smuzhiyun 1684*4882a593Smuzhiyun void process_level(void) 1685*4882a593Smuzhiyun { 1686*4882a593Smuzhiyun flag = true; 1687*4882a593Smuzhiyun msg = get_message(); 1688*4882a593Smuzhiyun } 1689*4882a593Smuzhiyun 1690*4882a593Smuzhiyun If the interrupt occurs between these two statement, then 1691*4882a593Smuzhiyun interrupt_handler() might be passed a garbled msg. Use WRITE_ONCE() 1692*4882a593Smuzhiyun to prevent this as follows: 1693*4882a593Smuzhiyun 1694*4882a593Smuzhiyun void process_level(void) 1695*4882a593Smuzhiyun { 1696*4882a593Smuzhiyun WRITE_ONCE(msg, get_message()); 1697*4882a593Smuzhiyun WRITE_ONCE(flag, true); 1698*4882a593Smuzhiyun } 1699*4882a593Smuzhiyun 1700*4882a593Smuzhiyun void interrupt_handler(void) 1701*4882a593Smuzhiyun { 1702*4882a593Smuzhiyun if (READ_ONCE(flag)) 1703*4882a593Smuzhiyun process_message(READ_ONCE(msg)); 1704*4882a593Smuzhiyun } 1705*4882a593Smuzhiyun 1706*4882a593Smuzhiyun Note that the READ_ONCE() and WRITE_ONCE() wrappers in 1707*4882a593Smuzhiyun interrupt_handler() are needed if this interrupt handler can itself 1708*4882a593Smuzhiyun be interrupted by something that also accesses 'flag' and 'msg', 1709*4882a593Smuzhiyun for example, a nested interrupt or an NMI. Otherwise, READ_ONCE() 1710*4882a593Smuzhiyun and WRITE_ONCE() are not needed in interrupt_handler() other than 1711*4882a593Smuzhiyun for documentation purposes. (Note also that nested interrupts 1712*4882a593Smuzhiyun do not typically occur in modern Linux kernels, in fact, if an 1713*4882a593Smuzhiyun interrupt handler returns with interrupts enabled, you will get a 1714*4882a593Smuzhiyun WARN_ONCE() splat.) 1715*4882a593Smuzhiyun 1716*4882a593Smuzhiyun You should assume that the compiler can move READ_ONCE() and 1717*4882a593Smuzhiyun WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(), 1718*4882a593Smuzhiyun barrier(), or similar primitives. 1719*4882a593Smuzhiyun 1720*4882a593Smuzhiyun This effect could also be achieved using barrier(), but READ_ONCE() 1721*4882a593Smuzhiyun and WRITE_ONCE() are more selective: With READ_ONCE() and 1722*4882a593Smuzhiyun WRITE_ONCE(), the compiler need only forget the contents of the 1723*4882a593Smuzhiyun indicated memory locations, while with barrier() the compiler must 1724*4882a593Smuzhiyun discard the value of all memory locations that it has currently 1725*4882a593Smuzhiyun cached in any machine registers. Of course, the compiler must also 1726*4882a593Smuzhiyun respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur, 1727*4882a593Smuzhiyun though the CPU of course need not do so. 
1728*4882a593Smuzhiyun 1729*4882a593Smuzhiyun (*) The compiler is within its rights to invent stores to a variable, 1730*4882a593Smuzhiyun as in the following example: 1731*4882a593Smuzhiyun 1732*4882a593Smuzhiyun if (a) 1733*4882a593Smuzhiyun b = a; 1734*4882a593Smuzhiyun else 1735*4882a593Smuzhiyun b = 42; 1736*4882a593Smuzhiyun 1737*4882a593Smuzhiyun The compiler might save a branch by optimizing this as follows: 1738*4882a593Smuzhiyun 1739*4882a593Smuzhiyun b = 42; 1740*4882a593Smuzhiyun if (a) 1741*4882a593Smuzhiyun b = a; 1742*4882a593Smuzhiyun 1743*4882a593Smuzhiyun In single-threaded code, this is not only safe, but also saves 1744*4882a593Smuzhiyun a branch. Unfortunately, in concurrent code, this optimization 1745*4882a593Smuzhiyun could cause some other CPU to see a spurious value of 42 -- even 1746*4882a593Smuzhiyun if variable 'a' was never zero -- when loading variable 'b'. 1747*4882a593Smuzhiyun Use WRITE_ONCE() to prevent this as follows: 1748*4882a593Smuzhiyun 1749*4882a593Smuzhiyun if (a) 1750*4882a593Smuzhiyun WRITE_ONCE(b, a); 1751*4882a593Smuzhiyun else 1752*4882a593Smuzhiyun WRITE_ONCE(b, 42); 1753*4882a593Smuzhiyun 1754*4882a593Smuzhiyun The compiler can also invent loads. These are usually less 1755*4882a593Smuzhiyun damaging, but they can result in cache-line bouncing and thus in 1756*4882a593Smuzhiyun poor performance and scalability. Use READ_ONCE() to prevent 1757*4882a593Smuzhiyun invented loads. 1758*4882a593Smuzhiyun 1759*4882a593Smuzhiyun (*) For aligned memory locations whose size allows them to be accessed 1760*4882a593Smuzhiyun with a single memory-reference instruction, prevents "load tearing" 1761*4882a593Smuzhiyun and "store tearing," in which a single large access is replaced by 1762*4882a593Smuzhiyun multiple smaller accesses. For example, given an architecture having 1763*4882a593Smuzhiyun 16-bit store instructions with 7-bit immediate fields, the compiler 1764*4882a593Smuzhiyun might be tempted to use two 16-bit store-immediate instructions to 1765*4882a593Smuzhiyun implement the following 32-bit store: 1766*4882a593Smuzhiyun 1767*4882a593Smuzhiyun p = 0x00010002; 1768*4882a593Smuzhiyun 1769*4882a593Smuzhiyun Please note that GCC really does use this sort of optimization, 1770*4882a593Smuzhiyun which is not surprising given that it would likely take more 1771*4882a593Smuzhiyun than two instructions to build the constant and then store it. 1772*4882a593Smuzhiyun This optimization can therefore be a win in single-threaded code. 1773*4882a593Smuzhiyun In fact, a recent bug (since fixed) caused GCC to incorrectly use 1774*4882a593Smuzhiyun this optimization in a volatile store. In the absence of such bugs, 1775*4882a593Smuzhiyun use of WRITE_ONCE() prevents store tearing in the following example: 1776*4882a593Smuzhiyun 1777*4882a593Smuzhiyun WRITE_ONCE(p, 0x00010002); 1778*4882a593Smuzhiyun 1779*4882a593Smuzhiyun Use of packed structures can also result in load and store tearing, 1780*4882a593Smuzhiyun as in this example: 1781*4882a593Smuzhiyun 1782*4882a593Smuzhiyun struct __attribute__((__packed__)) foo { 1783*4882a593Smuzhiyun short a; 1784*4882a593Smuzhiyun int b; 1785*4882a593Smuzhiyun short c; 1786*4882a593Smuzhiyun }; 1787*4882a593Smuzhiyun struct foo foo1, foo2; 1788*4882a593Smuzhiyun ... 
1789*4882a593Smuzhiyun 1790*4882a593Smuzhiyun foo2.a = foo1.a; 1791*4882a593Smuzhiyun foo2.b = foo1.b; 1792*4882a593Smuzhiyun foo2.c = foo1.c; 1793*4882a593Smuzhiyun 1794*4882a593Smuzhiyun Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no 1795*4882a593Smuzhiyun volatile markings, the compiler would be well within its rights to 1796*4882a593Smuzhiyun implement these three assignment statements as a pair of 32-bit 1797*4882a593Smuzhiyun loads followed by a pair of 32-bit stores. This would result in 1798*4882a593Smuzhiyun load tearing on 'foo1.b' and store tearing on 'foo2.b'. READ_ONCE() 1799*4882a593Smuzhiyun and WRITE_ONCE() again prevent tearing in this example: 1800*4882a593Smuzhiyun 1801*4882a593Smuzhiyun foo2.a = foo1.a; 1802*4882a593Smuzhiyun WRITE_ONCE(foo2.b, READ_ONCE(foo1.b)); 1803*4882a593Smuzhiyun foo2.c = foo1.c; 1804*4882a593Smuzhiyun 1805*4882a593SmuzhiyunAll that aside, it is never necessary to use READ_ONCE() and 1806*4882a593SmuzhiyunWRITE_ONCE() on a variable that has been marked volatile. For example, 1807*4882a593Smuzhiyunbecause 'jiffies' is marked volatile, it is never necessary to 1808*4882a593Smuzhiyunsay READ_ONCE(jiffies). The reason for this is that READ_ONCE() and 1809*4882a593SmuzhiyunWRITE_ONCE() are implemented as volatile casts, which has no effect when 1810*4882a593Smuzhiyunits argument is already marked volatile. 1811*4882a593Smuzhiyun 1812*4882a593SmuzhiyunPlease note that these compiler barriers have no direct effect on the CPU, 1813*4882a593Smuzhiyunwhich may then reorder things however it wishes. 1814*4882a593Smuzhiyun 1815*4882a593Smuzhiyun 1816*4882a593SmuzhiyunCPU MEMORY BARRIERS 1817*4882a593Smuzhiyun------------------- 1818*4882a593Smuzhiyun 1819*4882a593SmuzhiyunThe Linux kernel has eight basic CPU memory barriers: 1820*4882a593Smuzhiyun 1821*4882a593Smuzhiyun TYPE MANDATORY SMP CONDITIONAL 1822*4882a593Smuzhiyun =============== ======================= =========================== 1823*4882a593Smuzhiyun GENERAL mb() smp_mb() 1824*4882a593Smuzhiyun WRITE wmb() smp_wmb() 1825*4882a593Smuzhiyun READ rmb() smp_rmb() 1826*4882a593Smuzhiyun DATA DEPENDENCY READ_ONCE() 1827*4882a593Smuzhiyun 1828*4882a593Smuzhiyun 1829*4882a593SmuzhiyunAll memory barriers except the data dependency barriers imply a compiler 1830*4882a593Smuzhiyunbarrier. Data dependencies do not impose any additional compiler ordering. 1831*4882a593Smuzhiyun 1832*4882a593SmuzhiyunAside: In the case of data dependencies, the compiler would be expected 1833*4882a593Smuzhiyunto issue the loads in the correct order (eg. `a[b]` would have to load 1834*4882a593Smuzhiyunthe value of b before loading a[b]), however there is no guarantee in 1835*4882a593Smuzhiyunthe C specification that the compiler may not speculate the value of b 1836*4882a593Smuzhiyun(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1) 1837*4882a593Smuzhiyuntmp = a[b]; ). There is also the problem of a compiler reloading b after 1838*4882a593Smuzhiyunhaving loaded a[b], thus having a newer copy of b than a[b]. A consensus 1839*4882a593Smuzhiyunhas not yet been reached about these problems, however the READ_ONCE() 1840*4882a593Smuzhiyunmacro is a good place to start looking. 1841*4882a593Smuzhiyun 1842*4882a593SmuzhiyunSMP memory barriers are reduced to compiler barriers on uniprocessor compiled 1843*4882a593Smuzhiyunsystems because it is assumed that a CPU will appear to be self-consistent, 1844*4882a593Smuzhiyunand will order overlapping accesses correctly with respect to itself. 
1845*4882a593SmuzhiyunHowever, see the subsection on "Virtual Machine Guests" below. 1846*4882a593Smuzhiyun 1847*4882a593Smuzhiyun[!] Note that SMP memory barriers _must_ be used to control the ordering of 1848*4882a593Smuzhiyunreferences to shared memory on SMP systems, though the use of locking instead 1849*4882a593Smuzhiyunis sufficient. 1850*4882a593Smuzhiyun 1851*4882a593SmuzhiyunMandatory barriers should not be used to control SMP effects, since mandatory 1852*4882a593Smuzhiyunbarriers impose unnecessary overhead on both SMP and UP systems. They may, 1853*4882a593Smuzhiyunhowever, be used to control MMIO effects on accesses through relaxed memory I/O 1854*4882a593Smuzhiyunwindows. These barriers are required even on non-SMP systems as they affect 1855*4882a593Smuzhiyunthe order in which memory operations appear to a device by prohibiting both the 1856*4882a593Smuzhiyuncompiler and the CPU from reordering them. 1857*4882a593Smuzhiyun 1858*4882a593Smuzhiyun 1859*4882a593SmuzhiyunThere are some more advanced barrier functions: 1860*4882a593Smuzhiyun 1861*4882a593Smuzhiyun (*) smp_store_mb(var, value) 1862*4882a593Smuzhiyun 1863*4882a593Smuzhiyun This assigns the value to the variable and then inserts a full memory 1864*4882a593Smuzhiyun barrier after it. It isn't guaranteed to insert anything more than a 1865*4882a593Smuzhiyun compiler barrier in a UP compilation. 1866*4882a593Smuzhiyun 1867*4882a593Smuzhiyun 1868*4882a593Smuzhiyun (*) smp_mb__before_atomic(); 1869*4882a593Smuzhiyun (*) smp_mb__after_atomic(); 1870*4882a593Smuzhiyun 1871*4882a593Smuzhiyun These are for use with atomic RMW functions that do not imply memory 1872*4882a593Smuzhiyun barriers, but where the code needs a memory barrier. Examples for atomic 1873*4882a593Smuzhiyun RMW functions that do not imply are memory barrier are e.g. add, 1874*4882a593Smuzhiyun subtract, (failed) conditional operations, _relaxed functions, 1875*4882a593Smuzhiyun but not atomic_read or atomic_set. A common example where a memory 1876*4882a593Smuzhiyun barrier may be required is when atomic ops are used for reference 1877*4882a593Smuzhiyun counting. 1878*4882a593Smuzhiyun 1879*4882a593Smuzhiyun These are also used for atomic RMW bitop functions that do not imply a 1880*4882a593Smuzhiyun memory barrier (such as set_bit and clear_bit). 1881*4882a593Smuzhiyun 1882*4882a593Smuzhiyun As an example, consider a piece of code that marks an object as being dead 1883*4882a593Smuzhiyun and then decrements the object's reference count: 1884*4882a593Smuzhiyun 1885*4882a593Smuzhiyun obj->dead = 1; 1886*4882a593Smuzhiyun smp_mb__before_atomic(); 1887*4882a593Smuzhiyun atomic_dec(&obj->ref_count); 1888*4882a593Smuzhiyun 1889*4882a593Smuzhiyun This makes sure that the death mark on the object is perceived to be set 1890*4882a593Smuzhiyun *before* the reference counter is decremented. 1891*4882a593Smuzhiyun 1892*4882a593Smuzhiyun See Documentation/atomic_{t,bitops}.txt for more information. 1893*4882a593Smuzhiyun 1894*4882a593Smuzhiyun 1895*4882a593Smuzhiyun (*) dma_wmb(); 1896*4882a593Smuzhiyun (*) dma_rmb(); 1897*4882a593Smuzhiyun 1898*4882a593Smuzhiyun These are for use with consistent memory to guarantee the ordering 1899*4882a593Smuzhiyun of writes or reads of shared memory accessible to both the CPU and a 1900*4882a593Smuzhiyun DMA capable device. 
1901*4882a593Smuzhiyun 1902*4882a593Smuzhiyun For example, consider a device driver that shares memory with a device 1903*4882a593Smuzhiyun and uses a descriptor status value to indicate if the descriptor belongs 1904*4882a593Smuzhiyun to the device or the CPU, and a doorbell to notify it when new 1905*4882a593Smuzhiyun descriptors are available: 1906*4882a593Smuzhiyun 1907*4882a593Smuzhiyun if (desc->status != DEVICE_OWN) { 1908*4882a593Smuzhiyun /* do not read data until we own descriptor */ 1909*4882a593Smuzhiyun dma_rmb(); 1910*4882a593Smuzhiyun 1911*4882a593Smuzhiyun /* read/modify data */ 1912*4882a593Smuzhiyun read_data = desc->data; 1913*4882a593Smuzhiyun desc->data = write_data; 1914*4882a593Smuzhiyun 1915*4882a593Smuzhiyun /* flush modifications before status update */ 1916*4882a593Smuzhiyun dma_wmb(); 1917*4882a593Smuzhiyun 1918*4882a593Smuzhiyun /* assign ownership */ 1919*4882a593Smuzhiyun desc->status = DEVICE_OWN; 1920*4882a593Smuzhiyun 1921*4882a593Smuzhiyun /* notify device of new descriptors */ 1922*4882a593Smuzhiyun writel(DESC_NOTIFY, doorbell); 1923*4882a593Smuzhiyun } 1924*4882a593Smuzhiyun 1925*4882a593Smuzhiyun The dma_rmb() allows us guarantee the device has released ownership 1926*4882a593Smuzhiyun before we read the data from the descriptor, and the dma_wmb() allows 1927*4882a593Smuzhiyun us to guarantee the data is written to the descriptor before the device 1928*4882a593Smuzhiyun can see it now has ownership. Note that, when using writel(), a prior 1929*4882a593Smuzhiyun wmb() is not needed to guarantee that the cache coherent memory writes 1930*4882a593Smuzhiyun have completed before writing to the MMIO region. The cheaper 1931*4882a593Smuzhiyun writel_relaxed() does not provide this guarantee and must not be used 1932*4882a593Smuzhiyun here. 1933*4882a593Smuzhiyun 1934*4882a593Smuzhiyun See the subsection "Kernel I/O barrier effects" for more information on 1935*4882a593Smuzhiyun relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for 1936*4882a593Smuzhiyun more information on consistent memory. 1937*4882a593Smuzhiyun 1938*4882a593Smuzhiyun (*) pmem_wmb(); 1939*4882a593Smuzhiyun 1940*4882a593Smuzhiyun This is for use with persistent memory to ensure that stores for which 1941*4882a593Smuzhiyun modifications are written to persistent storage reached a platform 1942*4882a593Smuzhiyun durability domain. 1943*4882a593Smuzhiyun 1944*4882a593Smuzhiyun For example, after a non-temporal write to pmem region, we use pmem_wmb() 1945*4882a593Smuzhiyun to ensure that stores have reached a platform durability domain. This ensures 1946*4882a593Smuzhiyun that stores have updated persistent storage before any data access or 1947*4882a593Smuzhiyun data transfer caused by subsequent instructions is initiated. This is 1948*4882a593Smuzhiyun in addition to the ordering done by wmb(). 1949*4882a593Smuzhiyun 1950*4882a593Smuzhiyun For load from persistent memory, existing read memory barriers are sufficient 1951*4882a593Smuzhiyun to ensure read ordering. 1952*4882a593Smuzhiyun 1953*4882a593Smuzhiyun=============================== 1954*4882a593SmuzhiyunIMPLICIT KERNEL MEMORY BARRIERS 1955*4882a593Smuzhiyun=============================== 1956*4882a593Smuzhiyun 1957*4882a593SmuzhiyunSome of the other functions in the linux kernel imply memory barriers, amongst 1958*4882a593Smuzhiyunwhich are locking and scheduling functions. 
1959*4882a593Smuzhiyun 1960*4882a593SmuzhiyunThis specification is a _minimum_ guarantee; any particular architecture may 1961*4882a593Smuzhiyunprovide more substantial guarantees, but these may not be relied upon outside 1962*4882a593Smuzhiyunof arch specific code. 1963*4882a593Smuzhiyun 1964*4882a593Smuzhiyun 1965*4882a593SmuzhiyunLOCK ACQUISITION FUNCTIONS 1966*4882a593Smuzhiyun-------------------------- 1967*4882a593Smuzhiyun 1968*4882a593SmuzhiyunThe Linux kernel has a number of locking constructs: 1969*4882a593Smuzhiyun 1970*4882a593Smuzhiyun (*) spin locks 1971*4882a593Smuzhiyun (*) R/W spin locks 1972*4882a593Smuzhiyun (*) mutexes 1973*4882a593Smuzhiyun (*) semaphores 1974*4882a593Smuzhiyun (*) R/W semaphores 1975*4882a593Smuzhiyun 1976*4882a593SmuzhiyunIn all cases there are variants on "ACQUIRE" operations and "RELEASE" operations 1977*4882a593Smuzhiyunfor each construct. These operations all imply certain barriers: 1978*4882a593Smuzhiyun 1979*4882a593Smuzhiyun (1) ACQUIRE operation implication: 1980*4882a593Smuzhiyun 1981*4882a593Smuzhiyun Memory operations issued after the ACQUIRE will be completed after the 1982*4882a593Smuzhiyun ACQUIRE operation has completed. 1983*4882a593Smuzhiyun 1984*4882a593Smuzhiyun Memory operations issued before the ACQUIRE may be completed after 1985*4882a593Smuzhiyun the ACQUIRE operation has completed. 1986*4882a593Smuzhiyun 1987*4882a593Smuzhiyun (2) RELEASE operation implication: 1988*4882a593Smuzhiyun 1989*4882a593Smuzhiyun Memory operations issued before the RELEASE will be completed before the 1990*4882a593Smuzhiyun RELEASE operation has completed. 1991*4882a593Smuzhiyun 1992*4882a593Smuzhiyun Memory operations issued after the RELEASE may be completed before the 1993*4882a593Smuzhiyun RELEASE operation has completed. 1994*4882a593Smuzhiyun 1995*4882a593Smuzhiyun (3) ACQUIRE vs ACQUIRE implication: 1996*4882a593Smuzhiyun 1997*4882a593Smuzhiyun All ACQUIRE operations issued before another ACQUIRE operation will be 1998*4882a593Smuzhiyun completed before that ACQUIRE operation. 1999*4882a593Smuzhiyun 2000*4882a593Smuzhiyun (4) ACQUIRE vs RELEASE implication: 2001*4882a593Smuzhiyun 2002*4882a593Smuzhiyun All ACQUIRE operations issued before a RELEASE operation will be 2003*4882a593Smuzhiyun completed before the RELEASE operation. 2004*4882a593Smuzhiyun 2005*4882a593Smuzhiyun (5) Failed conditional ACQUIRE implication: 2006*4882a593Smuzhiyun 2007*4882a593Smuzhiyun Certain locking variants of the ACQUIRE operation may fail, either due to 2008*4882a593Smuzhiyun being unable to get the lock immediately, or due to receiving an unblocked 2009*4882a593Smuzhiyun signal while asleep waiting for the lock to become available. Failed 2010*4882a593Smuzhiyun locks do not imply any sort of barrier. 2011*4882a593Smuzhiyun 2012*4882a593Smuzhiyun[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only 2013*4882a593Smuzhiyunone-way barriers is that the effects of instructions outside of a critical 2014*4882a593Smuzhiyunsection may seep into the inside of the critical section. 
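
For example (a minimal sketch only; the lock and the variables are purely
illustrative):

	DEFINE_SPINLOCK(lock);
	int X, Y;

	void example(void)
	{
		WRITE_ONCE(X, 1);	/* before the critical section */
		spin_lock(&lock);	/* ACQUIRE: a one-way barrier */
		WRITE_ONCE(Y, 1);
		spin_unlock(&lock);	/* RELEASE: a one-way barrier */
		WRITE_ONCE(X, 2);	/* after the critical section */
	}

Another CPU may observe either store to X as if it had taken place inside
the critical section: the ACQUIRE only prevents accesses after it from
being completed before it, and the RELEASE only prevents accesses before it
from being completed after it, so accesses on either side may drift
inwards, but nothing inside the critical section can leak out of it.
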
2015*4882a593Smuzhiyun 2016*4882a593SmuzhiyunAn ACQUIRE followed by a RELEASE may not be assumed to be full memory barrier 2017*4882a593Smuzhiyunbecause it is possible for an access preceding the ACQUIRE to happen after the 2018*4882a593SmuzhiyunACQUIRE, and an access following the RELEASE to happen before the RELEASE, and 2019*4882a593Smuzhiyunthe two accesses can themselves then cross: 2020*4882a593Smuzhiyun 2021*4882a593Smuzhiyun *A = a; 2022*4882a593Smuzhiyun ACQUIRE M 2023*4882a593Smuzhiyun RELEASE M 2024*4882a593Smuzhiyun *B = b; 2025*4882a593Smuzhiyun 2026*4882a593Smuzhiyunmay occur as: 2027*4882a593Smuzhiyun 2028*4882a593Smuzhiyun ACQUIRE M, STORE *B, STORE *A, RELEASE M 2029*4882a593Smuzhiyun 2030*4882a593SmuzhiyunWhen the ACQUIRE and RELEASE are a lock acquisition and release, 2031*4882a593Smuzhiyunrespectively, this same reordering can occur if the lock's ACQUIRE and 2032*4882a593SmuzhiyunRELEASE are to the same lock variable, but only from the perspective of 2033*4882a593Smuzhiyunanother CPU not holding that lock. In short, a ACQUIRE followed by an 2034*4882a593SmuzhiyunRELEASE may -not- be assumed to be a full memory barrier. 2035*4882a593Smuzhiyun 2036*4882a593SmuzhiyunSimilarly, the reverse case of a RELEASE followed by an ACQUIRE does 2037*4882a593Smuzhiyunnot imply a full memory barrier. Therefore, the CPU's execution of the 2038*4882a593Smuzhiyuncritical sections corresponding to the RELEASE and the ACQUIRE can cross, 2039*4882a593Smuzhiyunso that: 2040*4882a593Smuzhiyun 2041*4882a593Smuzhiyun *A = a; 2042*4882a593Smuzhiyun RELEASE M 2043*4882a593Smuzhiyun ACQUIRE N 2044*4882a593Smuzhiyun *B = b; 2045*4882a593Smuzhiyun 2046*4882a593Smuzhiyuncould occur as: 2047*4882a593Smuzhiyun 2048*4882a593Smuzhiyun ACQUIRE N, STORE *B, STORE *A, RELEASE M 2049*4882a593Smuzhiyun 2050*4882a593SmuzhiyunIt might appear that this reordering could introduce a deadlock. 2051*4882a593SmuzhiyunHowever, this cannot happen because if such a deadlock threatened, 2052*4882a593Smuzhiyunthe RELEASE would simply complete, thereby avoiding the deadlock. 2053*4882a593Smuzhiyun 2054*4882a593Smuzhiyun Why does this work? 2055*4882a593Smuzhiyun 2056*4882a593Smuzhiyun One key point is that we are only talking about the CPU doing 2057*4882a593Smuzhiyun the reordering, not the compiler. If the compiler (or, for 2058*4882a593Smuzhiyun that matter, the developer) switched the operations, deadlock 2059*4882a593Smuzhiyun -could- occur. 2060*4882a593Smuzhiyun 2061*4882a593Smuzhiyun But suppose the CPU reordered the operations. In this case, 2062*4882a593Smuzhiyun the unlock precedes the lock in the assembly code. The CPU 2063*4882a593Smuzhiyun simply elected to try executing the later lock operation first. 2064*4882a593Smuzhiyun If there is a deadlock, this lock operation will simply spin (or 2065*4882a593Smuzhiyun try to sleep, but more on that later). The CPU will eventually 2066*4882a593Smuzhiyun execute the unlock operation (which preceded the lock operation 2067*4882a593Smuzhiyun in the assembly code), which will unravel the potential deadlock, 2068*4882a593Smuzhiyun allowing the lock operation to succeed. 2069*4882a593Smuzhiyun 2070*4882a593Smuzhiyun But what if the lock is a sleeplock? In that case, the code will 2071*4882a593Smuzhiyun try to enter the scheduler, where it will eventually encounter 2072*4882a593Smuzhiyun a memory barrier, which will force the earlier unlock operation 2073*4882a593Smuzhiyun to complete, again unraveling the deadlock. 
There might be 2074*4882a593Smuzhiyun a sleep-unlock race, but the locking primitive needs to resolve 2075*4882a593Smuzhiyun such races properly in any case. 2076*4882a593Smuzhiyun 2077*4882a593SmuzhiyunLocks and semaphores may not provide any guarantee of ordering on UP compiled 2078*4882a593Smuzhiyunsystems, and so cannot be counted on in such a situation to actually achieve 2079*4882a593Smuzhiyunanything at all - especially with respect to I/O accesses - unless combined 2080*4882a593Smuzhiyunwith interrupt disabling operations. 2081*4882a593Smuzhiyun 2082*4882a593SmuzhiyunSee also the section on "Inter-CPU acquiring barrier effects". 2083*4882a593Smuzhiyun 2084*4882a593Smuzhiyun 2085*4882a593SmuzhiyunAs an example, consider the following: 2086*4882a593Smuzhiyun 2087*4882a593Smuzhiyun *A = a; 2088*4882a593Smuzhiyun *B = b; 2089*4882a593Smuzhiyun ACQUIRE 2090*4882a593Smuzhiyun *C = c; 2091*4882a593Smuzhiyun *D = d; 2092*4882a593Smuzhiyun RELEASE 2093*4882a593Smuzhiyun *E = e; 2094*4882a593Smuzhiyun *F = f; 2095*4882a593Smuzhiyun 2096*4882a593SmuzhiyunThe following sequence of events is acceptable: 2097*4882a593Smuzhiyun 2098*4882a593Smuzhiyun ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE 2099*4882a593Smuzhiyun 2100*4882a593Smuzhiyun [+] Note that {*F,*A} indicates a combined access. 2101*4882a593Smuzhiyun 2102*4882a593SmuzhiyunBut none of the following are: 2103*4882a593Smuzhiyun 2104*4882a593Smuzhiyun {*F,*A}, *B, ACQUIRE, *C, *D, RELEASE, *E 2105*4882a593Smuzhiyun *A, *B, *C, ACQUIRE, *D, RELEASE, *E, *F 2106*4882a593Smuzhiyun *A, *B, ACQUIRE, *C, RELEASE, *D, *E, *F 2107*4882a593Smuzhiyun *B, ACQUIRE, *C, *D, RELEASE, {*F,*A}, *E 2108*4882a593Smuzhiyun 2109*4882a593Smuzhiyun 2110*4882a593Smuzhiyun 2111*4882a593SmuzhiyunINTERRUPT DISABLING FUNCTIONS 2112*4882a593Smuzhiyun----------------------------- 2113*4882a593Smuzhiyun 2114*4882a593SmuzhiyunFunctions that disable interrupts (ACQUIRE equivalent) and enable interrupts 2115*4882a593Smuzhiyun(RELEASE equivalent) will act as compiler barriers only. So if memory or I/O 2116*4882a593Smuzhiyunbarriers are required in such a situation, they must be provided from some 2117*4882a593Smuzhiyunother means. 2118*4882a593Smuzhiyun 2119*4882a593Smuzhiyun 2120*4882a593SmuzhiyunSLEEP AND WAKE-UP FUNCTIONS 2121*4882a593Smuzhiyun--------------------------- 2122*4882a593Smuzhiyun 2123*4882a593SmuzhiyunSleeping and waking on an event flagged in global data can be viewed as an 2124*4882a593Smuzhiyuninteraction between two pieces of data: the task state of the task waiting for 2125*4882a593Smuzhiyunthe event and the global data used to indicate the event. To make sure that 2126*4882a593Smuzhiyunthese appear to happen in the right order, the primitives to begin the process 2127*4882a593Smuzhiyunof going to sleep, and the primitives to initiate a wake up imply certain 2128*4882a593Smuzhiyunbarriers. 


SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event.  To make sure that
these appear to happen in the right order, the primitives to begin the process
of going to sleep, and the primitives to initiate a wake up imply certain
barriers.

Firstly, the sleeper normally follows something like this sequence of events:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (event_indicated)
			break;
		schedule();
	}

A general memory barrier is interpolated automatically by set_current_state()
after it has altered the task state:

	CPU 1
	===============================
	set_current_state();
	  smp_store_mb();
	    STORE current->state
	    <general barrier>
	LOAD event_indicated

set_current_state() may be wrapped by:

	prepare_to_wait();
	prepare_to_wait_exclusive();

which therefore also imply a general memory barrier after setting the state.
The whole sequence above is available in various canned forms, all of which
interpolate the memory barrier in the right place:

	wait_event();
	wait_event_interruptible();
	wait_event_interruptible_exclusive();
	wait_event_interruptible_timeout();
	wait_event_killable();
	wait_event_timeout();
	wait_on_bit();
	wait_on_bit_lock();


Secondly, code that performs a wake up normally follows something like this:

	event_indicated = 1;
	wake_up(&event_wait_queue);

or:

	event_indicated = 1;
	wake_up_process(event_daemon);

A general memory barrier is executed by wake_up() if it wakes something up.
If it doesn't wake anything up then a memory barrier may or may not be
executed; you must not rely on it.  The barrier occurs before the task state
is accessed.  In particular, it sits between the STORE to indicate the event
and the STORE to set TASK_RUNNING:

	CPU 1 (Sleeper)                 CPU 2 (Waker)
	=============================== ===============================
	set_current_state();            STORE event_indicated
	  smp_store_mb();               wake_up();
	    STORE current->state          ...
	    <general barrier>             <general barrier>
	LOAD event_indicated            if ((LOAD task->state) & TASK_NORMAL)
	                                  STORE task->state

where "task" is the thread being woken up and it equals CPU 1's "current".

To repeat, a general memory barrier is guaranteed to be executed by wake_up()
if something is actually awakened, but otherwise there is no such guarantee.
To see this, consider the following sequence of events, where X and Y are both
initially zero:

	CPU 1                           CPU 2
	=============================== ===============================
	X = 1;                          Y = 1;
	smp_mb();                       wake_up();
	LOAD Y                          LOAD X

If a wakeup does occur, one (at least) of the two loads must see 1.  If, on
the other hand, a wakeup does not occur, both loads might see 0.

wake_up_process() always executes a general memory barrier.  The barrier again
occurs before the task state is accessed.  In particular, if the wake_up() in
the previous snippet were replaced by a call to wake_up_process() then one of
the two loads would be guaranteed to see 1.

The available waker functions include:

	complete();
	wake_up();
	wake_up_all();
	wake_up_bit();
	wake_up_interruptible();
	wake_up_interruptible_all();
	wake_up_interruptible_nr();
	wake_up_interruptible_poll();
	wake_up_interruptible_sync();
	wake_up_interruptible_sync_poll();
	wake_up_locked();
	wake_up_locked_poll();
	wake_up_nr();
	wake_up_poll();
	wake_up_process();

In terms of memory ordering, these functions all provide the same guarantees
as a wake_up() (or stronger).

[!] Note that the memory barriers implied by the sleeper and the waker do _not_
order multiple stores before the wake-up with respect to loads of those stored
values after the sleeper has called set_current_state().  For instance, if the
sleeper does:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated)
		break;
	__set_current_state(TASK_RUNNING);
	do_something(my_data);

and the waker does:

	my_data = value;
	event_indicated = 1;
	wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be perceived by
the sleeper as coming after the change to my_data.  In such a circumstance, the
code on both sides must interpolate its own memory barriers between the
separate data accesses.  Thus the above sleeper ought to do:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated) {
		smp_rmb();
		do_something(my_data);
	}

and the waker should do:

	my_data = value;
	smp_wmb();
	event_indicated = 1;
	wake_up(&event_wait_queue);
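
The same pairing can be written with the canned forms; as a minimal sketch
(the wait queue head and flag are hypothetical names, not from any real
subsystem), note that the explicit barriers protecting my_data are still
required:

	static DECLARE_WAIT_QUEUE_HEAD(event_wait_queue);

	/* waker */
	my_data = value;
	smp_wmb();			/* order my_data before the flag */
	WRITE_ONCE(event_indicated, 1);
	wake_up(&event_wait_queue);

	/* sleeper */
	wait_event(event_wait_queue, READ_ONCE(event_indicated));
	smp_rmb();			/* order the flag before my_data */
	do_something(my_data);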


MISCELLANEOUS FUNCTIONS
-----------------------

Other functions that imply barriers:

 (*) schedule() and similar imply full memory barriers.


===================================
INTER-CPU ACQUIRING BARRIER EFFECTS
===================================

On SMP systems locking primitives give a more substantial form of barrier: one
that does affect memory access ordering on other CPUs, within the context of
conflict on any particular lock.


ACQUIRES VS MEMORY ACCESSES
---------------------------

Consider the following: the system has a pair of spinlocks (M) and (Q), and
three CPUs; then should the following sequence of events occur:

	CPU 1                           CPU 2
	=============================== ===============================
	WRITE_ONCE(*A, a);              WRITE_ONCE(*E, e);
	ACQUIRE M                       ACQUIRE Q
	WRITE_ONCE(*B, b);              WRITE_ONCE(*F, f);
	WRITE_ONCE(*C, c);              WRITE_ONCE(*G, g);
	RELEASE M                       RELEASE Q
	WRITE_ONCE(*D, d);              WRITE_ONCE(*H, h);

Then there is no guarantee as to what order CPU 3 will see the accesses to *A
through *H occur in, other than the constraints imposed by the separate locks
on the separate CPUs.  It might, for example, see:

	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M

But it won't see any of:

	*B, *C or *D preceding ACQUIRE M
	*A, *B or *C following RELEASE M
	*F, *G or *H preceding ACQUIRE Q
	*E, *F or *G following RELEASE Q


=================================
WHERE ARE MEMORY BARRIERS NEEDED?
=================================

Under normal operation, memory operation reordering is generally not going to
be a problem as a single-threaded linear piece of code will still appear to
work correctly, even if it's in an SMP kernel.
There are, however, four circumstances in which reordering definitely _could_
be a problem:

 (*) Interprocessor interaction.

 (*) Atomic operations.

 (*) Accessing devices.

 (*) Interrupts.


INTERPROCESSOR INTERACTION
--------------------------

When there's a system with more than one processor, more than one CPU in the
system may be working on the same data set at the same time.  This can cause
synchronisation problems, and the usual way of dealing with them is to use
locks.  Locks, however, are quite expensive, and so it may be preferable to
operate without the use of a lock if at all possible.  In such a case
operations that affect both CPUs may have to be carefully ordered to prevent
a malfunction.

Consider, for example, the R/W semaphore slow path.  Here a waiting process is
queued on the semaphore, by virtue of it having a piece of its stack linked to
the semaphore's list of waiting processes:

	struct rw_semaphore {
		...
		spinlock_t lock;
		struct list_head waiters;
	};

	struct rwsem_waiter {
		struct list_head list;
		struct task_struct *task;
	};

To wake up a particular waiter, the up_read() or up_write() functions have to:

 (1) read the next pointer from this waiter's record to know where the next
     waiter record is;

 (2) read the pointer to the waiter's task structure;

 (3) clear the task pointer to tell the waiter it has been given the semaphore;

 (4) call wake_up_process() on the task; and

 (5) release the reference held on the waiter's task struct.

In other words, it has to perform this sequence of events:

	LOAD waiter->list.next;
	LOAD waiter->task;
	STORE waiter->task;
	CALL wakeup
	RELEASE task

and if any of these steps occur out of order, then the whole thing may
malfunction.

Once it has queued itself and dropped the semaphore lock, the waiter does not
get the lock again; it instead just waits for its task pointer to be cleared
before proceeding.
Since the record is on the waiter's stack, this means that if the task pointer
is cleared _before_ the next pointer in the list is read, another CPU might
start processing the waiter and might clobber the waiter's stack before the
up*() function has a chance to read the next pointer.

Consider then what might happen to the above sequence of events:

	CPU 1                           CPU 2
	=============================== ===============================
	                                down_xxx()
	                                Queue waiter
	                                Sleep
	up_yyy()
	LOAD waiter->task;
	STORE waiter->task;
	                                Woken up by other event
	<preempt>
	                                Resume processing
	                                down_xxx() returns
	                                call foo()
	                                foo() clobbers *waiter
	</preempt>
	LOAD waiter->list.next;
	--- OOPS ---

This could be dealt with using the semaphore lock, but then the down_xxx()
function has to needlessly get the spinlock again after being woken up.

The way to deal with this is to insert a general SMP memory barrier:

	LOAD waiter->list.next;
	LOAD waiter->task;
	smp_mb();
	STORE waiter->task;
	CALL wakeup
	RELEASE task

In this case, the barrier makes a guarantee that all memory accesses before the
barrier will appear to happen before all the memory accesses after the barrier
with respect to the other CPUs on the system.  It does _not_ guarantee that all
the memory accesses before the barrier will be complete by the time the barrier
instruction itself is complete.

On a UP system - where this wouldn't be a problem - the smp_mb() is just a
compiler barrier, thus making sure the compiler emits the instructions in the
right order without actually intervening in the CPU.  Since there's only one
CPU, that CPU's dependency ordering logic will take care of everything else.


ATOMIC OPERATIONS
-----------------

While they are technically interprocessor interaction considerations, atomic
operations are noted specially as some of them imply full memory barriers and
some don't, but they're very heavily relied on as a group throughout the
kernel.

See Documentation/atomic_t.txt for more information.
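
As a minimal sketch of that distinction (the object and its fields are purely
illustrative), an RMW atomic that returns no value is unordered, whereas a
value-returning RMW atomic implies a full memory barrier:

	atomic_inc(&obj->uses);				/* no implied barrier */

	if (atomic_dec_and_test(&obj->refcount))	/* fully-ordered RMW */
		kfree(obj);		/* all prior accesses to *obj are
					 * ordered before the final decrement */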


ACCESSING DEVICES
-----------------

Many devices can be memory mapped, and so appear to the CPU as if they're just
a set of memory locations.  To control such a device, the driver usually has
to make the right memory accesses in exactly the right order.

However, having a clever CPU or a clever compiler creates a potential problem
in that the carefully sequenced accesses in the driver code won't reach the
device in the requisite order if the CPU or the compiler thinks it is more
efficient to reorder, combine or merge accesses - something that would cause
the device to malfunction.

Inside of the Linux kernel, I/O should be done through the appropriate accessor
routines - such as inb() or writel() - which know how to make such accesses
appropriately sequential.  While this, for the most part, renders the explicit
use of memory barriers unnecessary, if the accessor functions are used to refer
to an I/O memory window with relaxed memory access properties, then _mandatory_
memory barriers are required to enforce ordering.

See Documentation/driver-api/device-io.rst for more information.


INTERRUPTS
----------

A driver may be interrupted by its own interrupt service routine, and thus the
two parts of the driver may interfere with each other's attempts to control or
access the device.

This may be alleviated - at least in part - by disabling local interrupts (a
form of locking), such that the critical operations are all contained within
the interrupt-disabled section in the driver.  While the driver's interrupt
routine is executing, the driver's core may not run on the same CPU, and its
interrupt is not permitted to happen again until the current interrupt has been
handled, thus the interrupt handler does not need to lock against that.

However, consider a driver that was talking to an ethernet card that sports an
address register and a data register.
If that driver's core talks to the card under interrupt-disablement and then
the driver's interrupt handler is invoked:

	LOCAL IRQ DISABLE
	writew(ADDR, 3);
	writew(DATA, y);
	LOCAL IRQ ENABLE
	<interrupt>
	writew(ADDR, 4);
	q = readw(DATA);
	</interrupt>

The store to the data register might happen after the second store to the
address register if ordering rules are sufficiently relaxed:

	STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA


If ordering rules are relaxed, it must be assumed that accesses done inside an
interrupt disabled section may leak outside of it and may interleave with
accesses performed in an interrupt - and vice versa - unless implicit or
explicit barriers are used.

Normally this won't be a problem because the I/O accesses done inside such
sections will include synchronous load operations on strictly ordered I/O
registers that form implicit I/O barriers.


A similar situation may occur between an interrupt routine and two routines
running on separate CPUs that communicate with each other.  If such a case is
likely, then interrupt-disabling locks should be used to guarantee ordering.


==========================
KERNEL I/O BARRIER EFFECTS
==========================

Interfacing with peripherals via I/O accesses is deeply architecture and device
specific.  Therefore, drivers which are inherently non-portable may rely on
specific behaviours of their target systems in order to achieve synchronization
in the most lightweight manner possible.  For drivers intending to be portable
between multiple architectures and bus implementations, the kernel offers a
series of accessor functions that provide various degrees of ordering
guarantees:

 (*) readX(), writeX():

	The readX() and writeX() MMIO accessors take a pointer to the
	peripheral being accessed as an __iomem * parameter.  For pointers
	mapped with the default I/O attributes (e.g. those returned by
	ioremap()), the ordering guarantees are as follows:

	1. All readX() and writeX() accesses to the same peripheral are ordered
	   with respect to each other.  This ensures that MMIO register accesses
	   by the same CPU thread to a particular device will arrive in program
	   order.

	2. A writeX() issued by a CPU thread holding a spinlock is ordered
	   before a writeX() to the same peripheral from another CPU thread
	   issued after a later acquisition of the same spinlock.  This ensures
	   that MMIO register writes to a particular device issued while holding
	   a spinlock will arrive in an order consistent with acquisitions of
	   the lock.

	3. A writeX() by a CPU thread to the peripheral will first wait for the
	   completion of all prior writes to memory either issued by, or
	   propagated to, the same thread.  This ensures that writes by the CPU
	   to an outbound DMA buffer allocated by dma_alloc_coherent() will be
	   visible to a DMA engine when the CPU writes to its MMIO control
	   register to trigger the transfer.

	4. A readX() by a CPU thread from the peripheral will complete before
	   any subsequent reads from memory by the same thread can begin.  This
	   ensures that reads by the CPU from an incoming DMA buffer allocated
	   by dma_alloc_coherent() will not see stale data after reading from
	   the DMA engine's MMIO status register to establish that the DMA
	   transfer has completed.  (A sketch of this DMA pattern appears after
	   this list.)

	5. A readX() by a CPU thread from the peripheral will complete before
	   any subsequent delay() loop can begin execution on the same thread.
	   This ensures that two MMIO register writes by the CPU to a peripheral
	   will arrive at least 1us apart if the first write is immediately read
	   back with readX() and udelay(1) is called prior to the second
	   writeX():

		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
		readl(DEVICE_REGISTER_0);
		udelay(1);
		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

	The ordering properties of __iomem pointers obtained with non-default
	attributes (e.g. those returned by ioremap_wc()) are specific to the
	underlying architecture and therefore the guarantees listed above cannot
	generally be relied upon for accesses to these types of mappings.

 (*) readX_relaxed(), writeX_relaxed():

	These are similar to readX() and writeX(), but provide weaker memory
	ordering guarantees.  Specifically, they do not guarantee ordering with
	respect to locking, normal memory accesses or delay() loops (i.e.
	bullets 2-5 above) but they are still guaranteed to be ordered with
	respect to other accesses from the same CPU thread to the same
	peripheral when operating on __iomem pointers mapped with the default
	I/O attributes.

 (*) readsX(), writesX():

	The readsX() and writesX() MMIO accessors are designed for accessing
	register-based, memory-mapped FIFOs residing on peripherals that are not
	capable of performing DMA.  Consequently, they provide only the ordering
	guarantees of readX_relaxed() and writeX_relaxed(), as documented above.

 (*) inX(), outX():

	The inX() and outX() accessors are intended to access legacy port-mapped
	I/O peripherals, which may require special instructions on some
	architectures (notably x86).  The port number of the peripheral being
	accessed is passed as an argument.

	Since many CPU architectures ultimately access these peripherals via an
	internal virtual memory mapping, the portable ordering guarantees
	provided by inX() and outX() are the same as those provided by readX()
	and writeX() respectively when accessing a mapping with the default I/O
	attributes.

	Device drivers may expect outX() to emit a non-posted write transaction
	that waits for a completion response from the I/O peripheral before
	returning.  This is not guaranteed by all architectures and is therefore
	not part of the portable ordering semantics.

 (*) insX(), outsX():

	As above, the insX() and outsX() accessors provide the same ordering
	guarantees as readsX() and writesX() respectively when accessing a
	mapping with the default I/O attributes.

 (*) ioreadX(), iowriteX():

	These will perform appropriately for the type of access they're actually
	doing, be it inX()/outX() or readX()/writeX().

With the exception of the string accessors (insX(), outsX(), readsX() and
writesX()), all of the above assume that the underlying peripheral is
little-endian and will therefore perform byte-swapping operations on big-endian
architectures.
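
As a sketch of the DMA pattern referred to in guarantees 3 and 4 above (the
device registers, bit names and buffer layout are hypothetical), no explicit
barriers are needed when the default accessors are used:

	/* "buf" came from dma_alloc_coherent() and "regs" from a
	 * default-attribute ioremap() of the device's registers. */

	buf[0] = cpu_to_le32(command);		/* fill the outbound DMA buffer */
	writel(CTRL_START, regs + CTRL);	/* guarantee 3: ordered after the
						 * buffer write above */

	while (!(readl(regs + STATUS) & STATUS_DONE))
		cpu_relax();			/* guarantee 4: each readl()
						 * completes before the buffer
						 * read below */

	result = le32_to_cpu(buf[1]);		/* sees the DMA'd data, not stale data */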


========================================
ASSUMED MINIMUM EXECUTION ORDERING MODEL
========================================

It has to be assumed that the conceptual CPU is weakly-ordered but that it will
maintain the appearance of program causality with respect to itself.  Some CPUs
(such as i386 or x86_64) are more constrained than others (such as powerpc or
frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
of arch-specific code.

This means that it must be considered that the CPU will execute its instruction
stream in any order it feels like - or even in parallel - provided that if an
instruction in the stream depends on an earlier instruction, then that
earlier instruction must be sufficiently complete[*] before the later
instruction may proceed; in other words: provided that the appearance of
causality is maintained.

 [*] Some instructions have more than one effect - such as changing the
     condition codes, changing registers or changing memory - and different
     instructions may depend on different effects.

A CPU may also discard any instruction sequence that winds up having no
ultimate effect.  For example, if two adjacent instructions both load an
immediate value into the same register, the first may be discarded.


Similarly, it has to be assumed that the compiler might reorder the instruction
stream in any way it sees fit, again provided the appearance of causality is
maintained.


============================
THE EFFECTS OF THE CPU CACHE
============================

The way cached memory operations are perceived across the system is affected to
a certain extent by the caches that lie between CPUs and memory, and by the
memory coherence system that maintains the consistency of state in the system.

As far as the way a CPU interacts with another part of the system through the
caches goes, the memory system has to include the CPU's caches, and memory
barriers for the most part act at the interface between the CPU and its cache
(memory barriers logically act on the dotted line in the following diagram):

	    <--- CPU --->         :       <----------- Memory ----------->
	                          :
	+--------+    +--------+  :   +--------+    +-----------+
	|        |    |        |  :   |        |    |           |    +--------+
	|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
	|        |    | Queue  |  :   |        |    |           |--->| Memory |
	|        |    |        |  :   |        |    |           |    |        |
	+--------+    +--------+  :   +--------+    |           |    |        |
	                          :                 | Cache     |    +--------+
	                          :                 | Coherency |
	                          :                 | Mechanism |    +--------+
	+--------+    +--------+  :   +--------+    |           |    |        |
	|        |    |        |  :   |        |    |           |    |        |
	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
	|        |    | Queue  |  :   |        |    |           |    |        |
	|        |    |        |  :   |        |    |           |    +--------+
	+--------+    +--------+  :   +--------+    +-----------+
	                          :
	                          :

Although any particular load or store may not actually appear outside of the
CPU that issued it since it may have been satisfied within the CPU's own cache,
it will still appear as if the full memory access had taken place as far as the
other CPUs are concerned since the cache coherency mechanisms will migrate the
cacheline over to the accessing CPU and propagate the effects upon conflict.

The CPU core may execute instructions in any order it deems fit, provided the
expected program causality appears to be maintained.  Some of the instructions
generate load and store operations which then go into the queue of memory
accesses to be performed.  The core may place these in the queue in any order
it wishes, and continue execution until it is forced to wait for an instruction
to complete.

What memory barriers are concerned with is controlling the order in which
accesses cross from the CPU side of things to the memory side of things, and
the order in which the effects are perceived to happen by the other observers
in the system.

[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
their own loads and stores as if they had happened in program order.

[!] MMIO or other device accesses may bypass the cache system.  This depends on
the properties of the memory window through which devices are accessed and/or
the use of any special device communication instructions the CPU may have.


CACHE COHERENCY VS DMA
----------------------

Not all systems maintain cache coherency with respect to devices doing DMA.  In
such cases, a device attempting DMA may obtain stale data from RAM because
dirty cache lines may be resident in the caches of various CPUs, and may not
have been written back to RAM yet.  To deal with this, the appropriate part of
the kernel must flush the overlapping bits of cache on each CPU (and maybe
invalidate them as well).

In addition, the data DMA'd to RAM by a device may be overwritten by dirty
cache lines being written back to RAM from a CPU's cache after the device has
installed its own data, or cache lines present in the CPU's cache may simply
obscure the fact that RAM has been updated, until such time as the cacheline
is discarded from the CPU's cache and reloaded.  To deal with this, the
appropriate part of the kernel must invalidate the overlapping bits of the
cache on each CPU.

See Documentation/core-api/cachetlb.rst for more information on cache
management.
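
In practice, drivers usually arrange this through the DMA mapping API rather
than by flushing caches directly.  A minimal sketch, assuming a hypothetical
device "dev" that reads a kernel buffer through a streaming mapping (the
mapping calls perform whatever cache maintenance the architecture requires):

	dma_addr_t handle;

	handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, handle))
		return -ENOMEM;

	/* ... point the device at "handle" and start the transfer ... */

	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);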


CACHE COHERENCY VS MMIO
-----------------------

Memory mapped I/O usually takes place through memory locations that are part of
a window in the CPU's memory space that has different properties assigned than
the usual RAM directed window.

Amongst these properties is usually the fact that such accesses bypass the
caching entirely and go directly to the device buses.  This means MMIO accesses
may, in effect, overtake accesses to cached memory that were emitted earlier.
A memory barrier isn't sufficient in such a case, but rather the cache must be
flushed between the cached memory write and the MMIO access if the two are in
any way dependent.


=========================
THE THINGS CPUS GET UP TO
=========================

A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if the CPU is, for example,
given the following piece of code to execute:

	a = READ_ONCE(*A);
	WRITE_ONCE(*B, b);
	c = READ_ONCE(*C);
	d = READ_ONCE(*D);
	WRITE_ONCE(*E, e);

they would then expect that the CPU will complete the memory operation for each
instruction before moving on to the next one, leading to a definite sequence of
operations as seen by external observers in the system:

	LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.


Reality is, of course, much messier.  With many CPUs and compilers, the above
assumption doesn't hold because:

 (*) loads are more likely to need to be completed immediately to permit
     execution progress, whereas stores can often be deferred without a
     problem;

 (*) loads may be done speculatively, and the result discarded should it prove
     to have been unnecessary;

 (*) loads may be done speculatively, leading to the result having been fetched
     at the wrong time in the expected sequence of events;

 (*) the order of the memory accesses may be rearranged to promote better use
     of the CPU buses and caches;

 (*) loads and stores may be combined to improve performance when talking to
     memory or I/O hardware that can do batched accesses of adjacent locations,
     thus cutting down on transaction setup costs (memory and PCI devices may
     both be able to do this); and

 (*) the CPU's data cache may affect the ordering, and while cache-coherency
     mechanisms may alleviate this - once the store has actually hit the cache
     - there's no guarantee that the coherency management will be propagated in
     order to other CPUs.

So what another CPU, say, might actually observe from the above piece of code
is:

	LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B

	(Where "LOAD {*C,*D}" is a combined load)


However, it is guaranteed that a CPU will be self-consistent: it will see its
_own_ accesses appear to be correctly ordered, without the need for a memory
barrier.
For instance with the following code:

	U = READ_ONCE(*A);
	WRITE_ONCE(*A, V);
	WRITE_ONCE(*A, W);
	X = READ_ONCE(*A);
	WRITE_ONCE(*A, Y);
	Z = READ_ONCE(*A);

and assuming no intervention by an external influence, it can be assumed that
the final result will appear to be:

	U == the original value of *A
	X == W
	Z == Y
	*A == Y

The code above may cause the CPU to generate the full sequence of memory
accesses:

	U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A

in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view
of the world remains consistent.  Note that READ_ONCE() and WRITE_ONCE()
are -not- optional in the above example, as there are architectures
where a given CPU might reorder successive loads to the same location.
On such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
necessary to prevent this, for example, on Itanium the volatile casts
used by READ_ONCE() and WRITE_ONCE() cause GCC to emit the special ld.acq
and st.rel instructions (respectively) that prevent such reordering.

The compiler may also combine, discard or defer elements of the sequence before
the CPU even sees them.

For instance:

	*A = V;
	*A = W;

may be reduced to:

	*A = W;

since, without either a write barrier or a WRITE_ONCE(), it can be
assumed that the effect of the storage of V to *A is lost.  Similarly:

	*A = Y;
	Z = *A;

may, without a memory barrier or a READ_ONCE() and WRITE_ONCE(), be
reduced to:

	*A = Y;
	Z = Y;

and the LOAD operation never appears outside of the CPU.


AND THEN THERE'S THE ALPHA
--------------------------

The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
some versions of the Alpha CPU have a split data cache, permitting them to have
two semantically-related cache lines updated at separate times.
This is where the data dependency barrier really becomes necessary as this
synchronises both caches with the memory coherence system, thus making it seem
like pointer changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory model, although as of v4.15
the Linux kernel's addition of smp_mb() to READ_ONCE() on Alpha greatly
reduced its impact on the memory model.


VIRTUAL MACHINE GUESTS
----------------------

Guests running within virtual machines might be affected by SMP effects even if
the guest itself is compiled without SMP support.  This is an artifact of
interfacing with an SMP host while running an UP kernel.  Using mandatory
barriers for this use-case would be possible but is often suboptimal.

To handle this case optimally, low-level virt_mb() etc macros are available.
These have the same effect as smp_mb() etc when SMP is enabled, but generate
identical code for SMP and non-SMP systems.  For example, virtual machine
guests should use virt_mb() rather than smp_mb() when synchronizing against a
(possibly SMP) host.

These are equivalent to smp_mb() etc counterparts in all other respects,
in particular, they do not control MMIO effects: to control
MMIO effects, use mandatory barriers.
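
As a minimal sketch (the ring structure and its fields are illustrative, not
a real virtio API), a guest publishing a descriptor to a possibly-SMP host
might use the virt_*() barriers like this:

	ring->desc[idx] = desc;		/* fill in the descriptor */
	virt_wmb();			/* order it before publishing the index,
					 * even in a guest built without SMP */
	ring->avail_idx = idx + 1;	/* the host may now observe the descriptor */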


============
EXAMPLE USES
============

CIRCULAR BUFFERS
----------------

Memory barriers can be used to implement circular buffering without the need
of a lock to serialise the producer with the consumer.  See:

	Documentation/core-api/circular-buffers.rst

for details.


==========
REFERENCES
==========

Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
Digital Press)
	Chapter 5.2: Physical Address Space Characteristics
	Chapter 5.4: Caches and Write Buffers
	Chapter 5.5: Data Sharing
	Chapter 5.6: Read/Write Ordering

AMD64 Architecture Programmer's Manual Volume 2: System Programming
	Chapter 7.1: Memory-Access Ordering
	Chapter 7.4: Buffering and Combining Memory Writes

ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
	Chapter B2: The AArch64 Application Level Memory Model

IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide
	Chapter 7.1: Locked Atomic Operations
	Chapter 7.2: Memory Ordering
	Chapter 7.4: Serializing Instructions

The SPARC Architecture Manual, Version 9
	Chapter 8: Memory Models
	Appendix D: Formal Specification of the Memory Models
	Appendix J: Programming with the Memory Models

Storage in the PowerPC (Stone and Fitzgerald)

UltraSPARC Programmer Reference Manual
	Chapter 5: Memory Accesses and Cacheability
	Chapter 15: Sparc-V9 Memory Models

UltraSPARC III Cu User's Manual
	Chapter 9: Memory Models

UltraSPARC IIIi Processor User's Manual
	Chapter 8: Memory Models

UltraSPARC Architecture 2005
	Chapter 9: Memory
	Appendix D: Formal Specifications of the Memory Models

UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
	Chapter 8: Memory Models
	Appendix F: Caches and Cache Coherency

Solaris Internals, Core Kernel Architecture, p63-68:
	Chapter 3.3: Hardware Considerations for Locks and
			Synchronization

Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
for Kernel Programmers:
	Chapter 13: Other Memory Models

Intel Itanium Architecture Software Developer's Manual: Volume 1:
	Section 2.6: Speculation
	Section 4.4: Memory Access