Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <stern@rowland.harvard.edu>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
  10. THE READS-FROM RELATION: rf, rfi, and rfe
  11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
  12. THE FROM-READS RELATION: fr, fri, and fre
  13. AN OPERATIONAL MODEL
  14. PROPAGATION ORDER RELATION: cumul-fence
  15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
  16. SEQUENTIAL CONSISTENCY PER VARIABLE
  17. ATOMIC UPDATES: rmw
  18. THE PRESERVED PROGRAM ORDER RELATION: ppo
  19. AND THEN THERE WAS ALPHA
  20. THE HAPPENS-BEFORE RELATION: hb
  21. THE PROPAGATES-BEFORE RELATION: pb
  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
  23. LOCKING
  24. PLAIN ACCESSES AND DATA RACES
  25. ODDS AND ENDS



INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure.  This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.

This document describes the ideas underlying the LKMM.  It is meant
for people who want to understand how the model was designed.  It does
not go into the details of the code in the .bell and .cat files;
rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
the basic notions shared by all such models.  People already familiar
with these concepts can skim or skip over them.  Sections 6 (EVENTS)
through 12 (THE FROM-READS RELATION) describe the fundamental
relations used in many models.  Starting in Section 13 (AN OPERATIONAL
MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the
proper format for litmus tests.  They don't include a header line, the
initializations are not enclosed in braces, the global variables are
not passed by pointers, and they don't have an "exists" clause at the
end.  Converting them to the right format is left as an exercise for
the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is
something which predicts, given a piece of computer code running on a
particular kind of system, what values may be obtained by the code's
load instructions.  The LKMM makes these predictions for code running
as part of the Linux kernel.

In practice, people tend to use memory models the other way around.
That is, given a piece of code and a collection of values specified
for the loads, the model will predict whether it is possible for the
code to run in such a way that the loads will indeed obtain the
specified values.  Of course, this is just another way of expressing
the same idea.

For code running on a uniprocessor system, the predictions are easy:
Each load instruction must obtain the value written by the most recent
store instruction accessing the same location (we ignore complicating
factors such as DMA and mixed-size accesses).  But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux
kernel supports a variety of architectures.  The LKMM has to be fairly
permissive, in the sense that any behavior allowed by one of these
architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts.  Consider
some code running as part of a device driver for an input device.  The
driver might contain an interrupt handler which collects data from the
device, stores it in a buffer, and sets a flag to indicate the buffer
is full.  Running concurrently on a different CPU might be a part of
the driver code being executed by a process in the midst of a read(2)
system call.  This code tests the flag to see whether the buffer is
ready, and if it is, copies the data back to userspace.  The buffer
and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
assignment statements is discussed later):

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}

Here the P0() function represents the interrupt handler running on one
CPU and P1() represents the read() routine running on another.  The
value 1 stored in buf represents input data collected from the device.
Thus, P0 stores the data in buf and then sets flag.  Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
data from buf into a second private variable r2 for copying to
userspace.  (Presumably if flag is not set then the driver will wait a
while and try again.)

This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" or MP
pattern.  It is typical of memory access patterns in the kernel.

Please note that this example code is a simplified abstraction.  Real
buffers are usually larger than a single integer, real device drivers
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
private variables before using them.  All that is beside the point;
the idea here is simply to illustrate the overall pattern of memory
accesses by the CPUs.
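
For reference, here is a rough sketch of how this MP pattern could be
written in the proper litmus-test format mentioned in the
Introduction.  The test name and the "exists" clause are illustrative
additions, and P1's conditional load is made unconditional to keep the
sketch simple:

	C MP

	{}

	P0(int *buf, int *flag)
	{
		WRITE_ONCE(*buf, 1);
		WRITE_ONCE(*flag, 1);
	}

	P1(int *buf, int *flag)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*flag);
		r2 = READ_ONCE(*buf);
	}

	exists (1:r1=1 /\ 1:r2=0)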

A memory model will predict what values P1 might obtain for its loads
from flag and buf, or equivalently, what values r1 and r2 might end up
with after the code has finished running.

Some predictions are trivial.  For instance, no sane memory model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple.  For
instance, P1 might run entirely before P0 begins, in which case r1 and
r2 will both be 0 at the end.  Or P0 might run entirely before P1
begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two
routines run concurrently.  One possibility is that P1 runs after P0's
store to buf but before the store to flag.  In this case, r1 and r2
will again both be 0.  (If P1 had been designed to read buf
unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0.
If this were to occur it would mean the driver contains a bug, because
incorrect data would get sent to the user: 0 instead of 1.  As it
happens, the LKMM does predict this outcome can occur, and the example
driver code shown above is indeed buggy.
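
As a preview of material covered later in this document, one common
way to repair this kind of bug (a sketch, not part of the original
driver example) is to use a release store and an acquire load for
flag, which forbids the r1 = 1, r2 = 0 outcome under the LKMM:

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_store_release(&flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = smp_load_acquire(&flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}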


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand,
is Sequential Consistency.  According to this model, systems behave as
if each CPU executed its instructions in order but with unspecified
timing.  In other words, the instructions from the various CPUs get
interleaved in a nondeterministic way, always according to some single
global order that agrees with the order of the instructions in the
program source for each CPU.  The model says that the value obtained
by each load is simply the value written by the most recently executed
store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts
that the undesired result r1 = 1, r2 = 0 cannot occur.  The reasoning
goes like this:

	Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from
	it, as loads can obtain values only from earlier stores.

	P1 loads from flag before loading from buf, since CPUs execute
	their instructions in order.

	P1 must load 0 from buf before P0 stores 1 to it; otherwise r2
	would be 1 since a load obtains its value from the most recent
	store to the same address.

	P0 stores 1 to buf before storing 1 to flag, since it executes
	its instructions in order.

	Since an instruction (in this case, P0's store to flag) cannot
	execute before itself, the specified outcome is impossible.

However, real computer hardware almost never follows the Sequential
Consistency memory model; doing so would rule out too many valuable
performance optimizations.  On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
r2 = 0.

x86 and SPARC follow yet another memory model: TSO (Total Store
Ordering).  This model predicts that the undesired outcome for the MP
pattern cannot occur, but in other respects it differs from Sequential
Consistency.  One example is the Store Buffer (SB) pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
	}

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is
impossible.  (Exercise: Figure out the reasoning.)  But TSO allows
this outcome to occur, and in fact it does sometimes occur on x86 and
SPARC systems.

The LKMM was inspired by the memory models followed by PowerPC, ARM,
x86, Alpha, and other architectures.  However, it is different in
detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering.  Often this is temporal ordering
(i.e., the order in which certain events occur) but it doesn't have to
be; consider for example the order of instructions in a program's
source code.  We saw above that Sequential Consistency makes an
important assumption that CPUs execute instructions in the same order
as those instructions occur in the code, and there are many other
instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle.  Ordering rules out cycles:
It's not possible to have X ordered before Y, Y ordered before Z, and
Z ordered before X, because this would mean that X is ordered before
itself.  The analysis of the MP example under Sequential Consistency
involved just such an impossible cycle:

	W: P0 stores 1 to flag   executes before
	X: P1 loads 1 from flag  executes before
	Y: P1 loads 0 from buf   executes before
	Z: P0 stores 1 to buf    executes before
	W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered,
and a certain outcome for the loads in a piece of code can happen only
if those accesses would form a cycle, then the memory model predicts
that outcome cannot occur.

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up
kernel source code.  Instead it considers the effects of those
statements in a more abstract form, namely, events.  The model
includes three types of events:

	Read events correspond to loads from shared memory, such as
	calls to READ_ONCE(), smp_load_acquire(), or
	rcu_dereference().

	Write events correspond to stores to shared memory, such as
	calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

	Fence events correspond to memory barriers (also known as
	fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be
a fence.  This happens with functions like smp_load_acquire() or
spin_lock().  However, no single event can be both a read and a write.
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
correspond to a pair of events: a read followed by a write.  (The
write event is omitted for executions where it doesn't occur, such as
a cmpxchg() where the comparison fails.)
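
For instance, given a shared variable x, a sketch like

	r1 = cmpxchg(&x, 0, 1);

always produces a read event for the load of x, but the write event
(storing 1) is present only in executions where the comparison
succeeds, that is, where the value loaded was 0.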

Other parts of the code, those which do not involve interaction with
shared memory, do not give rise to events.  Thus, arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers are not of central interest to the
memory model.  They only affect the model's predictions indirectly.
For example, an arithmetic computation might determine the value that
gets stored to a shared memory location (or in the case of an array
index, the address where the value gets stored), but the memory model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will
describe in the following sections.  The memory model requires certain
of these relations to be orderings, that is, it requires them not to
have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po).  You
can think of it as the order in which statements occur in the source
code after branches are taken into account and loops have been
unrolled.  A better description might be the order in which
instructions are presented to a CPU's execution unit.  Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po.  Also, it is by definition
an ordering so it cannot have any cycles.

po-loc is a sub-relation of po.  It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).

Although this may seem straightforward, there is one subtle aspect to
program order we need to explain.  The LKMM was inspired by low-level
architectural memory models which describe the behavior of machine
code, and it retains their outlook to a considerable extent.  The
read, write, and fence events used by the model are close in spirit to
individual machine instructions.  Nevertheless, the LKMM describes
kernel code written in C, and the mapping from C to machine code can
be extremely complex.

Optimizing compilers have great freedom in the way they translate
source code to object code.  They are allowed to apply transformations
that add memory accesses, eliminate accesses, combine them, split them
into pieces, or move them around.  The use of READ_ONCE(), WRITE_ONCE(),
or one of the other atomic or synchronization primitives prevents a
large number of compiler optimizations.  In particular, it is guaranteed
that the compiler will not remove such accesses from the generated code
(unless it can prove the accesses will never be executed), it will not
change the order in which they occur in the code (within limits imposed
by the C standard), and it will not introduce extraneous accesses.

The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
than ordinary memory accesses.  Thanks to this usage, we can be certain
that in the MP example, the compiler won't reorder P0's write event to
buf and P0's write event to flag, and similarly for the other shared
memory accesses in the examples.

Since private variables are not shared between CPUs, they can be
accessed normally without READ_ONCE() or WRITE_ONCE().  In fact, they
need not even be stored in normal memory at all -- in principle a
private variable could be stored in a CPU register (hence the convention
that these variables have names starting with the letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect, and under some circumstances it is possible for the
compiler to undermine the memory model.  Here is an example.  Suppose
both branches of an "if" statement store the same value to the same
location:

	r1 = READ_ONCE(x);
	if (r1) {
		WRITE_ONCE(y, 2);
		...  /* do something */
	} else {
		WRITE_ONCE(y, 2);
		...  /* do something else */
	}

For this code, the LKMM predicts that the load from x will always be
executed before either of the stores to y.  However, a compiler could
lift the stores out of the conditional, transforming the code into
something resembling:

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, 2);
	if (r1) {
		...  /* do something */
	} else {
		...  /* do something else */
	}

Given this version of the code, the LKMM would predict that the load
from x could be executed after the store to y.  Thus, the memory
model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many
operators and function calls can be evaluated in any order.  For
example:

	r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the
memory model cannot assume there is a fixed program order relation
between them.  (In fact, if the function calls are inlined then the
compiler might even interleave their object code.)


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the
execution of the second event depends in some way on a value obtained
from memory by the first.  The first event must be a read, and the
value it obtains must somehow affect what the second event does.
There are three kinds of dependencies: data, address (addr), and
control (ctrl).

A read and a write event are linked by a data dependency if the value
obtained by the read affects the value stored by the write.  As a very
simple example:

	int x, y;

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value
loaded by the READ_ONCE.  Such dependencies can wind through
arbitrarily complicated computations, and a write can depend on the
values of multiple reads.

A read event and another memory access event are linked by an address
dependency if the value obtained by the read affects the location
accessed by the other event.  The second event can be either a read or
a write.  Here's another simple example:

	int a[20];
	int i;

	r1 = READ_ONCE(i);
	r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the
index value loaded by the first.  Pointer indirection also gives rise
to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.
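
For instance, in the following sketch (using a hypothetical shared
pointer ptr), the address accessed by the second READ_ONCE() is
whatever value the first READ_ONCE() loaded, so the two loads are
linked by an address dependency:

	int *r1;
	int r2;

	r1 = READ_ONCE(ptr);
	r2 = READ_ONCE(*r1);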

Finally, a read event and another memory access event are linked by a
control dependency if the value obtained by the read affects whether
the second event is executed at all.  Simple example:

	int x, y;

	r1 = READ_ONCE(x);
	if (r1)
		WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression
which depends on the value obtained by the READ_ONCE(); hence there is
a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that
come earlier in program order.  Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X.  It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value loaded by the read is the value that was stored by the
write.  In colloquial terms, the load "reads from" the store.  We
write W ->rf R to indicate that the load R reads from the store W.  We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as
though it had been written there by an imaginary initial store that
executes on a separate CPU before the main program runs.

Usage of the rf relation implicitly assumes that loads will always
read from a single store.  It doesn't apply properly in the presence
of load-tearing, where a load obtains some of its bits from one store
and some of them from another store.  Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 0x1234);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
	}

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size
accesses are used.  Consider this example:

	union {
		u32	w;
		u16	h[2];
	} x;

	P0()
	{
		WRITE_ONCE(x.h[0], 0x1234);
		WRITE_ONCE(x.h[1], 0x5678);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x.w);
	}

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores.  It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt to do so.  It requires all accesses to be properly aligned and
of the location's actual size.


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a
multi-processor system, the CPUs must share a consistent view of the
memory contents.  Specifically, it requires that for each location in
shared memory, the stores to that location must form a single global
ordering which all the CPUs agree on (the coherence order), and this
ordering must be consistent with the program order for accesses to
that location.

To put it another way, for any variable x, the coherence order (co) of
the stores to x is simply the order in which the stores overwrite one
another.  The imaginary store which establishes x's initial value
comes first in the coherence order; the store which directly
overwrites the initial value comes second; the store which overwrites
that value comes third, and so on.

You can think of the coherence order as being the order in which the
stores reach x's location in memory (or if you prefer a more
hardware-centric view, the order in which the stores get written to
x's cache line).  We write W ->co W' if W comes before W' in the
coherence order, that is, if the value stored by W gets overwritten,
directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order.  This
requirement takes the form of four coherency rules:

	Write-write coherence: If W ->po-loc W' (i.e., W comes before
	W' in program order and they access the same location), where W
	and W' are two stores, then W ->co W'.

	Write-read coherence: If W ->po-loc R, where W is a store and R
	is a load, then R must read from W or from some other store
	which comes after W in the coherence order.

	Read-write coherence: If R ->po-loc W, where R is a load and W
	is a store, then the store which R reads from must come before
	W in the coherence order.

	Read-read coherence: If R ->po-loc R', where R and R' are two
	loads, then either they read from the same store or else the
	store read by R comes before the store read by R' in the
	coherence order.

This is sometimes referred to as sequential consistency per variable,
because it means that the accesses to any single memory location obey
the rules of the Sequential Consistency memory model.  (According to
Wikipedia, sequential consistency per variable and cache coherence
mean the same thing except that cache coherence includes an extra
requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence.  Indeed, our
expectation of cache coherence is so deeply ingrained that violations
of its requirements look more like hardware bugs than programming
errors:

	int x;

	P0()
	{
		WRITE_ONCE(x, 17);
		WRITE_ONCE(x, 23);
	}

If the final value stored in x after this code ran was 17, you would
think your computer was broken.  It would be a violation of the
write-write coherence rule: Since the store of 23 comes later in
program order, it must also come later in x's coherence order and
thus must overwrite the store of 17.

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 666);
	}

If r1 = 666 at the end, this would violate the read-write coherence
rule: The READ_ONCE() load comes before the WRITE_ONCE() store in
program order, so it must not read from that store but rather from one
coming earlier in the coherence order (in this case, x's initial
value).

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 5);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
imaginary store which establishes x's initial value) at the end, this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so it must not read from a store that
comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of
READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5
and r2 = 0!  This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another motivation for using READ_ONCE() when accessing shared memory
locations.)

Just like the po relation, co is inherently an ordering -- it is not
possible for a store to directly or indirectly overwrite itself!  And
just like with the rf relation, we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never
related by co, just as instructions on different CPUs are never
related by po.  Coherence order is strictly per-location, or if you
prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.  It describes the situation where a load reads a value that gets
overwritten by a store.  In other words, we have R ->fr W when the
value that R reads is overwritten (directly or indirectly) by W, or
equivalently, when R reads from a store which comes earlier than W in
the coherence order.

For example:

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);
	}

The value loaded from x will be 0 (assuming cache coherence!), and it
gets overwritten by the value 2.  Thus there is an fr link from the
READ_ONCE() to the WRITE_ONCE().  If the code contained any later
stores to x, there would also be fr links from the READ_ONCE() to
them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that the fr relation is determined entirely by the rf and co
relations; it is not independent.  Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W.  In symbols,

	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that
the models arise from an abstract view of how a computer system
operates.  Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem.
The CPUs are responsible for executing instructions (not necessarily
in program order), and they communicate with the memory subsystem.
For the most part, executing an instruction requires a CPU to perform
only internal operations.  However, loads, stores, and fences involve
more.

When CPU C executes a store instruction, it tells the memory subsystem
to store a certain value at a certain location.  The memory subsystem
propagates the store to all the other CPUs as well as to RAM.  (As a
special case, we say that the store propagates to its own CPU at the
time it is executed.)  The memory subsystem also determines where the
store falls in the location's coherence order.  In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order.  If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R.  Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory.  The memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.
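
As a simple illustration (a sketch, not taken from the discussion
above):

	int x = 0;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(x);
	}

If the store to x has not yet executed when the load is carried out,
the load obtains the value 1 by forwarding; otherwise the load is
satisfied from memory and again obtains 1, since P0's store has
already propagated to its own CPU.  Either way r1 ends up equal to 1.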

(In fact, the picture needs to be a little more complicated than this.
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache.  A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed.  On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model.  But on Alpha, the local caches
have a partitioned design that results in non-FIFO behavior.  We will
discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be
restarted under certain circumstances.  The memory model ignores these
premature executions; we simply say that the load executes at the
final time it is forwarded or satisfied.

Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence.  However, fences do constrain the way CPUs and the
memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in
program order.  Exactly which instructions are ordered depends on the
type of fence:

	Strong fences, including smp_mb() and synchronize_rcu(), force
	the CPU to execute all po-earlier instructions before any
	po-later instructions;

	smp_rmb() forces the CPU to execute all po-earlier loads
	before any po-later loads;

	smp_wmb() forces the CPU to execute all po-earlier stores
	before any po-later stores;

	Acquire fences, such as smp_load_acquire(), force the CPU to
	execute the load associated with the fence (e.g., the load
	part of an smp_load_acquire()) before any po-later
	instructions;

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).

Second, some types of fence affect the way the memory subsystem
propagates stores.  When a fence instruction is executed on CPU C:

	For each other CPU C', smp_wmb() forces all po-earlier stores
	on C to propagate to C' before any po-later stores do.

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

	Any store which propagates to C before a strong fence is
	executed (including all po-earlier stores on C) is forced to
	propagate to all other CPUs before any instructions po-after
	the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence.  We describe this property by saying that release fences and
strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).
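
A classic illustration of A-cumulativity is a three-CPU variant of the
message-passing pattern (a sketch, not taken from the text above):

	int x = 0, y = 0;

	P0()
	{
		WRITE_ONCE(x, 1);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
		smp_store_release(&y, 1);
	}

	P2()
	{
		int r2, r3;

		r2 = smp_load_acquire(&y);
		r3 = READ_ONCE(x);
	}

If r1 = 1 and r2 = 1 then P0's store to x propagated to P1 before P1's
release fence executed, so A-cumulativity forces it to propagate to P2
before the store to y does; hence r3 must also be 1.  An smp_wmb() in
P1 instead of the release would not, by itself, give this guarantee in
the operational model, because P0's store was not executed on P1.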

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
defined to link memory access events E and F whenever:

	E and F are both stores on the same CPU and an smp_wmb() fence
	event occurs between them in program order; or

	F is a release fence and some X comes before F in program order,
	where either X = E or else E ->rf X; or

	A strong fence event occurs between some X and F in program
	order, where either X = E or else E ->rf X.

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does.  However, for different CPUs C and C', it does not
require W to propagate to C before W' propagates to C'.
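
For example, in the sketch

	P0()
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

the store to x is cumul-fence-linked to the store to y (the first case
in the list above), and so the store to x must propagate to any given
CPU before the store to y does.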


DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design
outlined above.  These restrictions involve the necessity of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is, among other things.

The formal version of the LKMM is defined by six requirements, or
axioms:

	Sequential consistency per variable: This requires that the
	system obey the four coherency rules.

	Atomicity: This requires that atomic read-modify-write
	operations really are atomic, that is, no other stores can
	sneak into the middle of such an update.

	Happens-before: This requires that certain instructions are
	executed in a specific order.

	Propagation: This requires that certain stores propagate to
	CPUs and to RAM in a specific order.

	Rcu: This requires that RCU read-side critical sections and
	grace periods obey the rules of RCU, in particular, the
	Grace-Period Guarantee.

	Plain-coherence: This requires that plain memory accesses
	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
	the operational model's rules regarding cache coherence.

The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11).  The "happens-before" and
"propagation" axioms have analogs in other memory models as well.  The
"rcu" and "plain-coherence" axioms are specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed
shared location in memory form a global ordering.  We can imagine
inserting the loads from that location into this ordering, by placing
each load between the store that it reads from and the following
store.  This leaves the relative positions of loads that read from the
same store unspecified; let's say they are inserted in program order,
first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering.  The LKMM's
"coherence" axiom expresses this by requiring the union of these
relations not to have any cycles.  This means it must not be possible
to find events

	X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc.  This has to
hold if the accesses to the fixed memory location can be ordered as
cache coherence demands.
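
For example, the read-read coherence violation shown earlier (r1 = 5
and r2 = 0) corresponds to just such a cycle.  Writing W for P0's
store of 5, W0 for the imaginary initial store, and R1, R2 for P1's
two loads, that outcome would require

	R1 ->po-loc R2, R2 ->fr W (since W0 ->rf R2 and W0 ->co W),
	and W ->rf R1,

which forms the forbidden cycle R1 ->po-loc R2 ->fr W ->rf R1.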

Although it is not obvious, it can be shown that the converse is also
true: This LKMM axiom implies that the four coherency rules are
obeyed.


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?  It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two.  We should never have
the following sequence of events:

	CPU 0 loads x obtaining 13;
					CPU 1 loads x obtaining 13;
	CPU 0 stores 14 to x;
					CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.  To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.
945*4882a593Smuzhiyun
946*4882a593SmuzhiyunThe same analysis applies to all atomic update operations.  Therefore,
947*4882a593Smuzhiyunto enforce atomicity the LKMM requires that atomic updates follow this
948*4882a593Smuzhiyunrule: Whenever R and W are the read and write events composing an
949*4882a593Smuzhiyunatomic read-modify-write and W' is the write event which R reads from,
950*4882a593Smuzhiyunthere must not be any stores coming between W' and W in the coherence
951*4882a593Smuzhiyunorder.  Equivalently,
952*4882a593Smuzhiyun
953*4882a593Smuzhiyun	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),
954*4882a593Smuzhiyun
955*4882a593Smuzhiyunwhere the rmw relation links the read and write events making up each
956*4882a593Smuzhiyunatomic update.  This is what the LKMM's "atomic" axiom says.
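
For instance, treating x as an atomic_t, the scenario described at
the start of this section corresponds to a sketch along these lines;
because no store may intrude between either CPU's read and write of
x, the final value of x must be 2, never 1:

	atomic_t x = ATOMIC_INIT(0);

	P0()
	{
		atomic_inc(&x);
	}

	P1()
	{
		atomic_inc(&x);
	}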
957*4882a593Smuzhiyun
958*4882a593Smuzhiyun
959*4882a593SmuzhiyunTHE PRESERVED PROGRAM ORDER RELATION: ppo
960*4882a593Smuzhiyun-----------------------------------------
961*4882a593Smuzhiyun
962*4882a593SmuzhiyunThere are many situations where a CPU is obliged to execute two
963*4882a593Smuzhiyuninstructions in program order.  We amalgamate them into the ppo (for
964*4882a593Smuzhiyun"preserved program order") relation, which links the po-earlier
965*4882a593Smuzhiyuninstruction to the po-later instruction and is thus a sub-relation of
966*4882a593Smuzhiyunpo.
967*4882a593Smuzhiyun
968*4882a593SmuzhiyunThe operational model already includes a description of one such
969*4882a593Smuzhiyunsituation: Fences are a source of ppo links.  Suppose X and Y are
970*4882a593Smuzhiyunmemory accesses with X ->po Y; then the CPU must execute X before Y if
971*4882a593Smuzhiyunany of the following hold:
972*4882a593Smuzhiyun
973*4882a593Smuzhiyun	A strong (smp_mb() or synchronize_rcu()) fence occurs between
974*4882a593Smuzhiyun	X and Y;
975*4882a593Smuzhiyun
976*4882a593Smuzhiyun	X and Y are both stores and an smp_wmb() fence occurs between
977*4882a593Smuzhiyun	them;
978*4882a593Smuzhiyun
979*4882a593Smuzhiyun	X and Y are both loads and an smp_rmb() fence occurs between
980*4882a593Smuzhiyun	them;
981*4882a593Smuzhiyun
982*4882a593Smuzhiyun	X is also an acquire fence, such as smp_load_acquire();
983*4882a593Smuzhiyun
984*4882a593Smuzhiyun	Y is also a release fence, such as smp_store_release().
985*4882a593Smuzhiyun
986*4882a593SmuzhiyunAnother possibility, not mentioned earlier but discussed in the next
987*4882a593Smuzhiyunsection, is:
988*4882a593Smuzhiyun
989*4882a593Smuzhiyun	X and Y are both loads, X ->addr Y (i.e., there is an address
990*4882a593Smuzhiyun	dependency from X to Y), and X is a READ_ONCE() or an atomic
991*4882a593Smuzhiyun	access.
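
As a quick illustration of the acquire and release cases listed
above, here is a rough sketch of the MP pattern written with a
release store and an acquire load.  The release guarantees that the
store to buf propagates to P1 before the store to flag does, and the
acquire guarantees that P1's load of flag executes before its load of
buf, so the r1 = 1, r2 = 0 outcome is forbidden:

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_store_release(&flag, 1);
	}

	P1()
	{
		int r1, r2;

		r1 = smp_load_acquire(&flag);
		r2 = READ_ONCE(buf);
	}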
992*4882a593Smuzhiyun
993*4882a593SmuzhiyunDependencies can also cause instructions to be executed in program
994*4882a593Smuzhiyunorder.  This is uncontroversial when the second instruction is a
995*4882a593Smuzhiyunstore; either a data, address, or control dependency from a load R to
996*4882a593Smuzhiyuna store W will force the CPU to execute R before W.  This is very
997*4882a593Smuzhiyunsimply because the CPU cannot tell the memory subsystem about W's
998*4882a593Smuzhiyunstore before it knows what value should be stored (in the case of a
999*4882a593Smuzhiyundata dependency), what location it should be stored into (in the case
1000*4882a593Smuzhiyunof an address dependency), or whether the store should actually take
1001*4882a593Smuzhiyunplace (in the case of a control dependency).
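
To make this concrete, here is a sketch of the load-buffering pattern
with a data dependency on each CPU.  Because each dependency forces
the load to execute before the po-later store, the r0 = 1, r1 = 1
outcome (each CPU reading the value stored by the other) is
forbidden:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		r0 = READ_ONCE(x);
		WRITE_ONCE(y, r0);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(y);
		WRITE_ONCE(x, r1);
	}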
1002*4882a593Smuzhiyun
1003*4882a593SmuzhiyunDependencies to load instructions are more problematic.  To begin with,
1004*4882a593Smuzhiyunthere is no such thing as a data dependency to a load.  Next, a CPU
1005*4882a593Smuzhiyunhas no reason to respect a control dependency to a load, because it
1006*4882a593Smuzhiyuncan always satisfy the second load speculatively before the first, and
1007*4882a593Smuzhiyunthen ignore the result if it turns out that the second load shouldn't
1008*4882a593Smuzhiyunbe executed after all.  And lastly, the real difficulties begin when
1009*4882a593Smuzhiyunwe consider address dependencies to loads.
1010*4882a593Smuzhiyun
1011*4882a593SmuzhiyunTo be fair about it, all Linux-supported architectures do execute
1012*4882a593Smuzhiyunloads in program order if there is an address dependency between them.
1013*4882a593SmuzhiyunAfter all, a CPU cannot ask the memory subsystem to load a value from
1014*4882a593Smuzhiyuna particular location before it knows what that location is.  However,
1015*4882a593Smuzhiyunthe split-cache design used by Alpha can cause it to behave in a way
1016*4882a593Smuzhiyunthat looks as if the loads were executed out of order (see the next
1017*4882a593Smuzhiyunsection for more details).  The kernel includes a workaround for this
1018*4882a593Smuzhiyunproblem when the loads come from READ_ONCE(), and therefore the LKMM
1019*4882a593Smuzhiyunincludes address dependencies to loads in the ppo relation.
1020*4882a593Smuzhiyun
1021*4882a593SmuzhiyunOn the other hand, dependencies can indirectly affect the ordering of
1022*4882a593Smuzhiyuntwo loads.  This happens when there is a dependency from a load to a
1023*4882a593Smuzhiyunstore and a second, po-later load reads from that store:
1024*4882a593Smuzhiyun
1025*4882a593Smuzhiyun	R ->dep W ->rfi R',
1026*4882a593Smuzhiyun
1027*4882a593Smuzhiyunwhere the dep link can be either an address or a data dependency.  In
1028*4882a593Smuzhiyunthis situation we know it is possible for the CPU to execute R' before
1029*4882a593SmuzhiyunW, because it can forward the value that W will store to R'.  But it
1030*4882a593Smuzhiyuncannot execute R' before R, because it cannot forward the value before
1031*4882a593Smuzhiyunit knows what that value is, or that W and R' do access the same
1032*4882a593Smuzhiyunlocation.  However, if there is merely a control dependency between R
1033*4882a593Smuzhiyunand W then the CPU can speculatively forward W to R' before executing
1034*4882a593SmuzhiyunR; if the speculation turns out to be wrong then the CPU merely has to
1035*4882a593Smuzhiyunrestart or abandon R'.
1036*4882a593Smuzhiyun
1037*4882a593Smuzhiyun(In theory, a CPU might forward a store to a load when it runs across
1038*4882a593Smuzhiyunan address dependency like this:
1039*4882a593Smuzhiyun
1040*4882a593Smuzhiyun	r1 = READ_ONCE(ptr);
1041*4882a593Smuzhiyun	WRITE_ONCE(*r1, 17);
1042*4882a593Smuzhiyun	r2 = READ_ONCE(*r1);
1043*4882a593Smuzhiyun
1044*4882a593Smuzhiyunbecause it could tell that the store and the second load access the
1045*4882a593Smuzhiyunsame location even before it knows what the location's address is.
1046*4882a593SmuzhiyunHowever, none of the architectures supported by the Linux kernel do
1047*4882a593Smuzhiyunthis.)
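
Returning to the R ->dep W ->rfi R' pattern, a bare-bones sketch of
it might look like this, with a data dependency playing the role of
dep.  The CPU may satisfy the final load by forwarding the value from
the store to y, but it cannot do so before the load of x has
executed, since until then it does not know what value will be
stored:

	int x, y;

	P0()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		WRITE_ONCE(y, r1);
		r2 = READ_ONCE(y);
	}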
1048*4882a593Smuzhiyun
1049*4882a593SmuzhiyunTwo memory accesses of the same location must always be executed in
1050*4882a593Smuzhiyunprogram order if the second access is a store.  Thus, if we have
1051*4882a593Smuzhiyun
1052*4882a593Smuzhiyun	R ->po-loc W
1053*4882a593Smuzhiyun
1054*4882a593Smuzhiyun(the po-loc link says that R comes before W in program order and they
1055*4882a593Smuzhiyunaccess the same location), the CPU is obliged to execute W after R.
1056*4882a593SmuzhiyunIf it executed W first then the memory subsystem would respond to R's
1057*4882a593Smuzhiyunread request with the value stored by W (or an even later store), in
1058*4882a593Smuzhiyunviolation of the read-write coherence rule.  Similarly, if we had
1059*4882a593Smuzhiyun
1060*4882a593Smuzhiyun	W ->po-loc W'
1061*4882a593Smuzhiyun
1062*4882a593Smuzhiyunand the CPU executed W' before W, then the memory subsystem would put
1063*4882a593SmuzhiyunW' before W in the coherence order.  It would effectively cause W to
1064*4882a593Smuzhiyunoverwrite W', in violation of the write-write coherence rule.
1065*4882a593Smuzhiyun(Interestingly, an early ARMv8 memory model, now obsolete, proposed
1066*4882a593Smuzhiyunallowing out-of-order writes like this to occur.  The model avoided
1067*4882a593Smuzhiyunviolating the write-write coherence rule by requiring the CPU not to
1068*4882a593Smuzhiyunsend the W write to the memory subsystem at all!)
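
Seen in terms of test outcomes, the R ->po-loc W requirement means a
load can never observe a po-later store performed by its own CPU.  In
a sketch like the following, r1 may be 0 or 1 but never 2:

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);
	}

	P1()
	{
		WRITE_ONCE(x, 1);
	}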
1069*4882a593Smuzhiyun
1070*4882a593Smuzhiyun
1071*4882a593SmuzhiyunAND THEN THERE WAS ALPHA
1072*4882a593Smuzhiyun------------------------
1073*4882a593Smuzhiyun
1074*4882a593SmuzhiyunAs mentioned above, the Alpha architecture is unique in that it does
1075*4882a593Smuzhiyunnot appear to respect address dependencies to loads.  This means that
1076*4882a593Smuzhiyuncode such as the following:
1077*4882a593Smuzhiyun
1078*4882a593Smuzhiyun	int x = 0;
1079*4882a593Smuzhiyun	int y = -1;
1080*4882a593Smuzhiyun	int *ptr = &y;
1081*4882a593Smuzhiyun
1082*4882a593Smuzhiyun	P0()
1083*4882a593Smuzhiyun	{
1084*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1085*4882a593Smuzhiyun		smp_wmb();
1086*4882a593Smuzhiyun		WRITE_ONCE(ptr, &x);
1087*4882a593Smuzhiyun	}
1088*4882a593Smuzhiyun
1089*4882a593Smuzhiyun	P1()
1090*4882a593Smuzhiyun	{
1091*4882a593Smuzhiyun		int *r1;
1092*4882a593Smuzhiyun		int r2;
1093*4882a593Smuzhiyun
1094*4882a593Smuzhiyun		r1 = ptr;
1095*4882a593Smuzhiyun		r2 = READ_ONCE(*r1);
1096*4882a593Smuzhiyun	}
1097*4882a593Smuzhiyun
1098*4882a593Smuzhiyuncan malfunction on Alpha systems (notice that P1 uses an ordinary load
1099*4882a593Smuzhiyunto read ptr instead of READ_ONCE()).  It is quite possible that r1 = &x
1100*4882a593Smuzhiyunand r2 = 0 at the end, in spite of the address dependency.
1101*4882a593Smuzhiyun
1102*4882a593SmuzhiyunAt first glance this doesn't seem to make sense.  We know that the
1103*4882a593Smuzhiyunsmp_wmb() forces P0's store to x to propagate to P1 before the store
1104*4882a593Smuzhiyunto ptr does.  And since P1 can't execute its second load
1105*4882a593Smuzhiyununtil it knows what location to load from, i.e., after executing its
1106*4882a593Smuzhiyunfirst load, the value x = 1 must have propagated to P1 before the
1107*4882a593Smuzhiyunsecond load executed.  So why doesn't r2 end up equal to 1?
1108*4882a593Smuzhiyun
1109*4882a593SmuzhiyunThe answer lies in the Alpha's split local caches.  Although the two
1110*4882a593Smuzhiyunstores do reach P1's local cache in the proper order, it can happen
1111*4882a593Smuzhiyunthat the first store is processed by a busy part of the cache while
1112*4882a593Smuzhiyunthe second store is processed by an idle part.  As a result, the x = 1
1113*4882a593Smuzhiyunvalue may not become available for P1's CPU to read until after the
1114*4882a593Smuzhiyunptr = &x value does, leading to the undesirable result above.  The
1115*4882a593Smuzhiyunfinal effect is that even though the two loads really are executed in
1116*4882a593Smuzhiyunprogram order, it appears that they aren't.
1117*4882a593Smuzhiyun
1118*4882a593SmuzhiyunThis could not have happened if the local cache had processed the
1119*4882a593Smuzhiyunincoming stores in FIFO order.  By contrast, other architectures
1120*4882a593Smuzhiyunmaintain at least the appearance of FIFO order.
1121*4882a593Smuzhiyun
1122*4882a593SmuzhiyunIn practice, this difficulty is solved by inserting a special fence
1123*4882a593Smuzhiyunbetween P1's two loads when the kernel is compiled for the Alpha
1124*4882a593Smuzhiyunarchitecture.  In fact, as of version 4.15, the kernel automatically
1125*4882a593Smuzhiyunadds this fence after every READ_ONCE() and atomic load on Alpha.  The
1126*4882a593Smuzhiyuneffect of the fence is to cause the CPU not to execute any po-later
1127*4882a593Smuzhiyuninstructions until after the local cache has finished processing all
1128*4882a593Smuzhiyunthe stores it has already received.  Thus, if the code was changed to:
1129*4882a593Smuzhiyun
1130*4882a593Smuzhiyun	P1()
1131*4882a593Smuzhiyun	{
1132*4882a593Smuzhiyun		int *r1;
1133*4882a593Smuzhiyun		int r2;
1134*4882a593Smuzhiyun
1135*4882a593Smuzhiyun		r1 = READ_ONCE(ptr);
1136*4882a593Smuzhiyun		r2 = READ_ONCE(*r1);
1137*4882a593Smuzhiyun	}
1138*4882a593Smuzhiyun
1139*4882a593Smuzhiyunthen we would never get r1 = &x and r2 = 0.  By the time P1 executed
1140*4882a593Smuzhiyunits second load, the x = 1 store would already be fully processed by
1141*4882a593Smuzhiyunthe local cache and available for satisfying the read request.  Thus
1142*4882a593Smuzhiyunwe have yet another reason why shared data should always be read with
1143*4882a593SmuzhiyunREAD_ONCE() or another synchronization primitive rather than accessed
1144*4882a593Smuzhiyundirectly.
1145*4882a593Smuzhiyun
1146*4882a593SmuzhiyunThe LKMM requires that smp_rmb(), acquire fences, and strong fences
1147*4882a593Smuzhiyunshare this property: They do not allow the CPU to execute any po-later
1148*4882a593Smuzhiyuninstructions (or po-later loads in the case of smp_rmb()) until all
1149*4882a593Smuzhiyunoutstanding stores have been processed by the local cache.  In the
1150*4882a593Smuzhiyuncase of a strong fence, the CPU first has to wait for all of its
1151*4882a593Smuzhiyunpo-earlier stores to propagate to every other CPU in the system; then
1152*4882a593Smuzhiyunit has to wait for the local cache to process all the stores received
1153*4882a593Smuzhiyunas of that time -- not just the stores received when the strong fence
1154*4882a593Smuzhiyunbegan.
1155*4882a593Smuzhiyun
1156*4882a593SmuzhiyunAnd of course, none of this matters for any architecture other than
1157*4882a593SmuzhiyunAlpha.
1158*4882a593Smuzhiyun
1159*4882a593Smuzhiyun
1160*4882a593SmuzhiyunTHE HAPPENS-BEFORE RELATION: hb
1161*4882a593Smuzhiyun-------------------------------
1162*4882a593Smuzhiyun
1163*4882a593SmuzhiyunThe happens-before relation (hb) links memory accesses that have to
1164*4882a593Smuzhiyunexecute in a certain order.  hb includes the ppo relation and two
1165*4882a593Smuzhiyunothers, one of which is rfe.
1166*4882a593Smuzhiyun
1167*4882a593SmuzhiyunW ->rfe R implies that W and R are on different CPUs.  It also means
1168*4882a593Smuzhiyunthat W's store must have propagated to R's CPU before R executed;
1169*4882a593Smuzhiyunotherwise R could not have read the value stored by W.  Therefore W
1170*4882a593Smuzhiyunmust have executed before R, and so we have W ->hb R.
1171*4882a593Smuzhiyun
1172*4882a593SmuzhiyunThe equivalent fact need not hold if W ->rfi R (i.e., W and R are on
1173*4882a593Smuzhiyunthe same CPU).  As we have already seen, the operational model allows
1174*4882a593SmuzhiyunW's value to be forwarded to R in such cases, meaning that R may well
1175*4882a593Smuzhiyunexecute before W does.
1176*4882a593Smuzhiyun
1177*4882a593SmuzhiyunIt's important to understand that neither coe nor fre is included in
1178*4882a593Smuzhiyunhb, despite their similarities to rfe.  For example, suppose we have
1179*4882a593SmuzhiyunW ->coe W'.  This means that W and W' are stores to the same location,
1180*4882a593Smuzhiyunthey execute on different CPUs, and W comes before W' in the coherence
1181*4882a593Smuzhiyunorder (i.e., W' overwrites W).  Nevertheless, it is possible for W' to
1182*4882a593Smuzhiyunexecute before W, because the decision as to which store overwrites
1183*4882a593Smuzhiyunthe other is made later by the memory subsystem.  When the stores are
1184*4882a593Smuzhiyunnearly simultaneous, either one can come out on top.  Similarly,
1185*4882a593SmuzhiyunR ->fre W means that W overwrites the value which R reads, but it
1186*4882a593Smuzhiyundoesn't mean that W has to execute after R.  All that's necessary is
1187*4882a593Smuzhiyunfor the memory subsystem not to propagate W to R's CPU until after R
1188*4882a593Smuzhiyunhas executed, which is possible if W executes shortly before R.
1189*4882a593Smuzhiyun
1190*4882a593SmuzhiyunThe third relation included in hb is like ppo, in that it only links
1191*4882a593Smuzhiyunevents that are on the same CPU.  However, it is more difficult to
1192*4882a593Smuzhiyunexplain, because it arises only indirectly from the requirement of
1193*4882a593Smuzhiyuncache coherence.  The relation is called prop, and it links two events
1194*4882a593Smuzhiyunon CPU C in situations where a store from some other CPU comes after
1195*4882a593Smuzhiyunthe first event in the coherence order and propagates to C before the
1196*4882a593Smuzhiyunsecond event executes.
1197*4882a593Smuzhiyun
1198*4882a593SmuzhiyunThis is best explained with some examples.  The simplest case looks
1199*4882a593Smuzhiyunlike this:
1200*4882a593Smuzhiyun
1201*4882a593Smuzhiyun	int x;
1202*4882a593Smuzhiyun
1203*4882a593Smuzhiyun	P0()
1204*4882a593Smuzhiyun	{
1205*4882a593Smuzhiyun		int r1;
1206*4882a593Smuzhiyun
1207*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1208*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1209*4882a593Smuzhiyun	}
1210*4882a593Smuzhiyun
1211*4882a593Smuzhiyun	P1()
1212*4882a593Smuzhiyun	{
1213*4882a593Smuzhiyun		WRITE_ONCE(x, 8);
1214*4882a593Smuzhiyun	}
1215*4882a593Smuzhiyun
1216*4882a593SmuzhiyunIf r1 = 8 at the end then P0's accesses must have executed in program
1217*4882a593Smuzhiyunorder.  We can deduce this from the operational model; if P0's load
1218*4882a593Smuzhiyunhad executed before its store then the value of the store would have
1219*4882a593Smuzhiyunbeen forwarded to the load, so r1 would have ended up equal to 1, not
1220*4882a593Smuzhiyun8.  In this case there is a prop link from P0's write event to its read
1221*4882a593Smuzhiyunevent, because P1's store came after P0's store in x's coherence
1222*4882a593Smuzhiyunorder, and P1's store propagated to P0 before P0's load executed.
1223*4882a593Smuzhiyun
1224*4882a593SmuzhiyunAn equally simple case involves two loads of the same location that
1225*4882a593Smuzhiyunread from different stores:
1226*4882a593Smuzhiyun
1227*4882a593Smuzhiyun	int x = 0;
1228*4882a593Smuzhiyun
1229*4882a593Smuzhiyun	P0()
1230*4882a593Smuzhiyun	{
1231*4882a593Smuzhiyun		int r1, r2;
1232*4882a593Smuzhiyun
1233*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1234*4882a593Smuzhiyun		r2 = READ_ONCE(x);
1235*4882a593Smuzhiyun	}
1236*4882a593Smuzhiyun
1237*4882a593Smuzhiyun	P1()
1238*4882a593Smuzhiyun	{
1239*4882a593Smuzhiyun		WRITE_ONCE(x, 9);
1240*4882a593Smuzhiyun	}
1241*4882a593Smuzhiyun
1242*4882a593SmuzhiyunIf r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
1243*4882a593Smuzhiyunin program order.  If the second load had executed before the first
1244*4882a593Smuzhiyunthen the x = 9 store must have been propagated to P0 before the first
1245*4882a593Smuzhiyunload executed, and so r1 would have been 9 rather than 0.  In this
1246*4882a593Smuzhiyuncase there is a prop link from P0's first read event to its second,
1247*4882a593Smuzhiyunbecause P1's store overwrote the value read by P0's first load, and
1248*4882a593SmuzhiyunP1's store propagated to P0 before P0's second load executed.
1249*4882a593Smuzhiyun
1250*4882a593SmuzhiyunLess trivial examples of prop all involve fences.  Unlike the simple
1251*4882a593Smuzhiyunexamples above, they can require that some instructions are executed
1252*4882a593Smuzhiyunout of program order.  This next one should look familiar:
1253*4882a593Smuzhiyun
1254*4882a593Smuzhiyun	int buf = 0, flag = 0;
1255*4882a593Smuzhiyun
1256*4882a593Smuzhiyun	P0()
1257*4882a593Smuzhiyun	{
1258*4882a593Smuzhiyun		WRITE_ONCE(buf, 1);
1259*4882a593Smuzhiyun		smp_wmb();
1260*4882a593Smuzhiyun		WRITE_ONCE(flag, 1);
1261*4882a593Smuzhiyun	}
1262*4882a593Smuzhiyun
1263*4882a593Smuzhiyun	P1()
1264*4882a593Smuzhiyun	{
1265*4882a593Smuzhiyun		int r1;
1266*4882a593Smuzhiyun		int r2;
1267*4882a593Smuzhiyun
1268*4882a593Smuzhiyun		r1 = READ_ONCE(flag);
1269*4882a593Smuzhiyun		r2 = READ_ONCE(buf);
1270*4882a593Smuzhiyun	}
1271*4882a593Smuzhiyun
1272*4882a593SmuzhiyunThis is the MP pattern again, with an smp_wmb() fence between the two
1273*4882a593Smuzhiyunstores.  If r1 = 1 and r2 = 0 at the end then there is a prop link
1274*4882a593Smuzhiyunfrom P1's second load to its first (backwards!).  The reason is
1275*4882a593Smuzhiyunsimilar to the previous examples: The value P1 loads from buf gets
1276*4882a593Smuzhiyunoverwritten by P0's store to buf, the fence guarantees that the store
1277*4882a593Smuzhiyunto buf will propagate to P1 before the store to flag does, and the
1278*4882a593Smuzhiyunstore to flag propagates to P1 before P1 reads flag.
1279*4882a593Smuzhiyun
1280*4882a593SmuzhiyunThe prop link says that in order to obtain the r1 = 1, r2 = 0 result,
1281*4882a593SmuzhiyunP1 must execute its second load before the first.  Indeed, if the load
1282*4882a593Smuzhiyunfrom flag were executed first, then the buf = 1 store would already
1283*4882a593Smuzhiyunhave propagated to P1 by the time P1's load from buf executed, so r2
1284*4882a593Smuzhiyunwould have been 1 at the end, not 0.  (The reasoning holds even for
1285*4882a593SmuzhiyunAlpha, although the details are more complicated and we will not go
1286*4882a593Smuzhiyuninto them.)
1287*4882a593Smuzhiyun
1288*4882a593SmuzhiyunBut what if we put an smp_rmb() fence between P1's loads?  The fence
1289*4882a593Smuzhiyunwould force the two loads to be executed in program order, and it
1290*4882a593Smuzhiyunwould generate a cycle in the hb relation: The fence would create a ppo
1291*4882a593Smuzhiyunlink (hence an hb link) from the first load to the second, and the
1292*4882a593Smuzhiyunprop relation would give an hb link from the second load to the first.
1293*4882a593SmuzhiyunSince an instruction can't execute before itself, we are forced to
1294*4882a593Smuzhiyunconclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0
1295*4882a593Smuzhiyunoutcome is impossible -- as it should be.
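
For reference, with the fence added P1 would read:

	P1()
	{
		int r1;
		int r2;

		r1 = READ_ONCE(flag);
		smp_rmb();
		r2 = READ_ONCE(buf);
	}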
1296*4882a593Smuzhiyun
1297*4882a593SmuzhiyunThe formal definition of the prop relation involves a coe or fre link,
1298*4882a593Smuzhiyunfollowed by an arbitrary number of cumul-fence links, ending with an
1299*4882a593Smuzhiyunrfe link.  You can concoct more exotic examples, containing more than
1300*4882a593Smuzhiyunone fence, although this quickly leads to diminishing returns in terms
1301*4882a593Smuzhiyunof complexity.  For instance, here's an example containing a coe link
1302*4882a593Smuzhiyunfollowed by two cumul-fences and an rfe link, utilizing the fact that
1303*4882a593Smuzhiyunrelease fences are A-cumulative:
1304*4882a593Smuzhiyun
1305*4882a593Smuzhiyun	int x, y, z;
1306*4882a593Smuzhiyun
1307*4882a593Smuzhiyun	P0()
1308*4882a593Smuzhiyun	{
1309*4882a593Smuzhiyun		int r0;
1310*4882a593Smuzhiyun
1311*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1312*4882a593Smuzhiyun		r0 = READ_ONCE(z);
1313*4882a593Smuzhiyun	}
1314*4882a593Smuzhiyun
1315*4882a593Smuzhiyun	P1()
1316*4882a593Smuzhiyun	{
1317*4882a593Smuzhiyun		WRITE_ONCE(x, 2);
1318*4882a593Smuzhiyun		smp_wmb();
1319*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1320*4882a593Smuzhiyun	}
1321*4882a593Smuzhiyun
1322*4882a593Smuzhiyun	P2()
1323*4882a593Smuzhiyun	{
1324*4882a593Smuzhiyun		int r2;
1325*4882a593Smuzhiyun
1326*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1327*4882a593Smuzhiyun		smp_store_release(&z, 1);
1328*4882a593Smuzhiyun	}
1329*4882a593Smuzhiyun
1330*4882a593SmuzhiyunIf x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
1331*4882a593Smuzhiyunlink from P0's store to its load.  This is because P0's store gets
1332*4882a593Smuzhiyunoverwritten by P1's store since x = 2 at the end (a coe link), the
1333*4882a593Smuzhiyunsmp_wmb() ensures that P1's store to x propagates to P2 before the
1334*4882a593Smuzhiyunstore to y does (the first cumul-fence), the store to y propagates to P2
1335*4882a593Smuzhiyunbefore P2's load and store execute, P2's smp_store_release()
1336*4882a593Smuzhiyunguarantees that the stores to x and y both propagate to P0 before the
1337*4882a593Smuzhiyunstore to z does (the second cumul-fence), and P0's load executes after the
1338*4882a593Smuzhiyunstore to z has propagated to P0 (an rfe link).
1339*4882a593Smuzhiyun
1340*4882a593SmuzhiyunIn summary, the fact that the hb relation links memory access events
1341*4882a593Smuzhiyunin the order they execute means that it must not have cycles.  This
1342*4882a593Smuzhiyunrequirement is the content of the LKMM's "happens-before" axiom.
1343*4882a593Smuzhiyun
1344*4882a593SmuzhiyunThe LKMM defines yet another relation connected to times of
1345*4882a593Smuzhiyuninstruction execution, but it is not included in hb.  It relies on the
1346*4882a593Smuzhiyunparticular properties of strong fences, which we cover in the next
1347*4882a593Smuzhiyunsection.
1348*4882a593Smuzhiyun
1349*4882a593Smuzhiyun
1350*4882a593SmuzhiyunTHE PROPAGATES-BEFORE RELATION: pb
1351*4882a593Smuzhiyun----------------------------------
1352*4882a593Smuzhiyun
1353*4882a593SmuzhiyunThe propagates-before (pb) relation capitalizes on the special
1354*4882a593Smuzhiyunfeatures of strong fences.  It links two events E and F whenever some
1355*4882a593Smuzhiyunstore is coherence-later than E and propagates to every CPU and to RAM
1356*4882a593Smuzhiyunbefore F executes.  The formal definition requires that E be linked to
1357*4882a593SmuzhiyunF via a coe or fre link, an arbitrary number of cumul-fences, an
1358*4882a593Smuzhiyunoptional rfe link, a strong fence, and an arbitrary number of hb
1359*4882a593Smuzhiyunlinks.  Let's see how this definition works out.
1360*4882a593Smuzhiyun
1361*4882a593SmuzhiyunConsider first the case where E is a store (implying that the sequence
1362*4882a593Smuzhiyunof links begins with coe).  Then there are events W, X, Y, and Z such
1363*4882a593Smuzhiyunthat:
1364*4882a593Smuzhiyun
1365*4882a593Smuzhiyun	E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,
1366*4882a593Smuzhiyun
1367*4882a593Smuzhiyunwhere the * suffix indicates an arbitrary number of links of the
1368*4882a593Smuzhiyunspecified type, and the ? suffix indicates the link is optional (Y may
1369*4882a593Smuzhiyunbe equal to X).  Because of the cumul-fence links, we know that W will
1370*4882a593Smuzhiyunpropagate to Y's CPU before X does, hence before Y executes and hence
1371*4882a593Smuzhiyunbefore the strong fence executes.  Because this fence is strong, we
1372*4882a593Smuzhiyunknow that W will propagate to every CPU and to RAM before Z executes.
1373*4882a593SmuzhiyunAnd because of the hb links, we know that Z will execute before F.
1374*4882a593SmuzhiyunThus W, which comes later than E in the coherence order, will
1375*4882a593Smuzhiyunpropagate to every CPU and to RAM before F executes.
1376*4882a593Smuzhiyun
1377*4882a593SmuzhiyunThe case where E is a load is exactly the same, except that the first
1378*4882a593Smuzhiyunlink in the sequence is fre instead of coe.
1379*4882a593Smuzhiyun
1380*4882a593SmuzhiyunThe existence of a pb link from E to F implies that E must execute
1381*4882a593Smuzhiyunbefore F.  To see why, suppose that F executed first.  Then W would
1382*4882a593Smuzhiyunhave propagated to E's CPU before E executed.  If E was a store, the
1383*4882a593Smuzhiyunmemory subsystem would then be forced to make E come after W in the
1384*4882a593Smuzhiyuncoherence order, contradicting the fact that E ->coe W.  If E was a
1385*4882a593Smuzhiyunload, the memory subsystem would then be forced to satisfy E's read
1386*4882a593Smuzhiyunrequest with the value stored by W or an even later store,
1387*4882a593Smuzhiyuncontradicting the fact that E ->fre W.
1388*4882a593Smuzhiyun
1389*4882a593SmuzhiyunA good example illustrating how pb works is the SB pattern with strong
1390*4882a593Smuzhiyunfences:
1391*4882a593Smuzhiyun
1392*4882a593Smuzhiyun	int x = 0, y = 0;
1393*4882a593Smuzhiyun
1394*4882a593Smuzhiyun	P0()
1395*4882a593Smuzhiyun	{
1396*4882a593Smuzhiyun		int r0;
1397*4882a593Smuzhiyun
1398*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1399*4882a593Smuzhiyun		smp_mb();
1400*4882a593Smuzhiyun		r0 = READ_ONCE(y);
1401*4882a593Smuzhiyun	}
1402*4882a593Smuzhiyun
1403*4882a593Smuzhiyun	P1()
1404*4882a593Smuzhiyun	{
1405*4882a593Smuzhiyun		int r1;
1406*4882a593Smuzhiyun
1407*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1408*4882a593Smuzhiyun		smp_mb();
1409*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1410*4882a593Smuzhiyun	}
1411*4882a593Smuzhiyun
1412*4882a593SmuzhiyunIf r0 = 0 at the end then there is a pb link from P0's load to P1's
1413*4882a593Smuzhiyunload: an fre link from P0's load to P1's store (which overwrites the
1414*4882a593Smuzhiyunvalue read by P0), and a strong fence between P1's store and its load.
1415*4882a593SmuzhiyunIn this example, the sequences of cumul-fence and hb links are empty.
1416*4882a593SmuzhiyunNote that this pb link is not included in hb as an instance of prop,
1417*4882a593Smuzhiyunbecause it does not start and end on the same CPU.
1418*4882a593Smuzhiyun
1419*4882a593SmuzhiyunSimilarly, if r1 = 0 at the end then there is a pb link from P1's load
1420*4882a593Smuzhiyunto P0's.  This means that if both r0 and r1 were 0 there would be a
1421*4882a593Smuzhiyuncycle in pb, which is not possible since an instruction cannot execute
1422*4882a593Smuzhiyunbefore itself.  Thus, adding smp_mb() fences to the SB pattern
1423*4882a593Smuzhiyunprevents the r0 = 0, r1 = 0 outcome.
1424*4882a593Smuzhiyun
1425*4882a593SmuzhiyunIn summary, the fact that the pb relation links events in the order
1426*4882a593Smuzhiyunthey execute means that it cannot have cycles.  This requirement is
1427*4882a593Smuzhiyunthe content of the LKMM's "propagation" axiom.
1428*4882a593Smuzhiyun
1429*4882a593Smuzhiyun
1430*4882a593SmuzhiyunRCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
1431*4882a593Smuzhiyun------------------------------------------------------------------------
1432*4882a593Smuzhiyun
1433*4882a593SmuzhiyunRCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
1434*4882a593Smuzhiyunrests on two concepts: grace periods and read-side critical sections.
1435*4882a593Smuzhiyun
1436*4882a593SmuzhiyunA grace period is the span of time occupied by a call to
1437*4882a593Smuzhiyunsynchronize_rcu().  A read-side critical section (or just critical
1438*4882a593Smuzhiyunsection, for short) is a region of code delimited by rcu_read_lock()
1439*4882a593Smuzhiyunat the start and rcu_read_unlock() at the end.  Critical sections can
1440*4882a593Smuzhiyunbe nested, although we won't make use of this fact.
1441*4882a593Smuzhiyun
1442*4882a593SmuzhiyunAs far as memory models are concerned, RCU's main feature is its
1443*4882a593SmuzhiyunGrace-Period Guarantee, which states that a critical section can never
1444*4882a593Smuzhiyunspan a full grace period.  In more detail, the Guarantee says:
1445*4882a593Smuzhiyun
1446*4882a593Smuzhiyun	For any critical section C and any grace period G, at least
1447*4882a593Smuzhiyun	one of the following statements must hold:
1448*4882a593Smuzhiyun
1449*4882a593Smuzhiyun(1)	C ends before G does, and in addition, every store that
1450*4882a593Smuzhiyun	propagates to C's CPU before the end of C must propagate to
1451*4882a593Smuzhiyun	every CPU before G ends.
1452*4882a593Smuzhiyun
1453*4882a593Smuzhiyun(2)	G starts before C does, and in addition, every store that
1454*4882a593Smuzhiyun	propagates to G's CPU before the start of G must propagate
1455*4882a593Smuzhiyun	to every CPU before C starts.
1456*4882a593Smuzhiyun
1457*4882a593SmuzhiyunIn particular, it is not possible for a critical section to both start
1458*4882a593Smuzhiyunbefore and end after a grace period.
1459*4882a593Smuzhiyun
1460*4882a593SmuzhiyunHere is a simple example of RCU in action:
1461*4882a593Smuzhiyun
1462*4882a593Smuzhiyun	int x, y;
1463*4882a593Smuzhiyun
1464*4882a593Smuzhiyun	P0()
1465*4882a593Smuzhiyun	{
1466*4882a593Smuzhiyun		rcu_read_lock();
1467*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1468*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1469*4882a593Smuzhiyun		rcu_read_unlock();
1470*4882a593Smuzhiyun	}
1471*4882a593Smuzhiyun
1472*4882a593Smuzhiyun	P1()
1473*4882a593Smuzhiyun	{
1474*4882a593Smuzhiyun		int r1, r2;
1475*4882a593Smuzhiyun
1476*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1477*4882a593Smuzhiyun		synchronize_rcu();
1478*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1479*4882a593Smuzhiyun	}
1480*4882a593Smuzhiyun
1481*4882a593SmuzhiyunThe Grace Period Guarantee tells us that when this code runs, it will
1482*4882a593Smuzhiyunnever end with r1 = 1 and r2 = 0.  The reasoning is as follows.  r1 = 1
1483*4882a593Smuzhiyunmeans that P0's store to x propagated to P1 before P1 called
1484*4882a593Smuzhiyunsynchronize_rcu(), so P0's critical section must have started before
1485*4882a593SmuzhiyunP1's grace period, contrary to part (2) of the Guarantee.  On the
1486*4882a593Smuzhiyunother hand, r2 = 0 means that P0's store to y, which occurs before the
1487*4882a593Smuzhiyunend of the critical section, did not propagate to P1 before the end of
1488*4882a593Smuzhiyunthe grace period, contrary to part (1).  Together the results violate
1489*4882a593Smuzhiyunthe Guarantee.
1490*4882a593Smuzhiyun
1491*4882a593SmuzhiyunIn the kernel's implementations of RCU, the requirements for stores
1492*4882a593Smuzhiyunto propagate to every CPU are fulfilled by placing strong fences at
1493*4882a593Smuzhiyunsuitable places in the RCU-related code.  Thus, if a critical section
1494*4882a593Smuzhiyunstarts before a grace period does then the critical section's CPU will
1495*4882a593Smuzhiyunexecute an smp_mb() fence after the end of the critical section and
1496*4882a593Smuzhiyunsome time before the grace period's synchronize_rcu() call returns.
1497*4882a593SmuzhiyunAnd if a critical section ends after a grace period does then the
1498*4882a593Smuzhiyunsynchronize_rcu() routine will execute an smp_mb() fence at its start
1499*4882a593Smuzhiyunand some time before the critical section's opening rcu_read_lock()
1500*4882a593Smuzhiyunexecutes.
1501*4882a593Smuzhiyun
1502*4882a593SmuzhiyunWhat exactly do we mean by saying that a critical section "starts
1503*4882a593Smuzhiyunbefore" or "ends after" a grace period?  Some aspects of the meaning
1504*4882a593Smuzhiyunare pretty obvious, as in the example above, but the details aren't
1505*4882a593Smuzhiyunentirely clear.  The LKMM formalizes this notion by means of the
1506*4882a593Smuzhiyunrcu-link relation.  rcu-link encompasses a very general notion of
1507*4882a593Smuzhiyun"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
1508*4882a593Smuzhiyunrcu_read_unlock(), or synchronize_rcu()) then among other things,
1509*4882a593SmuzhiyunE ->rcu-link F includes cases where E is po-before some memory-access
1510*4882a593Smuzhiyunevent X, F is po-after some memory-access event Y, and we have any of
1511*4882a593SmuzhiyunX ->rfe Y, X ->co Y, or X ->fr Y.
1512*4882a593Smuzhiyun
1513*4882a593SmuzhiyunThe formal definition of the rcu-link relation is more than a little
1514*4882a593Smuzhiyunobscure, and we won't give it here.  It is closely related to the pb
1515*4882a593Smuzhiyunrelation, and the details don't matter unless you want to comb through
1516*4882a593Smuzhiyuna somewhat lengthy formal proof.  Pretty much all you need to know
1517*4882a593Smuzhiyunabout rcu-link is the information in the preceding paragraph.
1518*4882a593Smuzhiyun
1519*4882a593SmuzhiyunThe LKMM also defines the rcu-gp and rcu-rscsi relations.  They bring
1520*4882a593Smuzhiyungrace periods and read-side critical sections into the picture, in the
1521*4882a593Smuzhiyunfollowing way:
1522*4882a593Smuzhiyun
1523*4882a593Smuzhiyun	E ->rcu-gp F means that E and F are in fact the same event,
1524*4882a593Smuzhiyun	and that event is a synchronize_rcu() fence (i.e., a grace
1525*4882a593Smuzhiyun	period).
1526*4882a593Smuzhiyun
1527*4882a593Smuzhiyun	E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
1528*4882a593Smuzhiyun	and rcu_read_lock() fence events delimiting some read-side
1529*4882a593Smuzhiyun	critical section.  (The 'i' at the end of the name emphasizes
1530*4882a593Smuzhiyun	that this relation is "inverted": It links the end of the
1531*4882a593Smuzhiyun	critical section to the start.)
1532*4882a593Smuzhiyun
1533*4882a593SmuzhiyunIf we think of the rcu-link relation as standing for an extended
1534*4882a593Smuzhiyun"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
1535*4882a593Smuzhiyungrace period which ends before Z begins.  (In fact it covers more than
1536*4882a593Smuzhiyunthis, because it also includes cases where some store propagates to
1537*4882a593SmuzhiyunZ's CPU before Z begins but doesn't propagate to some other CPU until
1538*4882a593Smuzhiyunafter X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
1539*4882a593Smuzhiyunthe end of a critical section which starts before Z begins.
1540*4882a593Smuzhiyun
1541*4882a593SmuzhiyunThe LKMM goes on to define the rcu-order relation as a sequence of
1542*4882a593Smuzhiyunrcu-gp and rcu-rscsi links separated by rcu-link links, in which the
1543*4882a593Smuzhiyunnumber of rcu-gp links is >= the number of rcu-rscsi links.  For
1544*4882a593Smuzhiyunexample:
1545*4882a593Smuzhiyun
1546*4882a593Smuzhiyun	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
1547*4882a593Smuzhiyun
1548*4882a593Smuzhiyunwould imply that X ->rcu-order V, because this sequence contains two
1549*4882a593Smuzhiyunrcu-gp links and one rcu-rscsi link.  (It also implies that
1550*4882a593SmuzhiyunX ->rcu-order T and Z ->rcu-order V.)  On the other hand:
1551*4882a593Smuzhiyun
1552*4882a593Smuzhiyun	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
1553*4882a593Smuzhiyun
1554*4882a593Smuzhiyundoes not imply X ->rcu-order V, because the sequence contains only
1555*4882a593Smuzhiyunone rcu-gp link but two rcu-rscsi links.
1556*4882a593Smuzhiyun
1557*4882a593SmuzhiyunThe rcu-order relation is important because the Grace Period Guarantee
1558*4882a593Smuzhiyunmeans that rcu-order links act kind of like strong fences.  In
1559*4882a593Smuzhiyunparticular, E ->rcu-order F implies not only that E begins before F
1560*4882a593Smuzhiyunends, but also that any write po-before E will propagate to every CPU
1561*4882a593Smuzhiyunbefore any instruction po-after F can execute.  (However, it does not
1562*4882a593Smuzhiyunimply that E must execute before F; in fact, each synchronize_rcu()
1563*4882a593Smuzhiyunfence event is linked to itself by rcu-order as a degenerate case.)
1564*4882a593Smuzhiyun
1565*4882a593SmuzhiyunTo prove this in full generality requires some intellectual effort.
1566*4882a593SmuzhiyunWe'll consider just a very simple case:
1567*4882a593Smuzhiyun
1568*4882a593Smuzhiyun	G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.
1569*4882a593Smuzhiyun
1570*4882a593SmuzhiyunThis formula means that G and W are the same event (a grace period),
1571*4882a593Smuzhiyunand there are events X, Y and a read-side critical section C such that:
1572*4882a593Smuzhiyun
1573*4882a593Smuzhiyun	1. G = W is po-before or equal to X;
1574*4882a593Smuzhiyun
1575*4882a593Smuzhiyun	2. X comes "before" Y in some sense (including rfe, co and fr);
1576*4882a593Smuzhiyun
1577*4882a593Smuzhiyun	3. Y is po-before Z;
1578*4882a593Smuzhiyun
1579*4882a593Smuzhiyun	4. Z is the rcu_read_unlock() event marking the end of C;
1580*4882a593Smuzhiyun
1581*4882a593Smuzhiyun	5. F is the rcu_read_lock() event marking the start of C.
1582*4882a593Smuzhiyun
1583*4882a593SmuzhiyunFrom 1 - 4 we deduce that the grace period G ends before the critical
1584*4882a593Smuzhiyunsection C.  Then part (2) of the Grace Period Guarantee says not only
1585*4882a593Smuzhiyunthat G starts before C does, but also that any write which executes on
1586*4882a593SmuzhiyunG's CPU before G starts must propagate to every CPU before C starts.
1587*4882a593SmuzhiyunIn particular, the write propagates to every CPU before F finishes
1588*4882a593Smuzhiyunexecuting and hence before any instruction po-after F can execute.
1589*4882a593SmuzhiyunThis sort of reasoning can be extended to handle all the situations
1590*4882a593Smuzhiyuncovered by rcu-order.
1591*4882a593Smuzhiyun
1592*4882a593SmuzhiyunThe rcu-fence relation is a simple extension of rcu-order.  While
1593*4882a593Smuzhiyunrcu-order only links certain fence events (calls to synchronize_rcu(),
1594*4882a593Smuzhiyunrcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
1595*4882a593Smuzhiyunthat are separated by an rcu-order link.  This is analogous to the way
1596*4882a593Smuzhiyunthe strong-fence relation links events that are separated by an
1597*4882a593Smuzhiyunsmp_mb() fence event (as mentioned above, rcu-order links act kind of
1598*4882a593Smuzhiyunlike strong fences).  Written symbolically, X ->rcu-fence Y means
1599*4882a593Smuzhiyunthere are fence events E and F such that:
1600*4882a593Smuzhiyun
1601*4882a593Smuzhiyun	X ->po E ->rcu-order F ->po Y.
1602*4882a593Smuzhiyun
1603*4882a593SmuzhiyunFrom the discussion above, we see this implies not only that X
1604*4882a593Smuzhiyunexecutes before Y, but also (if X is a store) that X propagates to
1605*4882a593Smuzhiyunevery CPU before Y executes.  Thus rcu-fence is sort of a
1606*4882a593Smuzhiyun"super-strong" fence: Unlike the original strong fences (smp_mb() and
1607*4882a593Smuzhiyunsynchronize_rcu()), rcu-fence is able to link events on different
1608*4882a593SmuzhiyunCPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
1609*4882a593Smuzhiyunreally a fence at all!)
1610*4882a593Smuzhiyun
1611*4882a593SmuzhiyunFinally, the LKMM defines the RCU-before (rb) relation in terms of
1612*4882a593Smuzhiyunrcu-fence.  This is done in essentially the same way as the pb
1613*4882a593Smuzhiyunrelation was defined in terms of strong-fence.  We will omit the
1614*4882a593Smuzhiyundetails; the end result is that E ->rb F implies E must execute
1615*4882a593Smuzhiyunbefore F, just as E ->pb F does (and for much the same reasons).
1616*4882a593Smuzhiyun
1617*4882a593SmuzhiyunPutting this all together, the LKMM expresses the Grace Period
1618*4882a593SmuzhiyunGuarantee by requiring that the rb relation does not contain a cycle.
1619*4882a593SmuzhiyunEquivalently, this "rcu" axiom requires that there are no events E
1620*4882a593Smuzhiyunand F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
1621*4882a593Smuzhiyunthe axiom requires that there are no cycles consisting of rcu-gp and
1622*4882a593Smuzhiyunrcu-rscsi alternating with rcu-link, where the number of rcu-gp links
1623*4882a593Smuzhiyunis >= the number of rcu-rscsi links.
1624*4882a593Smuzhiyun
1625*4882a593SmuzhiyunJustifying the axiom isn't easy, but it is in fact a valid
1626*4882a593Smuzhiyunformalization of the Grace Period Guarantee.  We won't attempt to go
1627*4882a593Smuzhiyunthrough the detailed argument, but the following analysis gives a
1628*4882a593Smuzhiyuntaste of what is involved.  Suppose both parts of the Guarantee are
1629*4882a593Smuzhiyunviolated: A critical section starts before a grace period, and some
1630*4882a593Smuzhiyunstore propagates to the critical section's CPU before the end of the
1631*4882a593Smuzhiyuncritical section but doesn't propagate to some other CPU until after
1632*4882a593Smuzhiyunthe end of the grace period.
1633*4882a593Smuzhiyun
1634*4882a593SmuzhiyunPutting symbols to these ideas, let L and U be the rcu_read_lock() and
1635*4882a593Smuzhiyunrcu_read_unlock() fence events delimiting the critical section in
1636*4882a593Smuzhiyunquestion, and let S be the synchronize_rcu() fence event for the grace
1637*4882a593Smuzhiyunperiod.  Saying that the critical section starts before S means there
1638*4882a593Smuzhiyunare events Q and R where Q is po-after L (which marks the start of the
1639*4882a593Smuzhiyuncritical section), Q is "before" R in the sense used by the rcu-link
1640*4882a593Smuzhiyunrelation, and R is po-before the grace period S.  Thus we have:
1641*4882a593Smuzhiyun
1642*4882a593Smuzhiyun	L ->rcu-link S.
1643*4882a593Smuzhiyun
1644*4882a593SmuzhiyunLet W be the store mentioned above, let Y come before the end of the
1645*4882a593Smuzhiyuncritical section and witness that W propagates to the critical
1646*4882a593Smuzhiyunsection's CPU by reading from W, and let Z on some arbitrary CPU be a
1647*4882a593Smuzhiyunwitness that W has not propagated to that CPU, where Z happens after
1648*4882a593Smuzhiyunsome event X which is po-after S.  Symbolically, this amounts to:
1649*4882a593Smuzhiyun
1650*4882a593Smuzhiyun	S ->po X ->hb* Z ->fr W ->rf Y ->po U.
1651*4882a593Smuzhiyun
1652*4882a593SmuzhiyunThe fr link from Z to W indicates that W has not propagated to Z's CPU
1653*4882a593Smuzhiyunat the time that Z executes.  From this, it can be shown (see the
1654*4882a593Smuzhiyundiscussion of the rcu-link relation earlier) that S and U are related
1655*4882a593Smuzhiyunby rcu-link:
1656*4882a593Smuzhiyun
1657*4882a593Smuzhiyun	S ->rcu-link U.
1658*4882a593Smuzhiyun
1659*4882a593SmuzhiyunSince S is a grace period we have S ->rcu-gp S, and since L and U are
1660*4882a593Smuzhiyunthe start and end of the critical section C we have U ->rcu-rscsi L.
1661*4882a593SmuzhiyunFrom this we obtain:
1662*4882a593Smuzhiyun
1663*4882a593Smuzhiyun	S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,
1664*4882a593Smuzhiyun
1665*4882a593Smuzhiyuna forbidden cycle.  Thus the "rcu" axiom rules out this violation of
1666*4882a593Smuzhiyunthe Grace Period Guarantee.
1667*4882a593Smuzhiyun
1668*4882a593SmuzhiyunFor something a little more down-to-earth, let's see how the axiom
1669*4882a593Smuzhiyunworks out in practice.  Consider the RCU code example from above, this
1670*4882a593Smuzhiyuntime with statement labels added:
1671*4882a593Smuzhiyun
1672*4882a593Smuzhiyun	int x, y;
1673*4882a593Smuzhiyun
1674*4882a593Smuzhiyun	P0()
1675*4882a593Smuzhiyun	{
1676*4882a593Smuzhiyun		L: rcu_read_lock();
1677*4882a593Smuzhiyun		X: WRITE_ONCE(x, 1);
1678*4882a593Smuzhiyun		Y: WRITE_ONCE(y, 1);
1679*4882a593Smuzhiyun		U: rcu_read_unlock();
1680*4882a593Smuzhiyun	}
1681*4882a593Smuzhiyun
1682*4882a593Smuzhiyun	P1()
1683*4882a593Smuzhiyun	{
1684*4882a593Smuzhiyun		int r1, r2;
1685*4882a593Smuzhiyun
1686*4882a593Smuzhiyun		Z: r1 = READ_ONCE(x);
1687*4882a593Smuzhiyun		S: synchronize_rcu();
1688*4882a593Smuzhiyun		W: r2 = READ_ONCE(y);
1689*4882a593Smuzhiyun	}
1690*4882a593Smuzhiyun
1691*4882a593Smuzhiyun
1692*4882a593SmuzhiyunIf r2 = 0 at the end then P0's store at Y overwrites the value that
1693*4882a593SmuzhiyunP1's load at W reads from, so we have W ->fre Y.  Since S ->po W and
1694*4882a593Smuzhiyunalso Y ->po U, we get S ->rcu-link U.  In addition, S ->rcu-gp S
1695*4882a593Smuzhiyunbecause S is a grace period.
1696*4882a593Smuzhiyun
1697*4882a593SmuzhiyunIf r1 = 1 at the end then P1's load at Z reads from P0's store at X,
1698*4882a593Smuzhiyunso we have X ->rfe Z.  Together with L ->po X and Z ->po S, this
1699*4882a593Smuzhiyunyields L ->rcu-link S.  And since L and U are the start and end of a
1700*4882a593Smuzhiyuncritical section, we have U ->rcu-rscsi L.
1701*4882a593Smuzhiyun
1702*4882a593SmuzhiyunThen U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
1703*4882a593Smuzhiyunforbidden cycle, violating the "rcu" axiom.  Hence the outcome is not
1704*4882a593Smuzhiyunallowed by the LKMM, as we would expect.
1705*4882a593Smuzhiyun
1706*4882a593SmuzhiyunFor contrast, let's see what can happen in a more complicated example:
1707*4882a593Smuzhiyun
1708*4882a593Smuzhiyun	int x, y, z;
1709*4882a593Smuzhiyun
1710*4882a593Smuzhiyun	P0()
1711*4882a593Smuzhiyun	{
1712*4882a593Smuzhiyun		int r0;
1713*4882a593Smuzhiyun
1714*4882a593Smuzhiyun		L0: rcu_read_lock();
1715*4882a593Smuzhiyun		    r0 = READ_ONCE(x);
1716*4882a593Smuzhiyun		    WRITE_ONCE(y, 1);
1717*4882a593Smuzhiyun		U0: rcu_read_unlock();
1718*4882a593Smuzhiyun	}
1719*4882a593Smuzhiyun
1720*4882a593Smuzhiyun	P1()
1721*4882a593Smuzhiyun	{
1722*4882a593Smuzhiyun		int r1;
1723*4882a593Smuzhiyun
1724*4882a593Smuzhiyun		    r1 = READ_ONCE(y);
1725*4882a593Smuzhiyun		S1: synchronize_rcu();
1726*4882a593Smuzhiyun		    WRITE_ONCE(z, 1);
1727*4882a593Smuzhiyun	}
1728*4882a593Smuzhiyun
1729*4882a593Smuzhiyun	P2()
1730*4882a593Smuzhiyun	{
1731*4882a593Smuzhiyun		int r2;
1732*4882a593Smuzhiyun
1733*4882a593Smuzhiyun		L2: rcu_read_lock();
1734*4882a593Smuzhiyun		    r2 = READ_ONCE(z);
1735*4882a593Smuzhiyun		    WRITE_ONCE(x, 1);
1736*4882a593Smuzhiyun		U2: rcu_read_unlock();
1737*4882a593Smuzhiyun	}
1738*4882a593Smuzhiyun
1739*4882a593SmuzhiyunIf r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
1740*4882a593Smuzhiyunthat U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi
1741*4882a593SmuzhiyunL2 ->rcu-link U0.  However this cycle is not forbidden, because the
1742*4882a593Smuzhiyunsequence of relations contains fewer instances of rcu-gp (one) than of
1743*4882a593Smuzhiyunrcu-rscsi (two).  Consequently the outcome is allowed by the LKMM.
1744*4882a593SmuzhiyunThe following instruction timing diagram shows how it might actually
1745*4882a593Smuzhiyunoccur:
1746*4882a593Smuzhiyun
1747*4882a593SmuzhiyunP0			P1			P2
1748*4882a593Smuzhiyun--------------------	--------------------	--------------------
1749*4882a593Smuzhiyunrcu_read_lock()
1750*4882a593SmuzhiyunWRITE_ONCE(y, 1)
1751*4882a593Smuzhiyun			r1 = READ_ONCE(y)
1752*4882a593Smuzhiyun			synchronize_rcu() starts
1753*4882a593Smuzhiyun			.			rcu_read_lock()
1754*4882a593Smuzhiyun			.			WRITE_ONCE(x, 1)
1755*4882a593Smuzhiyunr0 = READ_ONCE(x)	.
1756*4882a593Smuzhiyunrcu_read_unlock()	.
1757*4882a593Smuzhiyun			synchronize_rcu() ends
1758*4882a593Smuzhiyun			WRITE_ONCE(z, 1)
1759*4882a593Smuzhiyun						r2 = READ_ONCE(z)
1760*4882a593Smuzhiyun						rcu_read_unlock()
1761*4882a593Smuzhiyun
1762*4882a593SmuzhiyunThis requires P0 and P2 to execute their loads and stores out of
1763*4882a593Smuzhiyunprogram order, but of course they are allowed to do so.  And as you
1764*4882a593Smuzhiyuncan see, the Grace Period Guarantee is not violated: The critical
1765*4882a593Smuzhiyunsection in P0 both starts before P1's grace period does and ends
1766*4882a593Smuzhiyunbefore it does, and the critical section in P2 both starts after P1's
1767*4882a593Smuzhiyungrace period does and ends after it does.
1768*4882a593Smuzhiyun
1769*4882a593SmuzhiyunAddendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in
1770*4882a593Smuzhiyunaddition to normal RCU.  The ideas involved are much the same as
1771*4882a593Smuzhiyunabove, with new relations srcu-gp and srcu-rscsi added to represent
1772*4882a593SmuzhiyunSRCU grace periods and read-side critical sections.  There is a
1773*4882a593Smuzhiyunrestriction on the srcu-gp and srcu-rscsi links that can appear in an
1774*4882a593Smuzhiyunrcu-order sequence (the srcu-rscsi links must be paired with srcu-gp
1775*4882a593Smuzhiyunlinks having the same SRCU domain with proper nesting); the details
1776*4882a593Smuzhiyunare relatively unimportant.
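
For instance, an SRCU analog of the earlier simple RCU example might
look roughly like this, with both the read-side critical section and
the grace period using a single (illustrative) srcu_struct named ss.
Just as in the RCU version, the r1 = 1, r2 = 0 outcome is forbidden:

	struct srcu_struct ss;
	int x, y;

	P0()
	{
		int idx;

		idx = srcu_read_lock(&ss);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		srcu_read_unlock(&ss, idx);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		synchronize_srcu(&ss);
		r2 = READ_ONCE(y);
	}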
1777*4882a593Smuzhiyun
1778*4882a593Smuzhiyun
1779*4882a593SmuzhiyunLOCKING
1780*4882a593Smuzhiyun-------
1781*4882a593Smuzhiyun
1782*4882a593SmuzhiyunThe LKMM includes locking.  In fact, there is special code for locking
1783*4882a593Smuzhiyunin the formal model, added in order to make tools run faster.
1784*4882a593SmuzhiyunHowever, this special code is intended to be more or less equivalent
1785*4882a593Smuzhiyunto concepts we have already covered.  A spinlock_t variable is treated
1786*4882a593Smuzhiyunthe same as an int, and spin_lock(&s) is treated almost the same as:
1787*4882a593Smuzhiyun
1788*4882a593Smuzhiyun	while (cmpxchg_acquire(&s, 0, 1) != 0)
1789*4882a593Smuzhiyun		cpu_relax();
1790*4882a593Smuzhiyun
1791*4882a593SmuzhiyunThis waits until s is equal to 0 and then atomically sets it to 1,
1792*4882a593Smuzhiyunand the read part of the cmpxchg operation acts as an acquire fence.
1793*4882a593SmuzhiyunAn alternate way to express the same thing would be:
1794*4882a593Smuzhiyun
1795*4882a593Smuzhiyun	r = xchg_acquire(&s, 1);
1796*4882a593Smuzhiyun
1797*4882a593Smuzhiyunalong with a requirement that at the end, r = 0.  Similarly,
1798*4882a593Smuzhiyunspin_trylock(&s) is treated almost the same as:
1799*4882a593Smuzhiyun
1800*4882a593Smuzhiyun	return !cmpxchg_acquire(&s, 0, 1);
1801*4882a593Smuzhiyun
1802*4882a593Smuzhiyunwhich atomically sets s to 1 if it is currently equal to 0 and returns
1803*4882a593Smuzhiyuntrue if it succeeds (the read part of the cmpxchg operation acts as an
1804*4882a593Smuzhiyunacquire fence only if the operation is successful).  spin_unlock(&s)
1805*4882a593Smuzhiyunis treated almost the same as:
1806*4882a593Smuzhiyun
1807*4882a593Smuzhiyun	smp_store_release(&s, 0);
1808*4882a593Smuzhiyun
1809*4882a593SmuzhiyunThe "almost" qualifiers above need some explanation.  In the LKMM, the
1810*4882a593Smuzhiyunstore-release in a spin_unlock() and the load-acquire which forms the
1811*4882a593Smuzhiyunfirst half of the atomic rmw update in a spin_lock() or a successful
1812*4882a593Smuzhiyunspin_trylock() -- we can call these things lock-releases and
1813*4882a593Smuzhiyunlock-acquires -- have two properties beyond those of ordinary releases
1814*4882a593Smuzhiyunand acquires.
1815*4882a593Smuzhiyun
1816*4882a593SmuzhiyunFirst, when a lock-acquire reads from a lock-release, the LKMM
1817*4882a593Smuzhiyunrequires that every instruction po-before the lock-release must
1818*4882a593Smuzhiyunexecute before any instruction po-after the lock-acquire.  This would
1819*4882a593Smuzhiyunnaturally hold if the release and acquire operations were on different
1820*4882a593SmuzhiyunCPUs, but the LKMM says it holds even when they are on the same CPU.
1821*4882a593SmuzhiyunFor example:
1822*4882a593Smuzhiyun
1823*4882a593Smuzhiyun	int x, y;
1824*4882a593Smuzhiyun	spinlock_t s;
1825*4882a593Smuzhiyun
1826*4882a593Smuzhiyun	P0()
1827*4882a593Smuzhiyun	{
1828*4882a593Smuzhiyun		int r1, r2;
1829*4882a593Smuzhiyun
1830*4882a593Smuzhiyun		spin_lock(&s);
1831*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1832*4882a593Smuzhiyun		spin_unlock(&s);
1833*4882a593Smuzhiyun		spin_lock(&s);
1834*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1835*4882a593Smuzhiyun		spin_unlock(&s);
1836*4882a593Smuzhiyun	}
1837*4882a593Smuzhiyun
1838*4882a593Smuzhiyun	P1()
1839*4882a593Smuzhiyun	{
1840*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1841*4882a593Smuzhiyun		smp_wmb();
1842*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1843*4882a593Smuzhiyun	}
1844*4882a593Smuzhiyun
1845*4882a593SmuzhiyunHere the second spin_lock() reads from the first spin_unlock(), and
1846*4882a593Smuzhiyuntherefore the load of x must execute before the load of y.  Thus we
1847*4882a593Smuzhiyuncannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
1848*4882a593SmuzhiyunMP pattern).
1849*4882a593Smuzhiyun
1850*4882a593SmuzhiyunThis requirement does not apply to ordinary release and acquire
1851*4882a593Smuzhiyunfences, only to lock-related operations.  For instance, suppose P0()
1852*4882a593Smuzhiyunin the example had been written as:
1853*4882a593Smuzhiyun
1854*4882a593Smuzhiyun	P0()
1855*4882a593Smuzhiyun	{
1856*4882a593Smuzhiyun		int r1, r2, r3;
1857*4882a593Smuzhiyun
1858*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1859*4882a593Smuzhiyun		smp_store_release(&s, 1);
1860*4882a593Smuzhiyun		r3 = smp_load_acquire(&s);
1861*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1862*4882a593Smuzhiyun	}
1863*4882a593Smuzhiyun
1864*4882a593SmuzhiyunThen the CPU would be allowed to forward the s = 1 value from the
1865*4882a593Smuzhiyunsmp_store_release() to the smp_load_acquire(), executing the
1866*4882a593Smuzhiyuninstructions in the following order:
1867*4882a593Smuzhiyun
1868*4882a593Smuzhiyun		r3 = smp_load_acquire(&s);	// Obtains r3 = 1
1869*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1870*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1871*4882a593Smuzhiyun		smp_store_release(&s, 1);	// Value is forwarded
1872*4882a593Smuzhiyun
1873*4882a593Smuzhiyunand thus it could load y before x, obtaining r2 = 0 and r1 = 1.
1874*4882a593Smuzhiyun
1875*4882a593SmuzhiyunSecond, when a lock-acquire reads from a lock-release, and some other
1876*4882a593Smuzhiyunstores W and W' occur po-before the lock-release and po-after the
1877*4882a593Smuzhiyunlock-acquire respectively, the LKMM requires that W must propagate to
1878*4882a593Smuzhiyuneach CPU before W' does.  For example, consider:
1879*4882a593Smuzhiyun
1880*4882a593Smuzhiyun	int x, y;
1881*4882a593Smuzhiyun	spinlock_t s;
1882*4882a593Smuzhiyun
1883*4882a593Smuzhiyun	P0()
1884*4882a593Smuzhiyun	{
1885*4882a593Smuzhiyun		spin_lock(&s);
1886*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
1887*4882a593Smuzhiyun		spin_unlock(&s);
1888*4882a593Smuzhiyun	}
1889*4882a593Smuzhiyun
1890*4882a593Smuzhiyun	P1()
1891*4882a593Smuzhiyun	{
1892*4882a593Smuzhiyun		int r1;
1893*4882a593Smuzhiyun
1894*4882a593Smuzhiyun		spin_lock(&s);
1895*4882a593Smuzhiyun		r1 = READ_ONCE(x);
1896*4882a593Smuzhiyun		WRITE_ONCE(y, 1);
1897*4882a593Smuzhiyun		spin_unlock(&s);
1898*4882a593Smuzhiyun	}
1899*4882a593Smuzhiyun
1900*4882a593Smuzhiyun	P2()
1901*4882a593Smuzhiyun	{
1902*4882a593Smuzhiyun		int r2, r3;
1903*4882a593Smuzhiyun
1904*4882a593Smuzhiyun		r2 = READ_ONCE(y);
1905*4882a593Smuzhiyun		smp_rmb();
1906*4882a593Smuzhiyun		r3 = READ_ONCE(x);
1907*4882a593Smuzhiyun	}
1908*4882a593Smuzhiyun
1909*4882a593SmuzhiyunIf r1 = 1 at the end then the spin_lock() in P1 must have read from
1910*4882a593Smuzhiyunthe spin_unlock() in P0.  Hence the store to x must propagate to P2
1911*4882a593Smuzhiyunbefore the store to y does, so we cannot have r2 = 1 and r3 = 0.
1912*4882a593Smuzhiyun
1913*4882a593SmuzhiyunThese two special requirements for lock-release and lock-acquire do
1914*4882a593Smuzhiyunnot arise from the operational model.  Nevertheless, kernel developers
1915*4882a593Smuzhiyunhave come to expect and rely on them because they do hold on all
1916*4882a593Smuzhiyunarchitectures supported by the Linux kernel, albeit for various
1917*4882a593Smuzhiyundiffering reasons.
1918*4882a593Smuzhiyun
1919*4882a593Smuzhiyun
1920*4882a593SmuzhiyunPLAIN ACCESSES AND DATA RACES
1921*4882a593Smuzhiyun-----------------------------
1922*4882a593Smuzhiyun
1923*4882a593SmuzhiyunIn the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
1924*4882a593Smuzhiyunsmp_load_acquire(&z), and so on are collectively referred to as
1925*4882a593Smuzhiyun"marked" accesses, because they are all annotated with special
1926*4882a593Smuzhiyunoperations of one kind or another.  Ordinary C-language memory
1927*4882a593Smuzhiyunaccesses such as x or y = 0 are simply called "plain" accesses.
1928*4882a593Smuzhiyun
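For instance (a small sketch in the style of this document's other
examples, using the variables x and y), the following fragment
contains two marked accesses followed by two plain accesses:

	int r;

	r = READ_ONCE(x);	/* marked load  */
	WRITE_ONCE(y, 1);	/* marked store */
	r = x;			/* plain load   */
	y = 0;			/* plain store  */
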
1929*4882a593SmuzhiyunEarly versions of the LKMM had nothing to say about plain accesses.
1930*4882a593SmuzhiyunThe C standard allows compilers to assume that the variables affected
1931*4882a593Smuzhiyunby plain accesses are not concurrently read or written by any other
1932*4882a593Smuzhiyunthreads or CPUs.  This leaves compilers free to implement all manner
1933*4882a593Smuzhiyunof transformations or optimizations of code containing plain accesses,
1934*4882a593Smuzhiyunmaking such code very difficult for a memory model to handle.
1935*4882a593Smuzhiyun
1936*4882a593SmuzhiyunHere is just one example of a possible pitfall:
1937*4882a593Smuzhiyun
1938*4882a593Smuzhiyun	int a = 6;
1939*4882a593Smuzhiyun	int *x = &a;
1940*4882a593Smuzhiyun
1941*4882a593Smuzhiyun	P0()
1942*4882a593Smuzhiyun	{
1943*4882a593Smuzhiyun		int *r1;
1944*4882a593Smuzhiyun		int r2 = 0;
1945*4882a593Smuzhiyun
1946*4882a593Smuzhiyun		r1 = x;
1947*4882a593Smuzhiyun		if (r1 != NULL)
1948*4882a593Smuzhiyun			r2 = READ_ONCE(*r1);
1949*4882a593Smuzhiyun	}
1950*4882a593Smuzhiyun
1951*4882a593Smuzhiyun	P1()
1952*4882a593Smuzhiyun	{
1953*4882a593Smuzhiyun		WRITE_ONCE(x, NULL);
1954*4882a593Smuzhiyun	}
1955*4882a593Smuzhiyun
1956*4882a593SmuzhiyunOn the face of it, one would expect that when this code runs, the only
1957*4882a593Smuzhiyunpossible final values for r2 are 6 and 0, depending on whether or not
1958*4882a593SmuzhiyunP1's store to x propagates to P0 before P0's load from x executes.
1959*4882a593SmuzhiyunBut since P0's load from x is a plain access, the compiler may decide
1960*4882a593Smuzhiyunto carry out the load twice (for the comparison against NULL, then again
1961*4882a593Smuzhiyunfor the READ_ONCE()) and eliminate the temporary variable r1.  The
1962*4882a593Smuzhiyunobject code generated for P0 could therefore end up looking rather
1963*4882a593Smuzhiyunlike this:
1964*4882a593Smuzhiyun
1965*4882a593Smuzhiyun	P0()
1966*4882a593Smuzhiyun	{
1967*4882a593Smuzhiyun		int r2 = 0;
1968*4882a593Smuzhiyun
1969*4882a593Smuzhiyun		if (x != NULL)
1970*4882a593Smuzhiyun			r2 = READ_ONCE(*x);
1971*4882a593Smuzhiyun	}
1972*4882a593Smuzhiyun
1973*4882a593SmuzhiyunAnd now it is obvious that this code runs the risk of dereferencing a
1974*4882a593SmuzhiyunNULL pointer, because P1's store to x might propagate to P0 after the
1975*4882a593Smuzhiyuntest against NULL has been made but before the READ_ONCE() executes.
1976*4882a593SmuzhiyunIf the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
1977*4882a593Smuzhiyunthe compiler would not have performed this optimization and there
1978*4882a593Smuzhiyunwould be no possibility of a NULL-pointer dereference.
1979*4882a593Smuzhiyun
1980*4882a593SmuzhiyunGiven the possibility of transformations like this one, the LKMM
1981*4882a593Smuzhiyundoesn't try to predict all possible outcomes of code containing plain
1982*4882a593Smuzhiyunaccesses.  It is instead content to determine whether the code
1983*4882a593Smuzhiyunviolates the compiler's assumptions, which would render the ultimate
1984*4882a593Smuzhiyunoutcome undefined.
1985*4882a593Smuzhiyun
1986*4882a593SmuzhiyunIn technical terms, the compiler is allowed to assume that when the
1987*4882a593Smuzhiyunprogram executes, there will not be any data races.  A "data race"
1988*4882a593Smuzhiyunoccurs when there are two memory accesses such that:
1989*4882a593Smuzhiyun
1990*4882a593Smuzhiyun1.	they access the same location,
1991*4882a593Smuzhiyun
1992*4882a593Smuzhiyun2.	at least one of them is a store,
1993*4882a593Smuzhiyun
1994*4882a593Smuzhiyun3.	at least one of them is plain,
1995*4882a593Smuzhiyun
1996*4882a593Smuzhiyun4.	they occur on different CPUs (or in different threads on the
1997*4882a593Smuzhiyun	same CPU), and
1998*4882a593Smuzhiyun
1999*4882a593Smuzhiyun5.	they execute concurrently.
2000*4882a593Smuzhiyun
2001*4882a593SmuzhiyunIn the literature, two accesses are said to "conflict" if they satisfy
2002*4882a593Smuzhiyun1 and 2 above.  We'll go a little farther and say that two accesses
2003*4882a593Smuzhiyunare "race candidates" if they satisfy 1 - 4.  Thus, whether or not two
2004*4882a593Smuzhiyunrace candidates actually do race in a given execution depends on
2005*4882a593Smuzhiyunwhether they are concurrent.
2006*4882a593Smuzhiyun
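As a small illustration (a hypothetical fragment, not an analysis of
any particular kernel code), consider:

	int x;

	P0()
	{
		x = 1;			/* plain store */
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);	/* marked load */
	}

The two accesses to x satisfy conditions 1 - 4 above: they access the
same location, one of them is a store, one of them is plain, and they
occur on different CPUs.  They are therefore race candidates; whether
they actually race depends on whether they execute concurrently.
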
2007*4882a593SmuzhiyunThe LKMM tries to determine whether a program contains race candidates
2008*4882a593Smuzhiyunwhich may execute concurrently; if it does then the LKMM says there is
2009*4882a593Smuzhiyuna potential data race and makes no predictions about the program's
2010*4882a593Smuzhiyunoutcome.
2011*4882a593Smuzhiyun
2012*4882a593SmuzhiyunDetermining whether two accesses are race candidates is easy; you can
2013*4882a593Smuzhiyunsee that all the concepts involved in the definition above are already
2014*4882a593Smuzhiyunpart of the memory model.  The hard part is telling whether they may
2015*4882a593Smuzhiyunexecute concurrently.  The LKMM takes a conservative attitude,
2016*4882a593Smuzhiyunassuming that accesses may be concurrent unless it can prove they
2017*4882a593Smuzhiyunare not.
2018*4882a593Smuzhiyun
2019*4882a593SmuzhiyunIf two memory accesses aren't concurrent then one must execute before
2020*4882a593Smuzhiyunthe other.  Therefore the LKMM decides two accesses aren't concurrent
2021*4882a593Smuzhiyunif they can be connected by a sequence of hb, pb, and rb links
2022*4882a593Smuzhiyun(together referred to as xb, for "executes before").  However, there
2023*4882a593Smuzhiyunare two complicating factors.
2024*4882a593Smuzhiyun
2025*4882a593SmuzhiyunIf X is a load and X executes before a store Y, then indeed there is
2026*4882a593Smuzhiyunno danger of X and Y being concurrent.  After all, Y can't have any
2027*4882a593Smuzhiyuneffect on the value obtained by X until the memory subsystem has
2028*4882a593Smuzhiyunpropagated Y from its own CPU to X's CPU, which won't happen until
2029*4882a593Smuzhiyunsome time after Y executes and thus after X executes.  But if X is a
2030*4882a593Smuzhiyunstore, then even if X executes before Y it is still possible that X
2031*4882a593Smuzhiyunwill propagate to Y's CPU just as Y is executing.  In such a case X
2032*4882a593Smuzhiyuncould very well interfere somehow with Y, and we would have to
2033*4882a593Smuzhiyunconsider X and Y to be concurrent.
2034*4882a593Smuzhiyun
2035*4882a593SmuzhiyunTherefore when X is a store, for X and Y to be non-concurrent the LKMM
2036*4882a593Smuzhiyunrequires not only that X must execute before Y but also that X must
2037*4882a593Smuzhiyunpropagate to Y's CPU before Y executes.  (Or vice versa, of course, if
2038*4882a593SmuzhiyunY executes before X -- then Y must propagate to X's CPU before X
2039*4882a593Smuzhiyunexecutes if Y is a store.)  This is expressed by the visibility
2040*4882a593Smuzhiyunrelation (vis), where X ->vis Y is defined to hold if there is an
2041*4882a593Smuzhiyunintermediate event Z such that:
2042*4882a593Smuzhiyun
2043*4882a593Smuzhiyun	X is connected to Z by a possibly empty sequence of
2044*4882a593Smuzhiyun	cumul-fence links followed by an optional rfe link (if none of
2045*4882a593Smuzhiyun	these links are present, X and Z are the same event),
2046*4882a593Smuzhiyun
2047*4882a593Smuzhiyunand either:
2048*4882a593Smuzhiyun
2049*4882a593Smuzhiyun	Z is connected to Y by a strong-fence link followed by a
2050*4882a593Smuzhiyun	possibly empty sequence of xb links,
2051*4882a593Smuzhiyun
2052*4882a593Smuzhiyunor:
2053*4882a593Smuzhiyun
2054*4882a593Smuzhiyun	Z is on the same CPU as Y and is connected to Y by a possibly
2055*4882a593Smuzhiyun	empty sequence of xb links (again, if the sequence is empty it
2056*4882a593Smuzhiyun	means Z and Y are the same event).
2057*4882a593Smuzhiyun
2058*4882a593SmuzhiyunThe motivations behind this definition are straightforward:
2059*4882a593Smuzhiyun
2060*4882a593Smuzhiyun	cumul-fence memory barriers force stores that are po-before
2061*4882a593Smuzhiyun	the barrier to propagate to other CPUs before stores that are
2062*4882a593Smuzhiyun	po-after the barrier.
2063*4882a593Smuzhiyun
2064*4882a593Smuzhiyun	An rfe link from an event W to an event R says that R reads
2065*4882a593Smuzhiyun	from W, which certainly means that W must have propagated to
2066*4882a593Smuzhiyun	R's CPU before R executed.
2067*4882a593Smuzhiyun
2068*4882a593Smuzhiyun	strong-fence memory barriers force stores that are po-before
2069*4882a593Smuzhiyun	the barrier, or that propagate to the barrier's CPU before the
2070*4882a593Smuzhiyun	barrier executes, to propagate to all CPUs before any events
2071*4882a593Smuzhiyun	po-after the barrier can execute.
2072*4882a593Smuzhiyun
2073*4882a593SmuzhiyunTo see how this works out in practice, consider our old friend, the MP
2074*4882a593Smuzhiyunpattern (with fences and statement labels, but without the conditional
2075*4882a593Smuzhiyuntest):
2076*4882a593Smuzhiyun
2077*4882a593Smuzhiyun	int buf = 0, flag = 0;
2078*4882a593Smuzhiyun
2079*4882a593Smuzhiyun	P0()
2080*4882a593Smuzhiyun	{
2081*4882a593Smuzhiyun		X: WRITE_ONCE(buf, 1);
2082*4882a593Smuzhiyun		   smp_wmb();
2083*4882a593Smuzhiyun		W: WRITE_ONCE(flag, 1);
2084*4882a593Smuzhiyun	}
2085*4882a593Smuzhiyun
2086*4882a593Smuzhiyun	P1()
2087*4882a593Smuzhiyun	{
2088*4882a593Smuzhiyun		int r1;
2089*4882a593Smuzhiyun		int r2 = 0;
2090*4882a593Smuzhiyun
2091*4882a593Smuzhiyun		Z: r1 = READ_ONCE(flag);
2092*4882a593Smuzhiyun		   smp_rmb();
2093*4882a593Smuzhiyun		Y: r2 = READ_ONCE(buf);
2094*4882a593Smuzhiyun	}
2095*4882a593Smuzhiyun
2096*4882a593SmuzhiyunThe smp_wmb() memory barrier gives a cumul-fence link from X to W, and
2097*4882a593Smuzhiyunassuming r1 = 1 at the end, there is an rfe link from W to Z.  This
2098*4882a593Smuzhiyunmeans that the store to buf must propagate from P0 to P1 before Z
2099*4882a593Smuzhiyunexecutes.  Next, Z and Y are on the same CPU and the smp_rmb() fence
2100*4882a593Smuzhiyunprovides an xb link from Z to Y (i.e., it forces Z to execute before
2101*4882a593SmuzhiyunY).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
2102*4882a593Smuzhiyunexecutes.
2103*4882a593Smuzhiyun
2104*4882a593SmuzhiyunThe second complicating factor mentioned above arises from the fact
2105*4882a593Smuzhiyunthat when we are considering data races, some of the memory accesses
2106*4882a593Smuzhiyunare plain.  Now, although we have not said so explicitly, up to this
2107*4882a593Smuzhiyunpoint most of the relations defined by the LKMM (ppo, hb, prop,
2108*4882a593Smuzhiyuncumul-fence, pb, and so on -- including vis) apply only to marked
2109*4882a593Smuzhiyunaccesses.
2110*4882a593Smuzhiyun
2111*4882a593SmuzhiyunThere are good reasons for this restriction.  The compiler is not
2112*4882a593Smuzhiyunallowed to apply fancy transformations to marked accesses, and
2113*4882a593Smuzhiyunconsequently each such access in the source code corresponds more or
2114*4882a593Smuzhiyunless directly to a single machine instruction in the object code.  But
2115*4882a593Smuzhiyunplain accesses are a different story; the compiler may combine them,
2116*4882a593Smuzhiyunsplit them up, duplicate them, eliminate them, invent new ones, and
2117*4882a593Smuzhiyunwho knows what else.  Seeing a plain access in the source code tells
2118*4882a593Smuzhiyunyou almost nothing about what machine instructions will end up in the
2119*4882a593Smuzhiyunobject code.
2120*4882a593Smuzhiyun
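For example (a hypothetical sketch, not something the LKMM itself
defines), given plain source code like:

	x = 1;
	x = 2;

the compiler may eliminate the first store entirely, and given:

	r1 = x;
	r2 = x;

it may perform only a single load and reuse the value for both r1 and
r2.
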
2121*4882a593SmuzhiyunFortunately, the compiler isn't completely free; it is subject to some
2122*4882a593Smuzhiyunlimitations.  For one, it is not allowed to introduce a data race into
2123*4882a593Smuzhiyunthe object code if the source code does not already contain a data
2124*4882a593Smuzhiyunrace (if it could, memory models would be useless and no multithreaded
2125*4882a593Smuzhiyuncode would be safe!).  For another, it cannot move a plain access past
2126*4882a593Smuzhiyuna compiler barrier.
2127*4882a593Smuzhiyun
2128*4882a593SmuzhiyunA compiler barrier is a kind of fence, but as the name implies, it
2129*4882a593Smuzhiyunonly affects the compiler; it does not necessarily have any effect on
2130*4882a593Smuzhiyunhow instructions are executed by the CPU.  In Linux kernel source
2131*4882a593Smuzhiyuncode, the barrier() function is a compiler barrier.  It doesn't give
2132*4882a593Smuzhiyunrise directly to any machine instructions in the object code; rather,
2133*4882a593Smuzhiyunit affects how the compiler generates the rest of the object code.
2134*4882a593SmuzhiyunGiven source code like this:
2135*4882a593Smuzhiyun
2136*4882a593Smuzhiyun	... some memory accesses ...
2137*4882a593Smuzhiyun	barrier();
2138*4882a593Smuzhiyun	... some other memory accesses ...
2139*4882a593Smuzhiyun
2140*4882a593Smuzhiyunthe barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end up po-before
2142*4882a593Smuzhiyunany machine instructions corresponding to the second group of accesses
2143*4882a593Smuzhiyun-- even if some of the accesses are plain.  (Of course, the CPU may
2144*4882a593Smuzhiyunthen execute some of those accesses out of program order, but we
2145*4882a593Smuzhiyunalready know how to deal with such issues.)  Without the barrier()
2146*4882a593Smuzhiyunthere would be no such guarantee; the two groups of accesses could be
2147*4882a593Smuzhiyunintermingled or even reversed in the object code.
2148*4882a593Smuzhiyun
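For instance, in a sketch with hypothetical plain variables a and b:

	a = 1;
	barrier();
	b = 1;

the machine instructions implementing the store to a must all be
po-before those implementing the store to b; without the barrier()
the compiler would be free to emit the two stores in either order.
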
2149*4882a593SmuzhiyunThe LKMM doesn't say much about the barrier() function, but it does
2150*4882a593Smuzhiyunrequire that all fences are also compiler barriers.  In addition, it
2151*4882a593Smuzhiyunrequires that the ordering properties of memory barriers such as
2152*4882a593Smuzhiyunsmp_rmb() or smp_store_release() apply to plain accesses as well as to
2153*4882a593Smuzhiyunmarked accesses.
2154*4882a593Smuzhiyun
2155*4882a593SmuzhiyunThis is the key to analyzing data races.  Consider the MP pattern
2156*4882a593Smuzhiyunagain, now using plain accesses for buf:
2157*4882a593Smuzhiyun
2158*4882a593Smuzhiyun	int buf = 0, flag = 0;
2159*4882a593Smuzhiyun
2160*4882a593Smuzhiyun	P0()
2161*4882a593Smuzhiyun	{
2162*4882a593Smuzhiyun		U: buf = 1;
2163*4882a593Smuzhiyun		   smp_wmb();
2164*4882a593Smuzhiyun		X: WRITE_ONCE(flag, 1);
2165*4882a593Smuzhiyun	}
2166*4882a593Smuzhiyun
2167*4882a593Smuzhiyun	P1()
2168*4882a593Smuzhiyun	{
2169*4882a593Smuzhiyun		int r1;
2170*4882a593Smuzhiyun		int r2 = 0;
2171*4882a593Smuzhiyun
2172*4882a593Smuzhiyun		Y: r1 = READ_ONCE(flag);
2173*4882a593Smuzhiyun		   if (r1) {
2174*4882a593Smuzhiyun			   smp_rmb();
2175*4882a593Smuzhiyun			V: r2 = buf;
2176*4882a593Smuzhiyun		   }
2177*4882a593Smuzhiyun	}
2178*4882a593Smuzhiyun
2179*4882a593SmuzhiyunThis program does not contain a data race.  Although the U and V
2180*4882a593Smuzhiyunaccesses are race candidates, the LKMM can prove they are not
2181*4882a593Smuzhiyunconcurrent as follows:
2182*4882a593Smuzhiyun
2183*4882a593Smuzhiyun	The smp_wmb() fence in P0 is both a compiler barrier and a
2184*4882a593Smuzhiyun	cumul-fence.  It guarantees that no matter what hash of
2185*4882a593Smuzhiyun	machine instructions the compiler generates for the plain
2186*4882a593Smuzhiyun	access U, all those instructions will be po-before the fence.
2187*4882a593Smuzhiyun	Consequently U's store to buf, no matter how it is carried out
2188*4882a593Smuzhiyun	at the machine level, must propagate to P1 before X's store to
2189*4882a593Smuzhiyun	flag does.
2190*4882a593Smuzhiyun
2191*4882a593Smuzhiyun	X and Y are both marked accesses.  Hence an rfe link from X to
2192*4882a593Smuzhiyun	Y is a valid indicator that X propagated to P1 before Y
2193*4882a593Smuzhiyun	executed, i.e., X ->vis Y.  (And if there is no rfe link then
2194*4882a593Smuzhiyun	r1 will be 0, so V will not be executed and ipso facto won't
2195*4882a593Smuzhiyun	race with U.)
2196*4882a593Smuzhiyun
2197*4882a593Smuzhiyun	The smp_rmb() fence in P1 is a compiler barrier as well as a
2198*4882a593Smuzhiyun	fence.  It guarantees that all the machine-level instructions
2199*4882a593Smuzhiyun	corresponding to the access V will be po-after the fence, and
2200*4882a593Smuzhiyun	therefore any loads among those instructions will execute
2201*4882a593Smuzhiyun	after the fence does and hence after Y does.
2202*4882a593Smuzhiyun
2203*4882a593SmuzhiyunThus U's store to buf is forced to propagate to P1 before V's load
2204*4882a593Smuzhiyunexecutes (assuming V does execute), ruling out the possibility of a
2205*4882a593Smuzhiyundata race between them.
2206*4882a593Smuzhiyun
2207*4882a593SmuzhiyunThis analysis illustrates how the LKMM deals with plain accesses in
2208*4882a593Smuzhiyungeneral.  Suppose R is a plain load and we want to show that R
2209*4882a593Smuzhiyunexecutes before some marked access E.  We can do this by finding a
2210*4882a593Smuzhiyunmarked access X such that R and X are ordered by a suitable fence and
2211*4882a593SmuzhiyunX ->xb* E.  If E was also a plain access, we would also look for a
2212*4882a593Smuzhiyunmarked access Y such that X ->xb* Y, and Y and E are ordered by a
fence.  We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.  (In the plain-access
MP example above, for instance, U is post-bounded by X and V is
pre-bounded by Y.)
2215*4882a593Smuzhiyun
2216*4882a593SmuzhiyunIn fact, we go one step further: Since R is a read, we say that R is
2217*4882a593Smuzhiyun"r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a load or a store.
2219*4882a593SmuzhiyunThis distinction is needed because some fences affect only loads
2220*4882a593Smuzhiyun(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
2221*4882a593Smuzhiyunthe two types of bounds are the same.  And as a degenerate case, we
2222*4882a593Smuzhiyunsay that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself).
2224*4882a593Smuzhiyun
2225*4882a593SmuzhiyunThe need to distinguish between r- and w-bounding raises yet another
2226*4882a593Smuzhiyunissue.  When the source code contains a plain store, the compiler is
2227*4882a593Smuzhiyunallowed to put plain loads of the same location into the object code.
2228*4882a593SmuzhiyunFor example, given the source code:
2229*4882a593Smuzhiyun
2230*4882a593Smuzhiyun	x = 1;
2231*4882a593Smuzhiyun
2232*4882a593Smuzhiyunthe compiler is theoretically allowed to generate object code that
2233*4882a593Smuzhiyunlooks like:
2234*4882a593Smuzhiyun
2235*4882a593Smuzhiyun	if (x != 1)
2236*4882a593Smuzhiyun		x = 1;
2237*4882a593Smuzhiyun
2238*4882a593Smuzhiyunthereby adding a load (and possibly replacing the store entirely).
2239*4882a593SmuzhiyunFor this reason, whenever the LKMM requires a plain store to be
2240*4882a593Smuzhiyunw-pre-bounded or w-post-bounded by a marked access, it also requires
2241*4882a593Smuzhiyunthe store to be r-pre-bounded or r-post-bounded, so as to handle cases
2242*4882a593Smuzhiyunwhere the compiler adds a load.
2243*4882a593Smuzhiyun
2244*4882a593Smuzhiyun(This may be overly cautious.  We don't know of any examples where a
2245*4882a593Smuzhiyuncompiler has augmented a store with a load in this fashion, and the
2246*4882a593SmuzhiyunLinux kernel developers would probably fight pretty hard to change a
2247*4882a593Smuzhiyuncompiler if it ever did this.  Still, better safe than sorry.)
2248*4882a593Smuzhiyun
Incidentally, the other transformation -- augmenting a plain load by
2250*4882a593Smuzhiyunadding in a store to the same location -- is not allowed.  This is
2251*4882a593Smuzhiyunbecause the compiler cannot know whether any other CPUs might perform
2252*4882a593Smuzhiyuna concurrent load from that location.  Two concurrent loads don't
2253*4882a593Smuzhiyunconstitute a race (they can't interfere with each other), but a store
2254*4882a593Smuzhiyundoes race with a concurrent load.  Thus adding a store might create a
2255*4882a593Smuzhiyundata race where one was not already present in the source code,
2256*4882a593Smuzhiyunsomething the compiler is forbidden to do.  Augmenting a store with a
2257*4882a593Smuzhiyunload, on the other hand, is acceptable because doing so won't create a
2258*4882a593Smuzhiyundata race unless one already existed.
2259*4882a593Smuzhiyun
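In other words (a hypothetical sketch), the compiler must not compile:

	r1 = x;

as though it had been written:

	x = x;		/* invented store -- forbidden */
	r1 = x;

because the invented store to x is plain and therefore could race with
a load of x performed concurrently by another CPU.
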
2260*4882a593SmuzhiyunThe LKMM includes a second way to pre-bound plain accesses, in
2261*4882a593Smuzhiyunaddition to fences: an address dependency from a marked load.  That
2262*4882a593Smuzhiyunis, in the sequence:
2263*4882a593Smuzhiyun
2264*4882a593Smuzhiyun	p = READ_ONCE(ptr);
2265*4882a593Smuzhiyun	r = *p;
2266*4882a593Smuzhiyun
2267*4882a593Smuzhiyunthe LKMM says that the marked load of ptr pre-bounds the plain load of
2268*4882a593Smuzhiyun*p; the marked load must execute before any of the machine
2269*4882a593Smuzhiyuninstructions corresponding to the plain load.  This is a reasonable
2270*4882a593Smuzhiyunstipulation, since after all, the CPU can't perform the load of *p
2271*4882a593Smuzhiyununtil it knows what value p will hold.  Furthermore, without some
2272*4882a593Smuzhiyunassumption like this one, some usages typical of RCU would count as
2273*4882a593Smuzhiyundata races.  For example:
2274*4882a593Smuzhiyun
2275*4882a593Smuzhiyun	int a = 1, b;
2276*4882a593Smuzhiyun	int *ptr = &a;
2277*4882a593Smuzhiyun
2278*4882a593Smuzhiyun	P0()
2279*4882a593Smuzhiyun	{
2280*4882a593Smuzhiyun		b = 2;
2281*4882a593Smuzhiyun		rcu_assign_pointer(ptr, &b);
2282*4882a593Smuzhiyun	}
2283*4882a593Smuzhiyun
2284*4882a593Smuzhiyun	P1()
2285*4882a593Smuzhiyun	{
2286*4882a593Smuzhiyun		int *p;
2287*4882a593Smuzhiyun		int r;
2288*4882a593Smuzhiyun
2289*4882a593Smuzhiyun		rcu_read_lock();
2290*4882a593Smuzhiyun		p = rcu_dereference(ptr);
2291*4882a593Smuzhiyun		r = *p;
2292*4882a593Smuzhiyun		rcu_read_unlock();
2293*4882a593Smuzhiyun	}
2294*4882a593Smuzhiyun
2295*4882a593Smuzhiyun(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
2296*4882a593Smuzhiyunreally do anything, because there aren't any grace periods.  They are
2297*4882a593Smuzhiyunincluded merely for the sake of good form; typically P0 would call
2298*4882a593Smuzhiyunsynchronize_rcu() somewhere after the rcu_assign_pointer().)
2299*4882a593Smuzhiyun
2300*4882a593Smuzhiyunrcu_assign_pointer() performs a store-release, so the plain store to b
2301*4882a593Smuzhiyunis definitely w-post-bounded before the store to ptr, and the two
2302*4882a593Smuzhiyunstores will propagate to P1 in that order.  However, rcu_dereference()
2303*4882a593Smuzhiyunis only equivalent to READ_ONCE().  While it is a marked access, it is
2304*4882a593Smuzhiyunnot a fence or compiler barrier.  Hence the only guarantee we have
2305*4882a593Smuzhiyunthat the load of ptr in P1 is r-pre-bounded before the load of *p
2306*4882a593Smuzhiyun(thus avoiding a race) is the assumption about address dependencies.
2307*4882a593Smuzhiyun
2308*4882a593SmuzhiyunThis is a situation where the compiler can undermine the memory model,
2309*4882a593Smuzhiyunand a certain amount of care is required when programming constructs
2310*4882a593Smuzhiyunlike this one.  In particular, comparisons between the pointer and
2311*4882a593Smuzhiyunother known addresses can cause trouble.  If you have something like:
2312*4882a593Smuzhiyun
2313*4882a593Smuzhiyun	p = rcu_dereference(ptr);
2314*4882a593Smuzhiyun	if (p == &x)
2315*4882a593Smuzhiyun		r = *p;
2316*4882a593Smuzhiyun
2317*4882a593Smuzhiyunthen the compiler just might generate object code resembling:
2318*4882a593Smuzhiyun
2319*4882a593Smuzhiyun	p = rcu_dereference(ptr);
2320*4882a593Smuzhiyun	if (p == &x)
2321*4882a593Smuzhiyun		r = x;
2322*4882a593Smuzhiyun
2323*4882a593Smuzhiyunor even:
2324*4882a593Smuzhiyun
2325*4882a593Smuzhiyun	rtemp = x;
2326*4882a593Smuzhiyun	p = rcu_dereference(ptr);
2327*4882a593Smuzhiyun	if (p == &x)
2328*4882a593Smuzhiyun		r = rtemp;
2329*4882a593Smuzhiyun
2330*4882a593Smuzhiyunwhich would invalidate the memory model's assumption, since the CPU
2331*4882a593Smuzhiyuncould now perform the load of x before the load of ptr (there might be
2332*4882a593Smuzhiyuna control dependency but no address dependency at the machine level).
2333*4882a593Smuzhiyun
2334*4882a593SmuzhiyunFinally, it turns out there is a situation in which a plain write does
2335*4882a593Smuzhiyunnot need to be w-post-bounded: when it is separated from the other
2336*4882a593Smuzhiyunrace-candidate access by a fence.  At first glance this may seem
2337*4882a593Smuzhiyunimpossible.  After all, to be race candidates the two accesses must
2338*4882a593Smuzhiyunbe on different CPUs, and fences don't link events on different CPUs.
2339*4882a593SmuzhiyunWell, normal fences don't -- but rcu-fence can!  Here's an example:
2340*4882a593Smuzhiyun
2341*4882a593Smuzhiyun	int x, y;
2342*4882a593Smuzhiyun
2343*4882a593Smuzhiyun	P0()
2344*4882a593Smuzhiyun	{
2345*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
2346*4882a593Smuzhiyun		synchronize_rcu();
2347*4882a593Smuzhiyun		y = 3;
2348*4882a593Smuzhiyun	}
2349*4882a593Smuzhiyun
2350*4882a593Smuzhiyun	P1()
2351*4882a593Smuzhiyun	{
2352*4882a593Smuzhiyun		rcu_read_lock();
2353*4882a593Smuzhiyun		if (READ_ONCE(x) == 0)
2354*4882a593Smuzhiyun			y = 2;
2355*4882a593Smuzhiyun		rcu_read_unlock();
2356*4882a593Smuzhiyun	}
2357*4882a593Smuzhiyun
2358*4882a593SmuzhiyunDo the plain stores to y race?  Clearly not if P1 reads a non-zero
2359*4882a593Smuzhiyunvalue for x, so let's assume the READ_ONCE(x) does obtain 0.  This
2360*4882a593Smuzhiyunmeans that the read-side critical section in P1 must finish executing
2361*4882a593Smuzhiyunbefore the grace period in P0 does, because RCU's Grace-Period
2362*4882a593SmuzhiyunGuarantee says that otherwise P0's store to x would have propagated to
2363*4882a593SmuzhiyunP1 before the critical section started and so would have been visible
2364*4882a593Smuzhiyunto the READ_ONCE().  (Another way of putting it is that the fre link
2365*4882a593Smuzhiyunfrom the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
2366*4882a593Smuzhiyunbetween those two events.)
2367*4882a593Smuzhiyun
2368*4882a593SmuzhiyunThis means there is an rcu-fence link from P1's "y = 2" store to P0's
2369*4882a593Smuzhiyun"y = 3" store, and consequently the first must propagate from P1 to P0
2370*4882a593Smuzhiyunbefore the second can execute.  Therefore the two stores cannot be
2371*4882a593Smuzhiyunconcurrent and there is no race, even though P1's plain store to y
2372*4882a593Smuzhiyunisn't w-post-bounded by any marked accesses.
2373*4882a593Smuzhiyun
2374*4882a593SmuzhiyunPutting all this material together yields the following picture.  For
2375*4882a593Smuzhiyunrace-candidate stores W and W', where W ->co W', the LKMM says the
2376*4882a593Smuzhiyunstores don't race if W can be linked to W' by a
2377*4882a593Smuzhiyun
2378*4882a593Smuzhiyun	w-post-bounded ; vis ; w-pre-bounded
2379*4882a593Smuzhiyun
2380*4882a593Smuzhiyunsequence.  If W is plain then they also have to be linked by an
2381*4882a593Smuzhiyun
2382*4882a593Smuzhiyun	r-post-bounded ; xb* ; w-pre-bounded
2383*4882a593Smuzhiyun
2384*4882a593Smuzhiyunsequence, and if W' is plain then they also have to be linked by a
2385*4882a593Smuzhiyun
2386*4882a593Smuzhiyun	w-post-bounded ; vis ; r-pre-bounded
2387*4882a593Smuzhiyun
2388*4882a593Smuzhiyunsequence.  For race-candidate load R and store W, the LKMM says the
2389*4882a593Smuzhiyuntwo accesses don't race if R can be linked to W by an
2390*4882a593Smuzhiyun
2391*4882a593Smuzhiyun	r-post-bounded ; xb* ; w-pre-bounded
2392*4882a593Smuzhiyun
2393*4882a593Smuzhiyunsequence or if W can be linked to R by a
2394*4882a593Smuzhiyun
2395*4882a593Smuzhiyun	w-post-bounded ; vis ; r-pre-bounded
2396*4882a593Smuzhiyun
2397*4882a593Smuzhiyunsequence.  For the cases involving a vis link, the LKMM also accepts
2398*4882a593Smuzhiyunsequences in which W is linked to W' or R by a
2399*4882a593Smuzhiyun
2400*4882a593Smuzhiyun	strong-fence ; xb* ; {w and/or r}-pre-bounded
2401*4882a593Smuzhiyun
2402*4882a593Smuzhiyunsequence with no post-bounding, and in every case the LKMM also allows
2403*4882a593Smuzhiyunthe link simply to be a fence with no bounding at all.  If no sequence
2404*4882a593Smuzhiyunof the appropriate sort exists, the LKMM says that the accesses race.
2405*4882a593Smuzhiyun
2406*4882a593SmuzhiyunThere is one more part of the LKMM related to plain accesses (although
2407*4882a593Smuzhiyunnot to data races) we should discuss.  Recall that many relations such
2408*4882a593Smuzhiyunas hb are limited to marked accesses only.  As a result, the
2409*4882a593Smuzhiyunhappens-before, propagates-before, and rcu axioms (which state that
various relations must not contain a cycle) don't apply to plain
2411*4882a593Smuzhiyunaccesses.  Nevertheless, we do want to rule out such cycles, because
2412*4882a593Smuzhiyunthey don't make sense even for plain accesses.
2413*4882a593Smuzhiyun
2414*4882a593SmuzhiyunTo this end, the LKMM imposes three extra restrictions, together
2415*4882a593Smuzhiyuncalled the "plain-coherence" axiom because of their resemblance to the
2416*4882a593Smuzhiyunrules used by the operational model to ensure cache coherence (that
2417*4882a593Smuzhiyunis, the rules governing the memory subsystem's choice of a store to
2418*4882a593Smuzhiyunsatisfy a load request and its determination of where a store will
2419*4882a593Smuzhiyunfall in the coherence order):
2420*4882a593Smuzhiyun
2421*4882a593Smuzhiyun	If R and W are race candidates and it is possible to link R to
2422*4882a593Smuzhiyun	W by one of the xb* sequences listed above, then W ->rfe R is
2423*4882a593Smuzhiyun	not allowed (i.e., a load cannot read from a store that it
2424*4882a593Smuzhiyun	executes before, even if one or both is plain).
2425*4882a593Smuzhiyun
2426*4882a593Smuzhiyun	If W and R are race candidates and it is possible to link W to
2427*4882a593Smuzhiyun	R by one of the vis sequences listed above, then R ->fre W is
2428*4882a593Smuzhiyun	not allowed (i.e., if a store is visible to a load then the
2429*4882a593Smuzhiyun	load must read from that store or one coherence-after it).
2430*4882a593Smuzhiyun
2431*4882a593Smuzhiyun	If W and W' are race candidates and it is possible to link W
2432*4882a593Smuzhiyun	to W' by one of the vis sequences listed above, then W' ->co W
2433*4882a593Smuzhiyun	is not allowed (i.e., if one store is visible to a second then
2434*4882a593Smuzhiyun	the second must come after the first in the coherence order).
2435*4882a593Smuzhiyun
2436*4882a593SmuzhiyunThis is the extent to which the LKMM deals with plain accesses.
2437*4882a593SmuzhiyunPerhaps it could say more (for example, plain accesses might
2438*4882a593Smuzhiyuncontribute to the ppo relation), but at the moment it seems that this
2439*4882a593Smuzhiyunminimal, conservative approach is good enough.
2440*4882a593Smuzhiyun
2441*4882a593Smuzhiyun
2442*4882a593SmuzhiyunODDS AND ENDS
2443*4882a593Smuzhiyun-------------
2444*4882a593Smuzhiyun
2445*4882a593SmuzhiyunThis section covers material that didn't quite fit anywhere in the
2446*4882a593Smuzhiyunearlier sections.
2447*4882a593Smuzhiyun
2448*4882a593SmuzhiyunThe descriptions in this document don't always match the formal
2449*4882a593Smuzhiyunversion of the LKMM exactly.  For example, the actual formal
2450*4882a593Smuzhiyundefinition of the prop relation makes the initial coe or fre part
2451*4882a593Smuzhiyunoptional, and it doesn't require the events linked by the relation to
2452*4882a593Smuzhiyunbe on the same CPU.  These differences are very unimportant; indeed,
2453*4882a593Smuzhiyuninstances where the coe/fre part of prop is missing are of no interest
2454*4882a593Smuzhiyunbecause all the other parts (fences and rfe) are already included in
2455*4882a593Smuzhiyunhb anyway, and where the formal model adds prop into hb, it includes
2456*4882a593Smuzhiyunan explicit requirement that the events being linked are on the same
2457*4882a593SmuzhiyunCPU.
2458*4882a593Smuzhiyun
2459*4882a593SmuzhiyunAnother minor difference has to do with events that are both memory
2460*4882a593Smuzhiyunaccesses and fences, such as those corresponding to smp_load_acquire()
2461*4882a593Smuzhiyuncalls.  In the formal model, these events aren't actually both reads
2462*4882a593Smuzhiyunand fences; rather, they are read events with an annotation marking
2463*4882a593Smuzhiyunthem as acquires.  (Or write events annotated as releases, in the case
2464*4882a593Smuzhiyunsmp_store_release().)  The final effect is the same.
2465*4882a593Smuzhiyun
2466*4882a593SmuzhiyunAlthough we didn't mention it above, the instruction execution
2467*4882a593Smuzhiyunordering provided by the smp_rmb() fence doesn't apply to read events
2468*4882a593Smuzhiyunthat are part of a non-value-returning atomic update.  For instance,
2469*4882a593Smuzhiyungiven:
2470*4882a593Smuzhiyun
2471*4882a593Smuzhiyun	atomic_inc(&x);
2472*4882a593Smuzhiyun	smp_rmb();
2473*4882a593Smuzhiyun	r1 = READ_ONCE(y);
2474*4882a593Smuzhiyun
2475*4882a593Smuzhiyunit is not guaranteed that the load from y will execute after the
2476*4882a593Smuzhiyunupdate to x.  This is because the ARMv8 architecture allows
2477*4882a593Smuzhiyunnon-value-returning atomic operations effectively to be executed off
2478*4882a593Smuzhiyunthe CPU.  Basically, the CPU tells the memory subsystem to increment
2479*4882a593Smuzhiyunx, and then the increment is carried out by the memory hardware with
2480*4882a593Smuzhiyunno further involvement from the CPU.  Since the CPU doesn't ever read
2481*4882a593Smuzhiyunthe value of x, there is nothing for the smp_rmb() fence to act on.
2482*4882a593Smuzhiyun
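If ordering is needed in a situation like this, one possibility (a
sketch based on the description of smp_mb__after_atomic() given
below) is to use that fence in place of the smp_rmb():

	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);

Here the fence does order the atomic update of x against the po-later
load of y.
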
2483*4882a593SmuzhiyunThe LKMM defines a few extra synchronization operations in terms of
2484*4882a593Smuzhiyunthings we have already covered.  In particular, rcu_dereference() is
2485*4882a593Smuzhiyuntreated as READ_ONCE() and rcu_assign_pointer() is treated as
2486*4882a593Smuzhiyunsmp_store_release() -- which is basically how the Linux kernel treats
2487*4882a593Smuzhiyunthem.
2488*4882a593Smuzhiyun
2489*4882a593SmuzhiyunAlthough we said that plain accesses are not linked by the ppo
2490*4882a593Smuzhiyunrelation, they do contribute to it indirectly.  Namely, when there is
2491*4882a593Smuzhiyunan address dependency from a marked load R to a plain store W,
2492*4882a593Smuzhiyunfollowed by smp_wmb() and then a marked store W', the LKMM creates a
2493*4882a593Smuzhiyunppo link from R to W'.  The reasoning behind this is perhaps a little
2494*4882a593Smuzhiyunshaky, but essentially it says there is no way to generate object code
2495*4882a593Smuzhiyunfor this source code in which W' could execute before R.  Just as with
2496*4882a593Smuzhiyunpre-bounding by address dependencies, it is possible for the compiler
2497*4882a593Smuzhiyunto undermine this relation if sufficient care is not taken.
2498*4882a593Smuzhiyun
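As a sketch of the pattern just described (ptr, x, and the stored
values are hypothetical):

	int *p;

	p = READ_ONCE(ptr);	/* R: marked load */
	*p = 1;			/* W: plain store, address-dependent on R */
	smp_wmb();
	WRITE_ONCE(x, 2);	/* W': marked store */

Here the LKMM creates a ppo link from R to W'.
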
2499*4882a593SmuzhiyunThere are a few oddball fences which need special treatment:
2500*4882a593Smuzhiyunsmp_mb__before_atomic(), smp_mb__after_atomic(), and
2501*4882a593Smuzhiyunsmp_mb__after_spinlock().  The LKMM uses fence events with special
2502*4882a593Smuzhiyunannotations for them; they act as strong fences just like smp_mb()
2503*4882a593Smuzhiyunexcept for the sets of events that they order.  Instead of ordering
2504*4882a593Smuzhiyunall po-earlier events against all po-later events, as smp_mb() does,
2505*4882a593Smuzhiyunthey behave as follows:
2506*4882a593Smuzhiyun
2507*4882a593Smuzhiyun	smp_mb__before_atomic() orders all po-earlier events against
2508*4882a593Smuzhiyun	po-later atomic updates and the events following them;
2509*4882a593Smuzhiyun
2510*4882a593Smuzhiyun	smp_mb__after_atomic() orders po-earlier atomic updates and
2511*4882a593Smuzhiyun	the events preceding them against all po-later events;
2512*4882a593Smuzhiyun
	smp_mb__after_spinlock() orders po-earlier lock acquisition
2514*4882a593Smuzhiyun	events and the events preceding them against all po-later
2515*4882a593Smuzhiyun	events.
2516*4882a593Smuzhiyun
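For instance (a small sketch with hypothetical variables), the first
of these fences might be used as follows:

	WRITE_ONCE(x, 1);
	smp_mb__before_atomic();
	atomic_inc(&y);
	r2 = READ_ONCE(z);

The smp_mb__before_atomic() orders the store to x against the atomic
update of y and against the load of z that follows it.
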
2517*4882a593SmuzhiyunInterestingly, RCU and locking each introduce the possibility of
2518*4882a593Smuzhiyundeadlock.  When faced with code sequences such as:
2519*4882a593Smuzhiyun
2520*4882a593Smuzhiyun	spin_lock(&s);
2521*4882a593Smuzhiyun	spin_lock(&s);
2522*4882a593Smuzhiyun	spin_unlock(&s);
2523*4882a593Smuzhiyun	spin_unlock(&s);
2524*4882a593Smuzhiyun
2525*4882a593Smuzhiyunor:
2526*4882a593Smuzhiyun
2527*4882a593Smuzhiyun	rcu_read_lock();
2528*4882a593Smuzhiyun	synchronize_rcu();
2529*4882a593Smuzhiyun	rcu_read_unlock();
2530*4882a593Smuzhiyun
2531*4882a593Smuzhiyunwhat does the LKMM have to say?  Answer: It says there are no allowed
2532*4882a593Smuzhiyunexecutions at all, which makes sense.  But this can also lead to
2533*4882a593Smuzhiyunmisleading results, because if a piece of code has multiple possible
2534*4882a593Smuzhiyunexecutions, some of which deadlock, the model will report only on the
2535*4882a593Smuzhiyunnon-deadlocking executions.  For example:
2536*4882a593Smuzhiyun
2537*4882a593Smuzhiyun	int x, y;
2538*4882a593Smuzhiyun
2539*4882a593Smuzhiyun	P0()
2540*4882a593Smuzhiyun	{
2541*4882a593Smuzhiyun		int r0;
2542*4882a593Smuzhiyun
2543*4882a593Smuzhiyun		WRITE_ONCE(x, 1);
2544*4882a593Smuzhiyun		r0 = READ_ONCE(y);
2545*4882a593Smuzhiyun	}
2546*4882a593Smuzhiyun
2547*4882a593Smuzhiyun	P1()
2548*4882a593Smuzhiyun	{
2549*4882a593Smuzhiyun		rcu_read_lock();
2550*4882a593Smuzhiyun		if (READ_ONCE(x) > 0) {
2551*4882a593Smuzhiyun			WRITE_ONCE(y, 36);
2552*4882a593Smuzhiyun			synchronize_rcu();
2553*4882a593Smuzhiyun		}
2554*4882a593Smuzhiyun		rcu_read_unlock();
2555*4882a593Smuzhiyun	}
2556*4882a593Smuzhiyun
2557*4882a593SmuzhiyunIs it possible to end up with r0 = 36 at the end?  The LKMM will tell
2558*4882a593Smuzhiyunyou it is not, but the model won't mention that this is because P1
2559*4882a593Smuzhiyunwill self-deadlock in the executions where it stores 36 in y.
2560