This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex.  After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge.  Furthermore, the Linux-kernel memory
model (LKMM) is quite complex, with subtle differences in code often
having dramatic effects on correctness.

The options near the beginning of this list are quite simple.  The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms.  For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly.  In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context as the kernel can enter the same code path on
multiple CPUs at the same time.  One important exception is a function
that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables.  A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock.  Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".
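
As a minimal sketch of code locking, a hypothetical "foo" subsystem
(the foo_ names are illustrative only, not a real kernel API) might
funnel all of its processing through a single global mutex:

	#include <linux/mutex.h>

	/* Minimal sketch; foo_ names are illustrative only. */
	static DEFINE_MUTEX(foo_lock);	/* The single global lock. */

	/*
	 * Because every access to foo's shared data is carried out
	 * while holding foo_lock, the body of this function executes
	 * as though it were single-threaded.
	 */
	void foo_process(void)
	{
		mutex_lock(&foo_lock);
		/* ... all references to foo's shared data go here ... */
		mutex_unlock(&foo_lock);
	}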

Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own.  For one thing, the standard primitives interact
properly with lockdep.  For another thing, these primitives have been
tuned to deal better with high contention.  And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions.  After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation.  The reason RCU is able to scale despite
single-threaded grace-period processing is its use of batching, where
all updates that accumulated during one grace period are handled by
the next one.  In other words, slowing down grace-period processing
makes it more efficient.  Nor is RCU unique:  Similar batching
optimizations are used in many I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"?  Quite a bit.  It includes the
contents of the lib/ directory and much of the include/linux/ directory,
along with a great many other APIs.  Heavily used examples include
the list macros (for example, include/linux/{,rcu}list.h), workqueues,
smp_call_function(), and the various hash tables and search trees.


Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing.  However,
we can also achieve this by instead associating the lock with specific
instances of the data structures.  This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently.  This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.
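
For example, a minimal per-bucket-locking sketch might look as follows
(the foo_ names, the bucket count, and the hash parameter are all
illustrative; each bucket's lock would be initialized with
spin_lock_init() at setup time):

	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/types.h>

	/* Minimal sketch; foo_ names are illustrative only. */
	#define FOO_NBUCKETS 64

	struct foo_bucket {
		spinlock_t lock;	/* Protects only this bucket's list. */
		struct hlist_head head;
	};

	struct foo_item {
		struct hlist_node node;
		/* ... payload ... */
	};

	static struct foo_bucket foo_table[FOO_NBUCKETS];

	void foo_insert(struct foo_item *item, u32 hash)
	{
		struct foo_bucket *b = &foo_table[hash % FOO_NBUCKETS];

		/* Operations on the other buckets proceed concurrently. */
		spin_lock(&b->lock);
		hlist_add_head(&item->node, &b->head);
		spin_unlock(&b->lock);
	}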

As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.


Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability.  Of course, there is no free lunch:  The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view.  This is the approach taken by
the percpu_counter infrastructure.  In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.

RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets.  For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.
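
As an illustration, here is a minimal sketch of the percpu_counter
approach, in which the common-case update touches only the current
CPU's state and the more expensive global view is computed only on
demand (the foo_ names are illustrative only):

	#include <linux/percpu_counter.h>

	/* Minimal sketch; foo_ names are illustrative only. */
	static struct percpu_counter foo_count;

	int foo_init(void)
	{
		/* Called once at initialization time. */
		return percpu_counter_init(&foo_count, 0, GFP_KERNEL);
	}

	void foo_event(void)
	{
		percpu_counter_inc(&foo_count);	/* Cheap per-CPU update. */
	}

	s64 foo_total(void)
	{
		/* Expensive: acquires a lock and sums the per-CPU counts. */
		return percpu_counter_sum(&foo_count);
	}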


Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API.  One of these APIs is sequence locking.
Although this API can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code".  Yes, you can do writes from within sequence-locking
readers, but it won't be so simple.  For example, such writes will be
lockless and should be idempotent.
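
As an illustration of this rule, here is a minimal sketch that keeps
all writes on the update side, so that readers only retry (the foo_
names are illustrative only):

	#include <linux/seqlock.h>
	#include <linux/types.h>

	/* Minimal sketch; foo_ names are illustrative only. */
	static DEFINE_SEQLOCK(foo_seqlock);
	static u64 foo_a, foo_b;	/* Protected by foo_seqlock. */

	/* Update side: writes are protected by the sequence lock. */
	void foo_set(u64 a, u64 b)
	{
		write_seqlock(&foo_seqlock);
		foo_a = a;
		foo_b = b;
		write_sequnlock(&foo_seqlock);
	}

	/* Read side: no writes, just retry if an update intervened. */
	u64 foo_sum(void)
	{
		unsigned int seq;
		u64 sum;

		do {
			seq = read_seqbegin(&foo_seqlock);
			sum = foo_a + foo_b;
		} while (read_seqretry(&foo_seqlock, seq));

		return sum;
	}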

For more sophisticated use cases, LKMM can guide you, including use
cases that combine sequence locking with other synchronization
primitives.  (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.


Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU.  The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst.  Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.


Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1.	Initialization and read-out, such as atomic_set() and atomic_read().

2.	Operations that did not return a value and provided no ordering,
	such as atomic_inc() and atomic_dec().

3.	Operations that returned a value and provided full ordering, such as
	atomic_add_return() and atomic_dec_and_test().  Note that some
	value-returning operations provide full ordering only conditionally.
	For example, cmpxchg() provides ordering only upon success.

More recent kernels have operations that return a value but do not
provide full ordering.  These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).

Additional information may be found in these files:

Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/atomic_ops.rst
Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.


Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses.  Otherwise, use only fully ordered operations when accessing
or modifying the variable.  This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same
variable.

Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.


Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases.  These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters.  Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.
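
For example, a minimal statistics sketch might increment a per-CPU
counter with no ordering whatsoever, tolerating both lost updates and
stale sums because the result is used only statistically (the foo_
names are illustrative only):

	#include <linux/cpumask.h>
	#include <linux/percpu.h>

	/* Minimal sketch; foo_ names are illustrative only. */
	static DEFINE_PER_CPU(unsigned long, foo_stats);

	void foo_count_event(void)
	{
		this_cpu_inc(foo_stats);	/* No ordering implied. */
	}

	unsigned long foo_stats_sum(void)
	{
		unsigned long sum = 0;
		int cpu;

		/*
		 * The sum may be stale before it is even returned,
		 * which is fine for statistics.
		 */
		for_each_possible_cpu(cpu)
			sum += READ_ONCE(per_cpu(foo_stats, cpu));
		return sum;
	}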

But be careful.  "Unordered" really does mean "unordered".  It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.


Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables.  Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.

Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler.  Here are some
introductory references and some tooling to start you on this noble quest:

Who's afraid of a big bad optimizing compiler?
	https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
	https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
	https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
	https://lwn.net/Articles/816854/


More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.