This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex.  After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge.  Furthermore, the Linux-kernel memory
model (LKMM) is quite complex, with subtle differences in code often
having dramatic effects on correctness.

The options near the beginning of this list are quite simple.  The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms.  For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly.  In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context, as the kernel can enter the same code path on
multiple CPUs at the same time.  One important exception is a function
that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables.  A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock.  Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".

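For example, a minimal code-locking sketch might look like the
following, in which the lock name and the do_something_with() helper
are purely illustrative:

	static DEFINE_MUTEX(foo_mutex);

	int foo_process(struct foo *fp)
	{
		int ret;

		/* Everything from lock to unlock executes single-threaded. */
		mutex_lock(&foo_mutex);
		ret = do_something_with(fp);	/* Hypothetical helper. */
		mutex_unlock(&foo_mutex);
		return ret;
	}

All other accesses to the data that foo_process() touches must also be
carried out under foo_mutex, otherwise the single-threaded illusion
evaporates.
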
Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own.  For one thing, the standard primitives interact
properly with lockdep.  For another thing, these primitives have been
tuned to deal better with high contention.  And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions.  After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation.  The reason RCU is able to scale despite
single-threaded grace-period processing is use of batching, where all
updates that accumulated during one grace period are handled by the
next one.  In other words, slowing down grace-period processing makes
it more efficient.  Nor is RCU unique:  Similar batching optimizations
are used in many I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"?  Quite a bit.  It includes the
contents of the lib/ directory, much of the include/linux/ directory,
and a lot of other heavily used APIs.  Particularly heavily used
examples include the list macros (for example, include/linux/{,rcu}list.h),
workqueues, smp_call_function(), and the various hash tables and
search trees.

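For example, the workqueue API lets you hand work off to kernel-managed
threads without writing a single barrier yourself; the workqueue core
supplies all the ordering needed for the handoff.  A minimal sketch,
with illustrative names:

	#include <linux/workqueue.h>

	static void foo_work_handler(struct work_struct *work)
	{
		/* Runs later in process context; the workqueue core
		 * orders the handoff of "work" for you. */
		pr_info("foo work ran\n");
	}

	static DECLARE_WORK(foo_work, foo_work_handler);

	/* From code that must not block: */
	schedule_work(&foo_work);
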

Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing.  However,
we can also achieve this by instead associating the lock with specific
instances of the data structures.  This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently.  This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.

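A minimal per-bucket-locking sketch might look like the following,
where FOO_HASH_BITS and the fields of the foo structure are purely
illustrative:

	#include <linux/hash.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	#define FOO_HASH_BITS 10

	struct foo {
		u32 key;
		struct hlist_node hash_node;
		/* ... payload ... */
	};

	struct foo_bucket {
		spinlock_t lock;
		struct hlist_head chain;
	};

	static struct foo_bucket foo_hash[1 << FOO_HASH_BITS];

	/* Each bucket's lock protects only that bucket's chain. */
	void foo_hash_add(struct foo *fp)
	{
		struct foo_bucket *b = &foo_hash[hash_32(fp->key, FOO_HASH_BITS)];

		spin_lock(&b->lock);
		hlist_add_head(&fp->hash_node, &b->chain);
		spin_unlock(&b->lock);
	}

Operations on different buckets acquire different locks, and thus
proceed in parallel.
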
As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.


Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability.  Of course, there is no free lunch:  The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view.  This is the approach taken by
the percpu_counter infrastructure.  In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.

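For example, percpu_counter keeps the update-side work CPU-local and
uses internal locking only when folding the per-CPU contributions into
a global view.  A minimal sketch, with the counter name foo_events
chosen for illustration:

	#include <linux/percpu_counter.h>

	static struct percpu_counter foo_events;

	/* Once, at initialization time: */
	percpu_counter_init(&foo_events, 0, GFP_KERNEL);

	/* Hot path: cheap, almost always CPU-local. */
	percpu_counter_inc(&foo_events);

	/* Slow path: fold all per-CPU contributions into a global value. */
	s64 total = percpu_counter_sum(&foo_events);
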
RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets.  For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.


Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API.  One of these APIs is sequence locking.
Although this API can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code".  Yes, you can do writes from within sequence-locking
readers, but it won't be so simple.  For example, such writes will be
lockless and should be idempotent.

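Followed strictly, that rule leads to readers that only copy and retry,
as in this minimal sketch (the foo structure and the function names are
illustrative):

	#include <linux/seqlock.h>

	static DEFINE_SEQLOCK(foo_seqlock);
	static struct foo foo_data;

	/* Reader: no writes to shared state, just retry on conflict. */
	struct foo foo_read(void)
	{
		struct foo snapshot;
		unsigned int seq;

		do {
			seq = read_seqbegin(&foo_seqlock);
			snapshot = foo_data;
		} while (read_seqretry(&foo_seqlock, seq));
		return snapshot;
	}

	/* Writer: serialized against other writers; readers retry. */
	void foo_write(const struct foo *new)
	{
		write_seqlock(&foo_seqlock);
		foo_data = *new;
		write_sequnlock(&foo_seqlock);
	}
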
For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives.  (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.


Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU.  The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

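In their simplest form, these rules look something like the following
sketch, which assumes a struct foo with an int field "a" and omits
error handling:

	static struct foo __rcu *global_foo;
	static DEFINE_SPINLOCK(foo_lock);

	/* Reader: no writes, no locks, just a momentary snapshot. */
	int foo_get(void)
	{
		struct foo *p;
		int ret;

		rcu_read_lock();
		p = rcu_dereference(global_foo);
		ret = p->a;
		rcu_read_unlock();
		return ret;
	}

	/* Updater: publishes a new structure rather than modifying
	 * anything that readers might still be accessing. */
	void foo_set(int a)
	{
		struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
		struct foo *oldp;

		spin_lock(&foo_lock);	/* Protect updates with locking. */
		newp->a = a;
		oldp = rcu_dereference_protected(global_foo,
						 lockdep_is_held(&foo_lock));
		rcu_assign_pointer(global_foo, newp);
		spin_unlock(&foo_lock);
		synchronize_rcu();	/* Wait for pre-existing readers. */
		kfree(oldp);
	}
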
These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst.  Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.


Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1.	Initialization and read-out, such as atomic_set() and atomic_read().

2.	Operations that did not return a value and provided no ordering,
	such as atomic_inc() and atomic_dec().

3.	Operations that returned a value and provided full ordering, such as
	atomic_add_return() and atomic_dec_and_test().  Note that some
	value-returning operations provide full ordering only conditionally.
	For example, cmpxchg() provides ordering only upon success.

More recent kernels have operations that return a value but do not
provide full ordering.  These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).

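The following sketch shows one operation of each kind; the foo_refs
counter and the foo_free() function are invented for illustration:

	static atomic_t foo_refs = ATOMIC_INIT(1);

	/* Initialization and read-out: no ordering. */
	atomic_set(&foo_refs, 1);
	pr_info("refs = %d\n", atomic_read(&foo_refs));

	/* No return value: no ordering. */
	atomic_inc(&foo_refs);

	/* Returns a value: fully ordered. */
	if (atomic_dec_and_test(&foo_refs))
		foo_free();

	/* Returns a value, but _relaxed: no ordering. */
	(void)atomic_inc_return_relaxed(&foo_refs);
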
Additional information may be found in these files:

Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/atomic_ops.rst
Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.


Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses.  Otherwise, use only fully ordered operations when accessing
or modifying the variable.  This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same
variable.

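For example, value-returning atomics such as atomic_xchg() are fully
ordered, so a flag-handoff sketch like the following (with invented
names) needs no additional barriers:

	static atomic_t foo_flag = ATOMIC_INIT(0);
	static int foo_payload;

	/* The fully ordered xchg orders the payload store before it. */
	void foo_publish(int v)
	{
		foo_payload = v;
		atomic_xchg(&foo_flag, 1);
	}

	/* The fully ordered xchg orders the payload load after it. */
	int foo_consume(void)
	{
		if (atomic_xchg(&foo_flag, 0))
			return foo_payload;
		return -1;	/* Nothing published. */
	}
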
Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.


Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases.  These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters.  Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.

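A statistics counter of this kind might look like the following sketch,
in which losing an occasional update is acceptable by design and the
names are invented:

	static unsigned long foo_stat_count;

	/* Updater: no ordering, and concurrent updates may be lost. */
	static inline void foo_stat_inc(void)
	{
		WRITE_ONCE(foo_stat_count, READ_ONCE(foo_stat_count) + 1);
	}

	/* Reader: sees some reasonably recent value, nothing more. */
	static inline unsigned long foo_stat_read(void)
	{
		return READ_ONCE(foo_stat_count);
	}
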
But be careful.  "Unordered" really does mean "unordered".  It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.


Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables.  Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.

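For example, given a plain C-language polling loop, the compiler is
within its rights to load the flag once and spin forever; READ_ONCE()
forbids that transformation.  A minimal sketch:

	static int foo_done;	/* Set by some other thread. */

	/* BUGGY: the compiler may hoist the load out of the loop,
	 * reducing this to "if (!foo_done) for (;;) cpu_relax();". */
	while (!foo_done)
		cpu_relax();

	/* Better: READ_ONCE() forces a fresh load on each iteration. */
	while (!READ_ONCE(foo_done))
		cpu_relax();
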
Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler.  Here are some
introductory references and some tooling to start you on this noble quest:

Who's afraid of a big bad optimizing compiler?
	https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
	https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
	https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
	https://lwn.net/Articles/816854/


More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.