.. SPDX-License-Identifier: GPL-2.0

================================
Review Checklist for RCU Patches
================================


This document contains a checklist for producing and reviewing patches
that make use of RCU.  Violating any of the rules listed below will
result in the same sorts of problems that leaving out a locking primitive
would cause.  This list is based on experiences reviewing such patches
over a rather long period of time, but improvements are always welcome!

0.	Is RCU being applied to a read-mostly situation?  If the data
	structure is updated more than about 10% of the time, then you
	should strongly consider some other approach, unless detailed
	performance measurements show that RCU is nonetheless the right
	tool for the job.  Yes, RCU does reduce read-side overhead by
	increasing write-side overhead, which is exactly why normal uses
	of RCU will do much more reading than updating.

	Another exception is where performance is not an issue, and RCU
	provides a simpler implementation.  An example of this situation
	is the dynamic NMI code in the Linux 2.6 kernel, at least on
	architectures where NMIs are rare.

	Yet another exception is where the low real-time latency of RCU's
	read-side primitives is critically important.

	One final exception is where RCU readers are used to prevent
	the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
	for lockless updates.  This does result in the mildly
	counter-intuitive situation where rcu_read_lock() and
	rcu_read_unlock() are used to protect updates; however, this
	approach provides the same potential simplifications that garbage
	collectors do.

1.	Does the update code have proper mutual exclusion?

	RCU does allow -readers- to run (almost) naked, but -writers- must
	still use some sort of mutual exclusion, such as:

	a.	locking,
	b.	atomic operations, or
	c.	restricting updates to a single task.

	If you choose #b, be prepared to describe how you have handled
	memory barriers on weakly ordered machines (pretty much all of
	them -- even x86 allows later loads to be reordered to precede
	earlier stores), and be prepared to explain why this added
	complexity is worthwhile.  If you choose #c, be prepared to
	explain how this single task does not become a major bottleneck on
	big multiprocessor machines (for example, if the task is updating
	information relating to itself that other tasks can read, there
	by definition can be no bottleneck).  Note that the definition
	of "large" has changed significantly:  Eight CPUs was "large"
	in the year 2000, but a hundred CPUs was unremarkable in 2017.
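
	For example, a minimal sketch of option (a) might pair a spinlock
	for updaters with the RCU list primitives for readers (the names
	struct foo, foo_list, and foo_lock here are purely illustrative)::

		struct foo {
			struct list_head list;
			int key;
			int data;
		};

		static LIST_HEAD(foo_list);		/* RCU-protected list. */
		static DEFINE_SPINLOCK(foo_lock);	/* Serializes updaters only. */

		/* Updater: option (a), a lock serializing all updates. */
		void foo_add(struct foo *p)
		{
			spin_lock(&foo_lock);
			list_add_rcu(&p->list, &foo_list);
			spin_unlock(&foo_lock);
		}

	Readers traverse foo_list under rcu_read_lock() without acquiring
	foo_lock at all.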

2.	Do the RCU read-side critical sections make proper use of
	rcu_read_lock() and friends?  These primitives are needed
	to prevent grace periods from ending prematurely, which
	could result in data being unceremoniously freed out from
	under your read-side code, which can greatly increase the
	actuarial risk of your kernel.

	As a rough rule of thumb, any dereference of an RCU-protected
	pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
	rcu_read_lock_sched(), or by the appropriate update-side lock.
	Disabling of preemption can serve as rcu_read_lock_sched(), but
	is less readable and prevents lockdep from detecting locking issues.

	Letting RCU-protected pointers "leak" out of an RCU read-side
	critical section is every bit as bad as letting them leak out
	from under a lock, unless, of course, you have arranged some
	other means of protection, such as a lock or a reference count,
	-before- letting them out of the RCU read-side critical section.
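
	For example, a reader might look like the following minimal sketch
	(gp is a hypothetical RCU-protected global pointer; the needed
	value is copied out rather than letting the pointer leak)::

		struct foo {
			int a;
		};

		static struct foo __rcu *gp;

		int foo_get_a(void)
		{
			struct foo *p;
			int ret = -1;

			rcu_read_lock();
			p = rcu_dereference(gp);
			if (p)
				ret = p->a;	/* Copy out; do not let p escape. */
			rcu_read_unlock();
			return ret;		/* p must not be used past this point. */
		}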

3.	Does the update code tolerate concurrent accesses?

	The whole point of RCU is to permit readers to run without
	any locks or atomic operations.  This means that readers will
	be running while updates are in progress.  There are a number
	of ways to handle this concurrency, depending on the situation:

	a.	Use the RCU variants of the list and hlist update
		primitives to add, remove, and replace elements on
		an RCU-protected list.  Alternatively, use the other
		RCU-protected data structures that have been added to
		the Linux kernel.

		This is almost always the best approach.

	b.	Proceed as in (a) above, but also maintain per-element
		locks (that are acquired by both readers and writers)
		that guard per-element state.  Of course, fields that
		the readers refrain from accessing can be guarded by
		some other lock acquired only by updaters, if desired.

		This works quite well, also.

	c.	Make updates appear atomic to readers.  For example,
		pointer updates to properly aligned fields will
		appear atomic, as will individual atomic primitives.
		Sequences of operations performed under a lock will -not-
		appear to be atomic to RCU readers, nor will sequences
		of multiple atomic primitives.

		This can work, but is starting to get a bit tricky.

	d.	Carefully order the updates and the reads so that
		readers see valid data at all phases of the update.
		This is often more difficult than it sounds, especially
		given modern CPUs' tendency to reorder memory references.
		One must usually liberally sprinkle memory barriers
		(smp_wmb(), smp_rmb(), smp_mb()) through the code,
		making it difficult to understand and to test.

		It is usually better to group the changing data into
		a separate structure, so that the change may be made
		to appear atomic by updating a pointer to reference
		a new structure containing updated values.
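
	For example, the following minimal sketch (hypothetical struct foo,
	global pointer gp, and update-side foo_lock) makes a multi-field
	update appear atomic to readers by publishing a new structure with
	a single pointer assignment::

		struct foo {
			int a;
			int b;		/* Readers expect a and b to be consistent. */
			struct rcu_head rcu;
		};

		static struct foo __rcu *gp;
		static DEFINE_SPINLOCK(foo_lock);

		int foo_update(int new_a, int new_b)
		{
			struct foo *newp, *oldp;

			newp = kmalloc(sizeof(*newp), GFP_KERNEL);
			if (!newp)
				return -ENOMEM;
			newp->a = new_a;
			newp->b = new_b;
			spin_lock(&foo_lock);
			oldp = rcu_dereference_protected(gp,
					lockdep_is_held(&foo_lock));
			rcu_assign_pointer(gp, newp);	/* Readers see old or new, never a mix. */
			spin_unlock(&foo_lock);
			if (oldp)
				kfree_rcu(oldp, rcu);	/* Free the old version after a grace period. */
			return 0;
		}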

4.	Weakly ordered CPUs pose special challenges.  Almost all CPUs
	are weakly ordered -- even x86 CPUs allow later loads to be
	reordered to precede earlier stores.  RCU code must take all of
	the following measures to prevent memory-corruption problems:

	a.	Readers must maintain proper ordering of their memory
		accesses.  The rcu_dereference() primitive ensures that
		the CPU picks up the pointer before it picks up the data
		that the pointer points to.  This really is necessary
		on Alpha CPUs.  If you don't believe me, see:

			http://www.openvms.compaq.com/wizard/wiz_2637.html

		The rcu_dereference() primitive is also an excellent
		documentation aid, letting the person reading the
		code know exactly which pointers are protected by RCU.
		Please note that compilers can also reorder code, and
		they are becoming increasingly aggressive about doing
		just that.  The rcu_dereference() primitive therefore also
		prevents destructive compiler optimizations.  However,
		with a bit of devious creativity, it is possible to
		mishandle the return value from rcu_dereference().
		Please see rcu_dereference.txt in this directory for
		more information.

		The rcu_dereference() primitive is used by the
		various "_rcu()" list-traversal primitives, such
		as the list_for_each_entry_rcu().  Note that it is
		perfectly legal (if redundant) for update-side code to
		use rcu_dereference() and the "_rcu()" list-traversal
		primitives.  This is particularly useful in code that
		is common to readers and updaters.  However, lockdep
		will complain if you access rcu_dereference() outside
		of an RCU read-side critical section.  See lockdep.txt
		to learn what to do about this.

		Of course, neither rcu_dereference() nor the "_rcu()"
		list-traversal primitives can substitute for a good
		concurrency design coordinating among multiple updaters.

	b.	If the list macros are being used, the list_add_tail_rcu()
		and list_add_rcu() primitives must be used in order
		to prevent weakly ordered machines from misordering
		structure initialization and pointer planting.
		Similarly, if the hlist macros are being used, the
		hlist_add_head_rcu() primitive is required.

	c.	If the list macros are being used, the list_del_rcu()
		primitive must be used to keep list_del()'s pointer
		poisoning from inflicting toxic effects on concurrent
		readers.  Similarly, if the hlist macros are being used,
		the hlist_del_rcu() primitive is required.

		The list_replace_rcu() and hlist_replace_rcu() primitives
		may be used to replace an old structure with a new one
		in their respective types of RCU-protected lists.

	d.	Rules similar to (4b) and (4c) apply to the "hlist_nulls"
		type of RCU-protected linked lists.

	e.	Updates must ensure that initialization of a given
		structure happens before pointers to that structure are
		publicized.  Use the rcu_assign_pointer() primitive
		when publicizing a pointer to a structure that can
		be traversed by an RCU read-side critical section.
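
	For example (minimal sketch, hypothetical names), every field of
	the new element is initialized before rcu_assign_pointer() makes
	it visible to readers::

		struct foo {
			int a;
			int b;
		};

		static struct foo __rcu *gp;

		int foo_publish(void)
		{
			struct foo *p;

			p = kzalloc(sizeof(*p), GFP_KERNEL);
			if (!p)
				return -ENOMEM;
			p->a = 1;			/* Fully initialize first... */
			p->b = 2;
			rcu_assign_pointer(gp, p);	/* ...then publish the pointer. */
			return 0;
		}

	Assigning to gp with a plain "gp = p" would allow both the compiler
	and weakly ordered CPUs to let the publication be observed before
	the initialization.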

5.	If call_rcu() or call_srcu() is used, the callback function will
	be called from softirq context.  In particular, it cannot block.
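
	For example, a typical callback (sketch; struct foo and its
	rcu_head field named rcu are illustrative) simply maps the
	rcu_head back to its enclosing structure and frees it::

		struct foo {
			int a;
			struct rcu_head rcu;
		};

		static void foo_reclaim(struct rcu_head *rhp)
		{
			struct foo *p = container_of(rhp, struct foo, rcu);

			kfree(p);	/* Fine: kfree() never blocks. */
		}

		/* Updater, after removing p from all reader-visible paths: */
		call_rcu(&p->rcu, foo_reclaim);

	Anything that might sleep, for example a mutex_lock() or a
	GFP_KERNEL allocation, is forbidden in such a callback.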

6.	Since synchronize_rcu() can block, it cannot be called
	from any sort of irq context.  The same rule applies
	for synchronize_srcu(), synchronize_rcu_expedited(), and
	synchronize_srcu_expedited().

	The expedited forms of these primitives have the same semantics
	as the non-expedited forms, but expediting is both expensive and
	(with the exception of synchronize_srcu_expedited()) unfriendly
	to real-time workloads.  Use of the expedited primitives should
	be restricted to rare configuration-change operations that would
	not normally be undertaken while a real-time workload is running.
	However, real-time workloads can use the rcupdate.rcu_normal kernel
	boot parameter to completely disable expedited grace periods,
	though this might have performance implications.

	In particular, if you find yourself invoking one of the expedited
	primitives repeatedly in a loop, please do everyone a favor:
	Restructure your code so that it batches the updates, allowing
	a single non-expedited primitive to cover the entire batch.
	This will very likely be faster than the loop containing the
	expedited primitive, and will be much, much easier on the rest
	of the system, especially on real-time workloads running on
	the rest of the system.
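
	For example, the following minimal sketch (hypothetical names; a
	second, updater-private list_head is assumed in struct foo) removes
	a whole batch of elements and then waits for just one grace period::

		struct foo {
			struct list_head list;		/* Linkage visible to RCU readers. */
			struct list_head doomed;	/* Private linkage for teardown. */
			int key;
		};

		static LIST_HEAD(foo_list);
		static DEFINE_SPINLOCK(foo_lock);

		void foo_remove_all(int key)
		{
			struct foo *p, *n;
			LIST_HEAD(batch);

			spin_lock(&foo_lock);
			list_for_each_entry_safe(p, n, &foo_list, list) {
				if (p->key == key) {
					list_del_rcu(&p->list);
					list_add(&p->doomed, &batch);
				}
			}
			spin_unlock(&foo_lock);

			synchronize_rcu();	/* One grace period covers the whole batch. */

			list_for_each_entry_safe(p, n, &batch, doomed)
				kfree(p);
		}

	Note that the elements are chained onto the private list through a
	separate list_head, because concurrent readers might still be
	traversing the ->list linkage until the grace period ends.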

7.	As of v4.20, a given kernel implements only one RCU flavor,
	which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y.
	If the updater uses call_rcu() or synchronize_rcu(),
	then the corresponding readers may use rcu_read_lock() and
	rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(),
	or any pair of primitives that disables and re-enables preemption,
	for example, rcu_read_lock_sched() and rcu_read_unlock_sched().
	If the updater uses synchronize_srcu() or call_srcu(),
	then the corresponding readers must use srcu_read_lock() and
	srcu_read_unlock(), and with the same srcu_struct.  The rules for
	the expedited primitives are the same as for their non-expedited
	counterparts.  Mixing things up will result in confusion and
	broken kernels, and has even resulted in an exploitable security
	issue.

	One exception to this rule: rcu_read_lock() and rcu_read_unlock()
	may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
	in cases where local bottom halves are already known to be
	disabled, for example, in irq or softirq context.  Commenting
	such cases is a must, of course!  And the jury is still out on
	whether the increased speed is worth it.

8.	Although synchronize_rcu() is slower than is call_rcu(), it
	usually results in simpler code.  So, unless update performance is
	critically important, the updaters cannot block, or the latency of
	synchronize_rcu() is visible from userspace, synchronize_rcu()
	should be used in preference to call_rcu().  Furthermore,
	kfree_rcu() usually results in even simpler code than does
	synchronize_rcu() without synchronize_rcu()'s multi-millisecond
	latency.  So please take advantage of kfree_rcu()'s "fire and
	forget" memory-freeing capabilities where it applies.
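
	For example (sketch, hypothetical names), a removal that would
	otherwise need a dedicated call_rcu() callback or a blocking
	synchronize_rcu() can often be reduced to a single kfree_rcu()::

		struct foo {
			struct list_head list;
			struct rcu_head rcu;	/* Consumed by kfree_rcu(). */
		};

		spin_lock(&foo_lock);
		list_del_rcu(&p->list);
		spin_unlock(&foo_lock);
		kfree_rcu(p, rcu);	/* Fire and forget: no callback to write, no blocking. */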

	An especially important property of the synchronize_rcu()
	primitive is that it automatically self-limits: if grace periods
	are delayed for whatever reason, then the synchronize_rcu()
	primitive will correspondingly delay updates.  In contrast,
	code using call_rcu() should explicitly limit update rate in
	cases where grace periods are delayed, as failing to do so can
	result in excessive realtime latencies or even OOM conditions.

	Ways of gaining this self-limiting property when using call_rcu()
	include:

	a.	Keeping a count of the number of data-structure elements
		used by the RCU-protected data structure, including
		those waiting for a grace period to elapse.  Enforce a
		limit on this number, stalling updates as needed to allow
		previously deferred frees to complete.  Alternatively,
		limit only the number awaiting deferred free rather than
		the total number of elements.

		One way to stall the updates is to acquire the update-side
		mutex.  (Don't try this with a spinlock -- other CPUs
		spinning on the lock could prevent the grace period
		from ever ending.)  Another way to stall the updates
		is for the updates to use a wrapper function around
		the memory allocator, so that this wrapper function
		simulates OOM when there is too much memory awaiting an
		RCU grace period.  There are of course many other
		variations on this theme.

	b.	Limiting update rate.  For example, if updates occur only
		once per hour, then no explicit rate limiting is
		required, unless your system is already badly broken.
		Older versions of the dcache subsystem take this approach,
		guarding updates with a global lock, limiting their rate.

	c.	Trusted update -- if updates can only be done manually by
		superuser or some other trusted user, then it might not
		be necessary to automatically limit them.  The theory
		here is that superuser already has lots of ways to crash
		the machine.

	d.	Periodically invoke synchronize_rcu(), permitting a limited
		number of updates per grace period.

	The same cautions apply to call_srcu() and kfree_rcu().

	Note that although these primitives do take action to avoid memory
	exhaustion when any given CPU has too many callbacks, a determined
	user could still exhaust memory.  This is especially the case
	if a system with a large number of CPUs has been configured to
	offload all of its RCU callbacks onto a single CPU, or if the
	system has relatively little free memory.

9.	All RCU list-traversal primitives, which include
	rcu_dereference(), list_for_each_entry_rcu(), and
	list_for_each_safe_rcu(), must be either within an RCU read-side
	critical section or must be protected by appropriate update-side
	locks.  RCU read-side critical sections are delimited by
	rcu_read_lock() and rcu_read_unlock(), or by similar primitives
	such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
	case the matching rcu_dereference() primitive must be used in
	order to keep lockdep happy; in this case, rcu_dereference_bh().

	The reason that it is permissible to use RCU list-traversal
	primitives when the update-side lock is held is that doing so
	can be quite helpful in reducing code bloat when common code is
	shared between readers and updaters.  Additional primitives
	are provided for this case, as discussed in lockdep.txt.
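
	For example (sketch, hypothetical gp and foo_lock), the
	lockdep-aware variants make the expected protection explicit::

		/* Within rcu_read_lock_bh()/rcu_read_unlock_bh(): */
		p = rcu_dereference_bh(gp);

		/* Common code: either an RCU read-side critical section
		 * or holding foo_lock is sufficient: */
		p = rcu_dereference_check(gp, lockdep_is_held(&foo_lock));

		/* Update-side code: foo_lock is held, so no RCU read-side
		 * critical section is needed: */
		p = rcu_dereference_protected(gp, lockdep_is_held(&foo_lock));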

10.	Conversely, if you are in an RCU read-side critical section,
	and you don't hold the appropriate update-side lock, you -must-
	use the "_rcu()" variants of the list macros.  Failing to do so
	will break Alpha, cause aggressive compilers to generate bad code,
	and confuse people trying to read your code.

11.	Any lock acquired by an RCU callback must be acquired elsewhere
	with softirq disabled, e.g., via spin_lock_irqsave(),
	spin_lock_bh(), etc.  Failing to disable softirq on a given
	acquisition of that lock will result in deadlock as soon as
	the RCU softirq handler happens to run your RCU callback while
	interrupting that acquisition's critical section.
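
	For example (sketch; struct foo, foo_lock, and the elided bodies
	are illustrative), a lock shared with an RCU callback is taken
	with spin_lock_bh() in process context::

		struct foo {
			int a;
			struct rcu_head rcu;
		};

		static DEFINE_SPINLOCK(foo_lock);	/* Also acquired by foo_cb() below. */

		/* Process context: softirq must be disabled while foo_lock is held. */
		void foo_update(void)
		{
			spin_lock_bh(&foo_lock);
			/* ... update state shared with the callback ... */
			spin_unlock_bh(&foo_lock);
		}

		/* RCU callback: runs in softirq context, so plain spin_lock() is fine. */
		static void foo_cb(struct rcu_head *rhp)
		{
			struct foo *p = container_of(rhp, struct foo, rcu);

			spin_lock(&foo_lock);
			/* ... final cleanup of state shared with foo_update() ... */
			spin_unlock(&foo_lock);
			kfree(p);
		}

	Had foo_update() used plain spin_lock(), the RCU softirq handler
	could interrupt it on the same CPU, invoke foo_cb(), and deadlock
	attempting to acquire foo_lock.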

12.	RCU callbacks can be and are executed in parallel.  In many cases,
	the callback code is simply a wrapper around kfree(), so that this
	is not an issue (or, more accurately, to the extent that it is
	an issue, the memory-allocator locking handles it).  However,
	if the callbacks do manipulate a shared data structure, they
	must use whatever locking or other synchronization is required
	to safely access and/or modify that data structure.

	Do not assume that RCU callbacks will be executed on the same
	CPU that executed the corresponding call_rcu() or call_srcu().
	For example, if a given CPU goes offline while having an RCU
	callback pending, then that RCU callback will execute on some
	surviving CPU.  (If this was not the case, a self-spawning RCU
	callback would prevent the victim CPU from ever going offline.)
	Furthermore, CPUs designated by rcu_nocbs= might well -always-
	have their RCU callbacks executed on some other CPUs; in fact,
	for some real-time workloads, this is the whole point of using
	the rcu_nocbs= kernel boot parameter.

13.	Unlike other forms of RCU, it -is- permissible to block in an
	SRCU read-side critical section (demarked by srcu_read_lock()
	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
	Please note that if you don't need to sleep in read-side critical
	sections, you should be using RCU rather than SRCU, because RCU
	is almost always faster and easier to use than is SRCU.

	Also unlike other forms of RCU, explicit initialization and
	cleanup is required either at build time via DEFINE_SRCU()
	or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
	and cleanup_srcu_struct().  These last two are passed a
	"struct srcu_struct" that defines the scope of a given
	SRCU domain.  Once initialized, the srcu_struct is passed
	to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
	synchronize_srcu_expedited(), and call_srcu().  A given
	synchronize_srcu() waits only for SRCU read-side critical
	sections governed by srcu_read_lock() and srcu_read_unlock()
	calls that have been passed the same srcu_struct.  This property
	is what makes sleeping read-side critical sections tolerable --
	a given subsystem delays only its own updates, not those of other
	subsystems using SRCU.  Therefore, SRCU is less prone to OOM the
	system than RCU would be if RCU's read-side critical sections
	were permitted to sleep.

	The ability to sleep in read-side critical sections does not
	come for free.  First, corresponding srcu_read_lock() and
	srcu_read_unlock() calls must be passed the same srcu_struct.
	Second, grace-period-detection overhead is amortized only
	over those updates sharing a given srcu_struct, rather than
	being globally amortized as they are for other forms of RCU.
	Therefore, SRCU should be used in preference to rw_semaphore
	only in extremely read-intensive situations, or in situations
	requiring SRCU's read-side deadlock immunity or low read-side
	realtime latency.  You should also consider percpu_rw_semaphore
	when you need lightweight readers.

	SRCU's expedited primitive (synchronize_srcu_expedited())
	never sends IPIs to other CPUs, so it is easier on
	real-time workloads than is synchronize_rcu_expedited().

	Note that rcu_assign_pointer() relates to SRCU just as it does to
	other forms of RCU, but instead of rcu_dereference() you should
	use srcu_dereference() in order to avoid lockdep splats.
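
	For example (minimal sketch; my_srcu, gp, and foo_lock are
	hypothetical), readers and updaters agree on a single srcu_struct::

		struct foo {
			int a;
		};

		DEFINE_STATIC_SRCU(my_srcu);

		static struct foo __rcu *gp;
		static DEFINE_SPINLOCK(foo_lock);

		/* Reader: may sleep while inside the SRCU critical section. */
		void foo_reader(void)
		{
			struct foo *p;
			int idx;

			idx = srcu_read_lock(&my_srcu);
			p = srcu_dereference(gp, &my_srcu);
			if (p) {
				/* It is legal to sleep here while using p. */
				msleep(10);
			}
			srcu_read_unlock(&my_srcu, idx);
		}

		/* Updater: waits only for readers of my_srcu. */
		void foo_update(struct foo *newp)
		{
			struct foo *oldp;

			spin_lock(&foo_lock);
			oldp = rcu_dereference_protected(gp,
					lockdep_is_held(&foo_lock));
			rcu_assign_pointer(gp, newp);
			spin_unlock(&foo_lock);
			synchronize_srcu(&my_srcu);
			kfree(oldp);
		}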

14.	The whole point of call_rcu(), synchronize_rcu(), and friends
	is to wait until all pre-existing readers have finished before
	carrying out some otherwise-destructive operation.  It is
	therefore critically important to -first- remove any path
	that readers can follow that could be affected by the
	destructive operation, and -only- -then- invoke call_rcu(),
	synchronize_rcu(), or friends.

	Because these primitives only wait for pre-existing readers, it
	is the caller's responsibility to guarantee that any subsequent
	readers will execute safely.
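
	For example, the canonical deletion sequence (sketch, hypothetical
	names) is::

		spin_lock(&foo_lock);
		list_del_rcu(&p->list);		/* 1. Remove all reader-visible paths to p. */
		spin_unlock(&foo_lock);
		synchronize_rcu();		/* 2. Wait for pre-existing readers to finish. */
		kfree(p);			/* 3. Only now is the destructive operation safe. */

	Reversing steps 1 and 2 would be useless, because readers arriving
	after the grace period could still find and dereference p.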

15.	The various RCU read-side primitives do -not- necessarily contain
	memory barriers.  You should therefore plan for the CPU
	and the compiler to freely reorder code into and out of RCU
	read-side critical sections.  It is the responsibility of the
	RCU update-side primitives to deal with this.

	For SRCU readers, you can use smp_mb__after_srcu_read_unlock()
	immediately after an srcu_read_unlock() to get a full barrier.

16.	Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
	__rcu sparse checks to validate your RCU code.  These can help
	find problems as follows:

	CONFIG_PROVE_LOCKING:
		check that accesses to RCU-protected data
		structures are carried out under the proper RCU
		read-side critical section, while holding the right
		combination of locks, or whatever other conditions
		are appropriate.

	CONFIG_DEBUG_OBJECTS_RCU_HEAD:
		check that you don't pass the
		same object to call_rcu() (or friends) before an RCU
		grace period has elapsed since the last time that you
		passed that same object to call_rcu() (or friends).

	__rcu sparse checks:
		tag the pointer to the RCU-protected data
		structure with __rcu, and sparse will warn you if you
		access that pointer without the services of one of the
		variants of rcu_dereference().

	These debugging aids can help you find problems that are
	otherwise extremely difficult to spot.
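
	For example (sketch), the __rcu annotation looks like this, and
	sparse (make C=1) will flag any plain-C access to the annotated
	pointer::

		static struct foo __rcu *gp;

		p = gp;				/* sparse warning: needs rcu_dereference(). */
		p = rcu_dereference(gp);	/* OK inside an RCU read-side critical section. */
		rcu_assign_pointer(gp, newp);	/* OK for updaters. */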

17.	If you register a callback using call_rcu() or call_srcu(), and
	pass in a function defined within a loadable module, then it is
	necessary to wait for all pending callbacks to be invoked after
	the last invocation and before unloading that module.  Note that
	it is absolutely -not- sufficient to wait for a grace period!
	The current (say) synchronize_rcu() implementation is -not-
	guaranteed to wait for callbacks registered on other CPUs, or
	even on the current CPU if that CPU recently went offline and
	came back online.

	You instead need to use one of the barrier functions:

	-	call_rcu() -> rcu_barrier()
	-	call_srcu() -> srcu_barrier()

	However, these barrier functions are absolutely -not- guaranteed
	to wait for a grace period.  In fact, if there are no call_rcu()
	callbacks waiting anywhere in the system, rcu_barrier() is within
	its rights to return immediately.

	So if you need to wait for both an RCU grace period and for
	all pre-existing call_rcu() callbacks, you will need to execute
	both rcu_barrier() and synchronize_rcu(), if necessary, using
	something like workqueues to execute them concurrently.
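
	For example, a module exit handler might do something like the
	following minimal sketch (the foo_ helper names are illustrative)::

		static void __exit foo_exit(void)
		{
			foo_unregister();	/* Ensure no further call_rcu() invocations. */
			rcu_barrier();		/* Wait for already-queued callbacks to finish. */
			/* Only now is it safe for the module text to disappear. */
		}
		module_exit(foo_exit);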

	See rcubarrier.txt for more information.