.. _whatisrcu_doc:

What is RCU?  --  "Read, Copy, Update"
======================================

Please note that the "What is RCU?" LWN series is an excellent place
to start learning about RCU:

| 1.	What is RCU, Fundamentally?  http://lwn.net/Articles/262464/
| 2.	What is RCU? Part 2: Usage   http://lwn.net/Articles/263130/
| 3.	RCU part 3: the RCU API      http://lwn.net/Articles/264090/
| 4.	The RCU API, 2010 Edition    http://lwn.net/Articles/418853/
|	2010 Big API Table           http://lwn.net/Articles/419086/
| 5.	The RCU API, 2014 Edition    http://lwn.net/Articles/609904/
|	2014 Big API Table           http://lwn.net/Articles/609973/


What is RCU?

RCU is a synchronization mechanism, added to the Linux kernel during
the 2.5 development effort, that is optimized for read-mostly
situations.  Although RCU is actually quite simple once you understand it,
getting there can sometimes be a challenge.  Part of the problem is that
most of the past descriptions of RCU have been written with the mistaken
assumption that there is "one true way" to describe RCU.  Instead,
the experience has been that different people must take different paths
to arrive at an understanding of RCU.  This document provides several
different paths, as follows:

:ref:`1.	RCU OVERVIEW <1_whatisRCU>`

:ref:`2.	WHAT IS RCU'S CORE API? <2_whatisRCU>`

:ref:`3.	WHAT ARE SOME EXAMPLE USES OF CORE RCU API? <3_whatisRCU>`

:ref:`4.	WHAT IF MY UPDATING THREAD CANNOT BLOCK? <4_whatisRCU>`

:ref:`5.	WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? <5_whatisRCU>`

:ref:`6.	ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>`

:ref:`7.	FULL LIST OF RCU APIs <7_whatisRCU>`

:ref:`8.	ANSWERS TO QUICK QUIZZES <8_whatisRCU>`

People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
some point.  People who prefer to start with an API that they can then
experiment with should focus on Section 2.  People who prefer to start
with example uses should focus on Sections 3 and 4.  People who need to
understand the RCU implementation should focus on Section 5, then dive
into the kernel source code.  People who reason best by analogy should
focus on Section 6.  Section 7 serves as an index to the docbook API
documentation, and Section 8 is the traditional answer key.

So, start with the section that makes the most sense to you and your
preferred method of learning.  If you need to know everything about
everything, feel free to read the whole thing -- but if you are really
that type of person, you have perused the source code and will therefore
never need this document anyway.  ;-)

.. _1_whatisRCU:

1.  RCU OVERVIEW
----------------

The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases.  The removal phase removes references to data items
within a data structure (possibly by replacing them with references to
new versions of these data items), and can run concurrently with readers.
The reason that it is safe to run the removal phase concurrently with
readers is that the semantics of modern CPUs guarantee that readers will
see either the old or the new version of the data structure rather than a
partially updated reference.  The reclamation phase does the work of reclaiming
(e.g., freeing) the data items removed from the data structure during the
removal phase.  Because reclaiming data items can disrupt any readers
concurrently referencing those data items, the reclamation phase must
not start until readers no longer hold references to those data items.

Splitting the update into removal and reclamation phases permits the
updater to perform the removal phase immediately, and to defer the
reclamation phase until all readers active during the removal phase have
completed, either by blocking until they finish or by registering a
callback that is invoked after they finish.  Only readers that are active
during the removal phase need be considered, because any reader starting
after the removal phase will be unable to gain a reference to the removed
data items, and therefore cannot be disrupted by the reclamation phase.

So the typical RCU update sequence goes something like the following:

a.	Remove pointers to a data structure, so that subsequent
	readers cannot gain a reference to it.

b.	Wait for all previous readers to complete their RCU read-side
	critical sections.

c.	At this point, there cannot be any readers who hold references
	to the data structure, so it now may safely be reclaimed
	(e.g., kfree()d).

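For example, for an element of an RCU-protected linked list, the three
steps might map onto code roughly as in the following sketch (the _rcu
list primitives are covered in later sections; foo_lock, the ->list
field, and np are hypothetical)::

	spin_lock(&foo_lock);		/* Exclude other updaters. */
	list_del_rcu(&np->list);	/* Step (a): unlink the element. */
	spin_unlock(&foo_lock);
	synchronize_rcu();		/* Step (b): wait for readers. */
	kfree(np);			/* Step (c): reclaim the element. */
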
Step (b) above is the key idea underlying RCU's deferred destruction.
The ability to wait until all readers are done allows RCU readers to
use much lighter-weight synchronization, and in some cases absolutely
no synchronization at all.  In contrast, in more conventional lock-based
schemes, readers must use heavy-weight synchronization in order to
prevent an updater from deleting the data structure out from under them.
This is because lock-based updaters typically update data items in place,
and must therefore exclude readers.  In contrast, RCU-based updaters
typically take advantage of the fact that writes to single aligned
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
and replacement of data items in a linked structure without disrupting
readers.  Concurrent RCU readers can then continue accessing the old
versions, and can dispense with the atomic operations, memory barriers,
and communications cache misses that are so expensive on present-day
SMP computer systems, even in the absence of lock contention.

In the three-step procedure shown above, the updater is performing both
the removal and the reclamation step, but it is often helpful for an
entirely different thread to do the reclamation, as is in fact the case
in the Linux kernel's directory-entry cache (dcache).  Even if the same
thread performs both the update step (step (a) above) and the reclamation
step (step (c) above), it is often helpful to think of them separately.
For example, RCU readers and updaters need not communicate at all,
but RCU provides implicit low-overhead communication between readers
and reclaimers, namely, in step (b) above.

So how the heck can a reclaimer tell when a reader is done, given
that readers are not doing any sort of synchronization operations???
Read on to learn about how RCU's API makes this easy.

.. _2_whatisRCU:

2.  WHAT IS RCU'S CORE API?
---------------------------

The core RCU API is quite small:

a.	rcu_read_lock()
b.	rcu_read_unlock()
c.	synchronize_rcu() / call_rcu()
d.	rcu_assign_pointer()
e.	rcu_dereference()

There are many other members of the RCU API, but the rest can be
expressed in terms of these five, though most implementations instead
express synchronize_rcu() in terms of the call_rcu() callback API.

The five core RCU APIs are described below; the other 18 will be enumerated
later.  See the kernel docbook documentation for more info, or look directly
at the function header comments.

rcu_read_lock()
^^^^^^^^^^^^^^^
	void rcu_read_lock(void);

	Used by a reader to inform the reclaimer that the reader is
	entering an RCU read-side critical section.  It is illegal
	to block while in an RCU read-side critical section, though
	kernels built with CONFIG_PREEMPT_RCU can preempt RCU
	read-side critical sections.  Any RCU-protected data structure
	accessed during an RCU read-side critical section is guaranteed to
	remain unreclaimed for the full duration of that critical section.
	Reference counts may be used in conjunction with RCU to maintain
	longer-term references to data structures.
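
	For example, a reader might promote an RCU-protected reference
	into a longer-term reference count roughly as follows (a sketch
	only; the ->ref kref field and the gp pointer are hypothetical)::

		rcu_read_lock();
		p = rcu_dereference(gp);
		if (p && !kref_get_unless_zero(&p->ref))
			p = NULL;  /* Being freed; do not use. */
		rcu_read_unlock();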

rcu_read_unlock()
^^^^^^^^^^^^^^^^^
	void rcu_read_unlock(void);

	Used by a reader to inform the reclaimer that the reader is
	exiting an RCU read-side critical section.  Note that RCU
	read-side critical sections may be nested and/or overlapping.

synchronize_rcu()
^^^^^^^^^^^^^^^^^
	void synchronize_rcu(void);

	Marks the end of updater code and the beginning of reclaimer
	code.  It does this by blocking until all pre-existing RCU
	read-side critical sections on all CPUs have completed.
	Note that synchronize_rcu() will **not** necessarily wait for
	any subsequent RCU read-side critical sections to complete.
	For example, consider the following sequence of events::

	         CPU 0                  CPU 1                 CPU 2
	     ----------------- ------------------------- ---------------
	 1.  rcu_read_lock()
	 2.                    enters synchronize_rcu()
	 3.                                               rcu_read_lock()
	 4.  rcu_read_unlock()
	 5.                     exits synchronize_rcu()
	 6.                                              rcu_read_unlock()

	To reiterate, synchronize_rcu() waits only for ongoing RCU
	read-side critical sections to complete, not necessarily for
	any that begin after synchronize_rcu() is invoked.

	Of course, synchronize_rcu() does not necessarily return
	**immediately** after the last pre-existing RCU read-side critical
	section completes.  For one thing, there might well be scheduling
	delays.  For another thing, many RCU implementations process
	requests in batches in order to improve efficiencies, which can
	further delay synchronize_rcu().

	Since synchronize_rcu() is the API that must figure out when
	readers are done, its implementation is key to RCU.  For RCU
	to be useful in all but the most read-intensive situations,
	synchronize_rcu()'s overhead must also be quite small.

	The call_rcu() API is a callback form of synchronize_rcu(),
	and is described in more detail in a later section.  Instead of
	blocking, it registers a function and argument which are invoked
	after all ongoing RCU read-side critical sections have completed.
	This callback variant is particularly useful in situations where
	it is illegal to block or where update-side performance is
	critically important.

	However, the call_rcu() API should not be used lightly, as use
	of the synchronize_rcu() API generally results in simpler code.
	In addition, the synchronize_rcu() API has the nice property
	of automatically limiting update rate should grace periods
	be delayed.  This property results in system resilience in the
	face of denial-of-service attacks.  Code using call_rcu() should
	limit update rate in order to gain this same sort of resilience.
	See checklist.txt for some approaches to limiting the update rate.

rcu_assign_pointer()
^^^^^^^^^^^^^^^^^^^^
	void rcu_assign_pointer(p, typeof(p) v);

	Yes, rcu_assign_pointer() **is** implemented as a macro, though it
	would be cool to be able to declare a function in this manner.
	(Compiler experts will no doubt disagree.)

	The updater uses this function to assign a new value to an
	RCU-protected pointer, in order to safely communicate the change
	in value from the updater to the reader.  This macro does not
	evaluate to an rvalue, but it does execute any memory-barrier
	instructions required for a given CPU architecture.

	Perhaps just as important, it serves to document (1) which
	pointers are protected by RCU and (2) the point at which a
	given structure becomes accessible to other CPUs.  That said,
	rcu_assign_pointer() is most frequently used indirectly, via
	the _rcu list-manipulation primitives such as list_add_rcu().
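
	For example, an updater might publish a fully initialized
	structure as in the following sketch (gp is a hypothetical
	RCU-protected global pointer)::

		p = kmalloc(sizeof(*p), GFP_KERNEL);
		if (p) {
			p->a = 1;
			p->b = 2;
			rcu_assign_pointer(gp, p);  /* Publication point. */
		}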

rcu_dereference()
^^^^^^^^^^^^^^^^^
	typeof(p) rcu_dereference(p);

	Like rcu_assign_pointer(), rcu_dereference() must be implemented
	as a macro.

	The reader uses rcu_dereference() to fetch an RCU-protected
	pointer, which returns a value that may then be safely
	dereferenced.  Note that rcu_dereference() does not actually
	dereference the pointer; instead, it protects the pointer for
	later dereferencing.  It also executes any needed memory-barrier
	instructions for a given CPU architecture.  Currently, only Alpha
	needs memory barriers within rcu_dereference() -- on other CPUs,
	it compiles to nothing, not even a compiler directive.

	Common coding practice uses rcu_dereference() to copy an
	RCU-protected pointer to a local variable, then dereferences
	this local variable, for example as follows::

		p = rcu_dereference(head.next);
		return p->data;

	However, in this case, one could just as easily combine these
	into one statement::

		return rcu_dereference(head.next)->data;

	If you are going to be fetching multiple fields from the
	RCU-protected structure, using the local variable is of
	course preferred.  Repeated rcu_dereference() calls look
	ugly, do not guarantee that the same pointer will be returned
	if an update happened while in the critical section, and incur
	unnecessary overhead on Alpha CPUs.

	Note that the value returned by rcu_dereference() is valid
	only within the enclosing RCU read-side critical section [1]_.
	For example, the following is **not** legal::

		rcu_read_lock();
		p = rcu_dereference(head.next);
		rcu_read_unlock();
		x = p->address;	/* BUG!!! */
		rcu_read_lock();
		y = p->data;	/* BUG!!! */
		rcu_read_unlock();

	Holding a reference from one RCU read-side critical section
	to another is just as illegal as holding a reference from
	one lock-based critical section to another!  Similarly,
	using a reference outside of the critical section in which
	it was acquired is just as illegal as doing so with normal
	locking.
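
	A legal alternative is to re-fetch the pointer within each
	critical section, as in the following sketch, which may of
	course observe a different structure the second time::

		rcu_read_lock();
		p = rcu_dereference(head.next);
		x = p->address;
		rcu_read_unlock();

		rcu_read_lock();
		p = rcu_dereference(head.next);	/* Possibly a new element. */
		y = p->data;
		rcu_read_unlock();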

	As with rcu_assign_pointer(), an important function of
	rcu_dereference() is to document which pointers are protected by
	RCU, in particular, flagging a pointer that is subject to changing
	at any time, including immediately after the rcu_dereference().
	And, again like rcu_assign_pointer(), rcu_dereference() is
	typically used indirectly, via the _rcu list-manipulation
	primitives, such as list_for_each_entry_rcu() [2]_.

.. 	[1] The variant rcu_dereference_protected() can be used outside
	of an RCU read-side critical section as long as the usage is
	protected by locks acquired by the update-side code.  This variant
	avoids the lockdep warning that would happen when using (for
	example) rcu_dereference() without rcu_read_lock() protection.
	Using rcu_dereference_protected() also has the advantage
	of permitting compiler optimizations that rcu_dereference()
	must prohibit.  The rcu_dereference_protected() variant takes
	a lockdep expression to indicate which locks must be acquired
	by the caller.  If the indicated protection is not provided,
	a lockdep splat is emitted.  See Documentation/RCU/Design/Requirements/Requirements.rst
	and the API's code comments for more details and example usage.

.. 	[2] If the list_for_each_entry_rcu() instance might be used by
	update-side code as well as by RCU readers, then an additional
	lockdep expression can be added to its list of arguments.
	For example, given an additional "lockdep_is_held(&mylock)" argument,
	the RCU lockdep code would complain only if this instance was
	invoked outside of an RCU read-side critical section and without
	the protection of mylock.
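
For example, such a dual-use traversal might look like the following
sketch (mylock, the list head, the ->list member, and the helper are
hypothetical)::

	list_for_each_entry_rcu(p, &head, list, lockdep_is_held(&mylock)) {
		do_something_with(p);
	}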

The following diagram shows how each API communicates among the
reader, updater, and reclaimer.
::


	    rcu_assign_pointer()
	                            +--------+
	    +---------------------->| reader |---------+
	    |                       +--------+         |
	    |                           |              |
	    |                           |              | Protect:
	    |                           |              | rcu_read_lock()
	    |                           |              | rcu_read_unlock()
	    |        rcu_dereference()  |              |
	    +---------+                 |              |
	    | updater |<----------------+              |
	    +---------+                                V
	    |                                    +-----------+
	    +----------------------------------->| reclaimer |
	                                         +-----------+
	      Defer:
	      synchronize_rcu() & call_rcu()


The RCU infrastructure observes the time sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked.  Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.

There are at least three flavors of RCU usage in the Linux kernel. The diagram
above shows the most common one. On the updater side, the rcu_assign_pointer(),
synchronize_rcu() and call_rcu() primitives used are the same for all three
flavors. However, for protection (on the reader side), the primitives used vary
depending on the flavor:

a.	rcu_read_lock() / rcu_read_unlock()
	rcu_dereference()

b.	rcu_read_lock_bh() / rcu_read_unlock_bh()
	local_bh_disable() / local_bh_enable()
	rcu_dereference_bh()

c.	rcu_read_lock_sched() / rcu_read_unlock_sched()
	preempt_disable() / preempt_enable()
	local_irq_save() / local_irq_restore()
	hardirq enter / hardirq exit
	NMI enter / NMI exit
	rcu_dereference_sched()

These three flavors are used as follows:

a.	RCU applied to normal data structures.

b.	RCU applied to networking data structures that may be subjected
	to remote denial-of-service attacks.

c.	RCU applied to scheduler and interrupt/NMI-handler tasks.

Again, most uses will be of (a).  The (b) and (c) cases are important
for specialized uses, but are relatively uncommon.
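
For example, a flavor (b) reader might look like the following sketch
(gbl_ptr and do_something_with() are hypothetical)::

	rcu_read_lock_bh();
	p = rcu_dereference_bh(gbl_ptr);
	if (p)
		do_something_with(p);
	rcu_read_unlock_bh();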

.. _3_whatisRCU:

3.  WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
-----------------------------------------------

This section shows a simple use of the core RCU API to protect a
global pointer to a dynamically allocated structure.  More-typical
uses of RCU may be found in :ref:`listRCU.rst <list_rcu_doc>`,
:ref:`arrayRCU.rst <array_rcu_doc>`, and :ref:`NMI-RCU.rst <NMI_rcu_doc>`.
::

	struct foo {
		int a;
		char b;
		long c;
	};
	DEFINE_SPINLOCK(foo_mutex);

	struct foo __rcu *gbl_foo;

	/*
	 * Create a new struct foo that is the same as the one currently
	 * pointed to by gbl_foo, except that field "a" is replaced
	 * with "new_a".  Points gbl_foo to the new structure, and
	 * frees up the old structure after a grace period.
	 *
	 * Uses rcu_assign_pointer() to ensure that concurrent readers
	 * see the initialized version of the new structure.
	 *
	 * Uses synchronize_rcu() to ensure that any readers that might
	 * have references to the old structure complete before freeing
	 * the old structure.
	 */
	void foo_update_a(int new_a)
	{
		struct foo *new_fp;
		struct foo *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		spin_lock(&foo_mutex);
		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
		*new_fp = *old_fp;
		new_fp->a = new_a;
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_mutex);
		synchronize_rcu();
		kfree(old_fp);
	}

	/*
	 * Return the value of field "a" of the current gbl_foo
	 * structure.  Use rcu_read_lock() and rcu_read_unlock()
	 * to ensure that the structure does not get deleted out
	 * from under us, and use rcu_dereference() to ensure that
	 * we see the initialized version of the structure (important
	 * for DEC Alpha and for people reading the code).
	 */
	int foo_get_a(void)
	{
		int retval;

		rcu_read_lock();
		retval = rcu_dereference(gbl_foo)->a;
		rcu_read_unlock();
		return retval;
	}

So, to sum up:

-	Use rcu_read_lock() and rcu_read_unlock() to guard RCU
	read-side critical sections.

-	Within an RCU read-side critical section, use rcu_dereference()
	to dereference RCU-protected pointers.

-	Use some solid scheme (such as locks or semaphores) to
	keep concurrent updates from interfering with each other.

-	Use rcu_assign_pointer() to update an RCU-protected pointer.
	This primitive protects concurrent readers from the updater,
	**not** concurrent updates from each other!  You therefore still
	need to use locking (or something similar) to keep concurrent
	rcu_assign_pointer() primitives from interfering with each other.

-	Use synchronize_rcu() **after** removing a data element from an
	RCU-protected data structure, but **before** reclaiming/freeing
	the data element, in order to wait for the completion of all
	RCU read-side critical sections that might be referencing that
	data item.

See checklist.txt for additional rules to follow when using RCU.
And again, more-typical uses of RCU may be found in :ref:`listRCU.rst
<list_rcu_doc>`, :ref:`arrayRCU.rst <array_rcu_doc>`, and :ref:`NMI-RCU.rst
<NMI_rcu_doc>`.

.. _4_whatisRCU:

4.  WHAT IF MY UPDATING THREAD CANNOT BLOCK?
--------------------------------------------

In the example above, foo_update_a() blocks until a grace period elapses.
This is quite simple, but in some cases one cannot afford to wait so
long -- there might be other high-priority work to be done.

In such cases, one uses call_rcu() rather than synchronize_rcu().
The call_rcu() API is as follows::

	void call_rcu(struct rcu_head *head,
		      void (*func)(struct rcu_head *head));

This function invokes func(head) after a grace period has elapsed.
This invocation might happen from either softirq or process context,
so the function is not permitted to block.  The foo struct needs to
have an rcu_head structure added, perhaps as follows::

	struct foo {
		int a;
		char b;
		long c;
		struct rcu_head rcu;
	};

The foo_update_a() function might then be written as follows::

	/*
	 * Create a new struct foo that is the same as the one currently
	 * pointed to by gbl_foo, except that field "a" is replaced
	 * with "new_a".  Points gbl_foo to the new structure, and
	 * frees up the old structure after a grace period.
	 *
	 * Uses rcu_assign_pointer() to ensure that concurrent readers
	 * see the initialized version of the new structure.
	 *
	 * Uses call_rcu() to ensure that any readers that might have
	 * references to the old structure complete before freeing the
	 * old structure.
	 */
	void foo_update_a(int new_a)
	{
		struct foo *new_fp;
		struct foo *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		spin_lock(&foo_mutex);
		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
		*new_fp = *old_fp;
		new_fp->a = new_a;
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_mutex);
		call_rcu(&old_fp->rcu, foo_reclaim);
	}

The foo_reclaim() function might appear as follows::

	void foo_reclaim(struct rcu_head *rp)
	{
		struct foo *fp = container_of(rp, struct foo, rcu);

		foo_cleanup(fp->a);

		kfree(fp);
	}

The container_of() primitive is a macro that, given a pointer into a
struct, the type of the struct, and the pointed-to field within the
struct, returns a pointer to the beginning of the struct.
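
In foo_reclaim(), for example, the expression::

	container_of(rp, struct foo, rcu)

subtracts offsetof(struct foo, rcu) from rp, yielding a pointer to the
enclosing struct foo.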

The use of call_rcu() permits the caller of foo_update_a() to
immediately regain control, without needing to worry further about the
old version of the newly updated element.  It also clearly shows the
RCU distinction between updater, namely foo_update_a(), and reclaimer,
namely foo_reclaim().

The summary of advice is the same as for the previous section, except
that we are now using call_rcu() rather than synchronize_rcu():

-	Use call_rcu() **after** removing a data element from an
	RCU-protected data structure in order to register a callback
	function that will be invoked after the completion of all RCU
	read-side critical sections that might be referencing that
	data item.

If the callback for call_rcu() is not doing anything more than calling
kfree() on the structure, you can use kfree_rcu() instead of call_rcu()
to avoid having to write your own callback::

	kfree_rcu(old_fp, rcu);

Again, see checklist.txt for additional rules governing the use of RCU.

.. _5_whatisRCU:

5.  WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
------------------------------------------------

One of the nice things about RCU is that it has extremely simple "toy"
implementations that are a good first step towards understanding the
production-quality implementations in the Linux kernel.  This section
presents two such "toy" implementations of RCU, one that is implemented
in terms of familiar locking primitives, and another that more closely
resembles "classic" RCU.  Both are way too simple for real-world use,
lacking both functionality and performance.  However, they are useful
in getting a feel for how RCU works.  See kernel/rcu/update.c for a
production-quality implementation, and see:

	http://www.rdrop.com/users/paulmck/RCU

for papers describing the Linux kernel RCU implementation.  The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
more details on the current implementation as of early 2004.


5A.  "TOY" IMPLEMENTATION #1: LOCKING
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section presents a "toy" RCU implementation that is based on
familiar locking primitives.  Its overhead makes it a non-starter for
real-life use, as does its lack of scalability.  It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
one read-side critical section to another.  It also assumes recursive
reader-writer locks:  If you try this with non-recursive locks, and
you allow nested rcu_read_lock() calls, you can deadlock.

However, it is probably the easiest implementation to relate to, so is
a good starting point.

It is extremely simple::

	static DEFINE_RWLOCK(rcu_gp_mutex);

	void rcu_read_lock(void)
	{
		read_lock(&rcu_gp_mutex);
	}

	void rcu_read_unlock(void)
	{
		read_unlock(&rcu_gp_mutex);
	}

	void synchronize_rcu(void)
	{
		write_lock(&rcu_gp_mutex);
		smp_mb__after_spinlock();
		write_unlock(&rcu_gp_mutex);
	}

[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
much.  But here are simplified versions anyway.  And whatever you do,
don't forget about them when submitting patches making use of RCU!]::

	#define rcu_assign_pointer(p, v) \
	({ \
		smp_store_release(&(p), (v)); \
	})

	#define rcu_dereference(p) \
	({ \
		typeof(p) _________p1 = READ_ONCE(p); \
		(_________p1); \
	})


The rcu_read_lock() and rcu_read_unlock() primitives read-acquire
and release a global reader-writer lock.  The synchronize_rcu()
primitive write-acquires this same lock, then releases it.  This means
that once synchronize_rcu() exits, all RCU read-side critical sections
that were in progress before synchronize_rcu() was called are guaranteed
to have completed -- there is no way that synchronize_rcu() would have
been able to write-acquire the lock otherwise.  The smp_mb__after_spinlock()
promotes synchronize_rcu() to a full memory barrier in compliance with
the "Memory-Barrier Guarantees" listed in:

	Documentation/RCU/Design/Requirements/Requirements.rst

It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired.  Note also that rcu_read_lock() is immune
from deadlock (an important property of RCU).  The reason for this is
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
so there can be no deadlock cycle.

.. _quiz_1:

Quick Quiz #1:
		Why is this argument naive?  How could a deadlock
		occur when using this algorithm in a real-world Linux
		kernel?  How could this deadlock be avoided?

:ref:`Answers to Quick Quiz <8_whatisRCU>`

5B.  "TOY" EXAMPLE #2: CLASSIC RCU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section presents a "toy" RCU implementation that is based on
"classic RCU".  It is also short on performance (but only for updates) and
on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
kernels.  The definitions of rcu_dereference() and rcu_assign_pointer()
are the same as those shown in the preceding section, so they are omitted.
::

	void rcu_read_lock(void) { }

	void rcu_read_unlock(void) { }

	void synchronize_rcu(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			run_on(cpu);
	}

Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
This is the great strength of classic RCU in a non-preemptive kernel:
read-side overhead is precisely zero, at least on non-Alpha CPUs.
And there is absolutely no way that rcu_read_lock() can possibly
participate in a deadlock cycle!

The implementation of synchronize_rcu() simply schedules itself on each
CPU in turn.  The run_on() primitive can be implemented straightforwardly
in terms of the sched_setaffinity() primitive.  Of course, a somewhat less
"toy" implementation would restore the affinity upon completion rather
than just leaving all tasks running on the last CPU, but when I said
"toy", I meant **toy**!
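
For example, run_on() might be sketched as follows (toy quality, as
advertised: the task's original CPU affinity is not saved or restored)::

	static void run_on(int cpu)
	{
		sched_setaffinity(current->pid, cpumask_of(cpu));
	}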

So how the heck is this supposed to work???

Remember that it is illegal to block while in an RCU read-side critical
section.  Therefore, if a given CPU executes a context switch, we know
that it must have completed all preceding RCU read-side critical sections.
Once **all** CPUs have executed a context switch, then **all** preceding
RCU read-side critical sections will have completed.

So, suppose that we remove a data item from its structure and then invoke
synchronize_rcu().  Once synchronize_rcu() returns, we are guaranteed
that there are no RCU read-side critical sections holding a reference
to that data item, so we can safely reclaim it.

.. _quiz_2:

Quick Quiz #2:
		Give an example where Classic RCU's read-side
		overhead is **negative**.

:ref:`Answers to Quick Quiz <8_whatisRCU>`

.. _quiz_3:

Quick Quiz #3:
		If it is illegal to block in an RCU read-side
		critical section, what the heck do you do in
		PREEMPT_RT, where normal spinlocks can block???

:ref:`Answers to Quick Quiz <8_whatisRCU>`

.. _6_whatisRCU:

6.  ANALOGY WITH READER-WRITER LOCKING
--------------------------------------

Although RCU can be used in many different ways, a very common use of
RCU is analogous to reader-writer locking.  The following unified
diff shows how closely related RCU and reader-writer locking can be.
::

	@@ -5,5 +5,5 @@ struct el {
	 	int data;
	 	/* Other data fields */
	 };
	-rwlock_t listmutex;
	+spinlock_t listmutex;
	 struct el head;

	@@ -13,15 +14,15 @@
		struct list_head *lp;
		struct el *p;

	-	read_lock(&listmutex);
	-	list_for_each_entry(p, head, lp) {
	+	rcu_read_lock();
	+	list_for_each_entry_rcu(p, head, lp) {
			if (p->key == key) {
				*result = p->data;
	-			read_unlock(&listmutex);
	+			rcu_read_unlock();
				return 1;
			}
		}
	-	read_unlock(&listmutex);
	+	rcu_read_unlock();
		return 0;
	 }

	@@ -29,15 +30,16 @@
	 {
		struct el *p;

	-	write_lock(&listmutex);
	+	spin_lock(&listmutex);
		list_for_each_entry(p, head, lp) {
			if (p->key == key) {
	-			list_del(&p->list);
	-			write_unlock(&listmutex);
	+			list_del_rcu(&p->list);
	+			spin_unlock(&listmutex);
	+			synchronize_rcu();
				kfree(p);
				return 1;
			}
		}
	-	write_unlock(&listmutex);
	+	spin_unlock(&listmutex);
		return 0;
	 }

Or, for those who prefer a side-by-side listing::

 1 struct el {                          1 struct el {
 2   struct list_head list;             2   struct list_head list;
 3   long key;                          3   long key;
 4   spinlock_t mutex;                  4   spinlock_t mutex;
 5   int data;                          5   int data;
 6   /* Other data fields */            6   /* Other data fields */
 7 };                                   7 };
 8 rwlock_t listmutex;                  8 spinlock_t listmutex;
 9 struct el head;                      9 struct el head;

::

  1 int search(long key, int *result)    1 int search(long key, int *result)
  2 {                                    2 {
  3   struct list_head *lp;              3   struct list_head *lp;
  4   struct el *p;                      4   struct el *p;
  5                                      5
  6   read_lock(&listmutex);             6   rcu_read_lock();
  7   list_for_each_entry(p, head, lp) { 7   list_for_each_entry_rcu(p, head, lp) {
  8     if (p->key == key) {             8     if (p->key == key) {
  9       *result = p->data;             9       *result = p->data;
 10       read_unlock(&listmutex);      10       rcu_read_unlock();
 11       return 1;                     11       return 1;
 12     }                               12     }
 13   }                                 13   }
 14   read_unlock(&listmutex);          14   rcu_read_unlock();
 15   return 0;                         15   return 0;
 16 }                                   16 }

::

  1 int delete(long key)                 1 int delete(long key)
  2 {                                    2 {
  3   struct el *p;                      3   struct el *p;
  4                                      4
  5   write_lock(&listmutex);            5   spin_lock(&listmutex);
  6   list_for_each_entry(p, head, lp) { 6   list_for_each_entry(p, head, lp) {
  7     if (p->key == key) {             7     if (p->key == key) {
  8       list_del(&p->list);            8       list_del_rcu(&p->list);
  9       write_unlock(&listmutex);      9       spin_unlock(&listmutex);
                                        10       synchronize_rcu();
 10       kfree(p);                     11       kfree(p);
 11       return 1;                     12       return 1;
 12     }                               13     }
 13   }                                 14   }
 14   write_unlock(&listmutex);         15   spin_unlock(&listmutex);
 15   return 0;                         16   return 0;
 16 }                                   17 }

Either way, the differences are quite small.  Read-side locking moves
to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from
a reader-writer lock to a simple spinlock, and a synchronize_rcu()
precedes the kfree().

However, there is one potential catch: the read-side and update-side
critical sections can now run concurrently.  In many cases, this will
not be a problem, but it is necessary to check carefully regardless.
For example, if multiple independent list updates must be seen as
a single atomic update, converting to RCU will require special care.

Also, the presence of synchronize_rcu() means that the RCU version of
delete() can now block.  If this is a problem, there is a callback-based
mechanism that never blocks, namely call_rcu() or kfree_rcu(), that can
be used in place of synchronize_rcu().
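
For example, the RCU-based delete() above could avoid blocking by
replacing the synchronize_rcu()/kfree() pair with kfree_rcu(), assuming
that struct el gains an rcu_head field (named rcu here)::

	list_del_rcu(&p->list);
	spin_unlock(&listmutex);
	kfree_rcu(p, rcu);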
873*4882a593Smuzhiyun
874*4882a593Smuzhiyun.. _7_whatisRCU:
875*4882a593Smuzhiyun
876*4882a593Smuzhiyun7.  FULL LIST OF RCU APIs
877*4882a593Smuzhiyun-------------------------
878*4882a593Smuzhiyun
879*4882a593SmuzhiyunThe RCU APIs are documented in docbook-format header comments in the
880*4882a593SmuzhiyunLinux-kernel source code, but it helps to have a full list of the
881*4882a593SmuzhiyunAPIs, since there does not appear to be a way to categorize them
882*4882a593Smuzhiyunin docbook.  Here is the list, by category.
883*4882a593Smuzhiyun
884*4882a593SmuzhiyunRCU list traversal::
885*4882a593Smuzhiyun
886*4882a593Smuzhiyun	list_entry_rcu
887*4882a593Smuzhiyun	list_entry_lockless
888*4882a593Smuzhiyun	list_first_entry_rcu
889*4882a593Smuzhiyun	list_next_rcu
890*4882a593Smuzhiyun	list_for_each_entry_rcu
891*4882a593Smuzhiyun	list_for_each_entry_continue_rcu
892*4882a593Smuzhiyun	list_for_each_entry_from_rcu
893*4882a593Smuzhiyun	list_first_or_null_rcu
894*4882a593Smuzhiyun	list_next_or_null_rcu
895*4882a593Smuzhiyun	hlist_first_rcu
896*4882a593Smuzhiyun	hlist_next_rcu
897*4882a593Smuzhiyun	hlist_pprev_rcu
898*4882a593Smuzhiyun	hlist_for_each_entry_rcu
899*4882a593Smuzhiyun	hlist_for_each_entry_rcu_bh
900*4882a593Smuzhiyun	hlist_for_each_entry_from_rcu
901*4882a593Smuzhiyun	hlist_for_each_entry_continue_rcu
902*4882a593Smuzhiyun	hlist_for_each_entry_continue_rcu_bh
903*4882a593Smuzhiyun	hlist_nulls_first_rcu
904*4882a593Smuzhiyun	hlist_nulls_for_each_entry_rcu
905*4882a593Smuzhiyun	hlist_bl_first_rcu
906*4882a593Smuzhiyun	hlist_bl_for_each_entry_rcu
907*4882a593Smuzhiyun
908*4882a593SmuzhiyunRCU pointer/list update::
909*4882a593Smuzhiyun
910*4882a593Smuzhiyun	rcu_assign_pointer
911*4882a593Smuzhiyun	list_add_rcu
912*4882a593Smuzhiyun	list_add_tail_rcu
913*4882a593Smuzhiyun	list_del_rcu
914*4882a593Smuzhiyun	list_replace_rcu
915*4882a593Smuzhiyun	hlist_add_behind_rcu
916*4882a593Smuzhiyun	hlist_add_before_rcu
917*4882a593Smuzhiyun	hlist_add_head_rcu
918*4882a593Smuzhiyun	hlist_add_tail_rcu
919*4882a593Smuzhiyun	hlist_del_rcu
920*4882a593Smuzhiyun	hlist_del_init_rcu
921*4882a593Smuzhiyun	hlist_replace_rcu
922*4882a593Smuzhiyun	list_splice_init_rcu
923*4882a593Smuzhiyun	list_splice_tail_init_rcu
924*4882a593Smuzhiyun	hlist_nulls_del_init_rcu
925*4882a593Smuzhiyun	hlist_nulls_del_rcu
926*4882a593Smuzhiyun	hlist_nulls_add_head_rcu
927*4882a593Smuzhiyun	hlist_bl_add_head_rcu
928*4882a593Smuzhiyun	hlist_bl_del_init_rcu
929*4882a593Smuzhiyun	hlist_bl_del_rcu
930*4882a593Smuzhiyun	hlist_bl_set_first_rcu
931*4882a593Smuzhiyun
RCU::

	Critical sections	Grace period		Barrier

	rcu_read_lock		synchronize_net		rcu_barrier
	rcu_read_unlock		synchronize_rcu
	rcu_dereference		synchronize_rcu_expedited
	rcu_read_lock_held	call_rcu
	rcu_dereference_check	kfree_rcu
	rcu_dereference_protected

bh::

	Critical sections	Grace period		Barrier

	rcu_read_lock_bh	call_rcu		rcu_barrier
	rcu_read_unlock_bh	synchronize_rcu
	[local_bh_disable]	synchronize_rcu_expedited
	[and friends]
	rcu_dereference_bh
	rcu_dereference_bh_check
	rcu_dereference_bh_protected
	rcu_read_lock_bh_held

sched::

	Critical sections	Grace period		Barrier

	rcu_read_lock_sched	call_rcu		rcu_barrier
	rcu_read_unlock_sched	synchronize_rcu
	[preempt_disable]	synchronize_rcu_expedited
	[and friends]
	rcu_read_lock_sched_notrace
	rcu_read_unlock_sched_notrace
	rcu_dereference_sched
	rcu_dereference_sched_check
	rcu_dereference_sched_protected
	rcu_read_lock_sched_held


SRCU::

	Critical sections	Grace period		Barrier

	srcu_read_lock		call_srcu		srcu_barrier
	srcu_read_unlock	synchronize_srcu
	srcu_dereference	synchronize_srcu_expedited
	srcu_dereference_check
	srcu_read_lock_held

SRCU: Initialization/cleanup::

	DEFINE_SRCU
	DEFINE_STATIC_SRCU
	init_srcu_struct
	cleanup_srcu_struct

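As a quick illustration of the SRCU API, here is a minimal sketch of an
SRCU-protected pointer; my_srcu, my_lock, struct my_data, and
do_something_with() are hypothetical names used only for this example::

	static DEFINE_SRCU(my_srcu);
	static struct my_data __rcu *my_ptr;

	/* Reader: may block inside the SRCU read-side critical section. */
	int idx = srcu_read_lock(&my_srcu);
	struct my_data *p = srcu_dereference(my_ptr, &my_srcu);

	if (p)
		do_something_with(p);	/* Sleeping here is legal under SRCU. */
	srcu_read_unlock(&my_srcu, idx);

	/* Updater, assumed to hold the (hypothetical) my_lock: */
	old_p = rcu_dereference_protected(my_ptr, lockdep_is_held(&my_lock));
	rcu_assign_pointer(my_ptr, new_p);
	synchronize_srcu(&my_srcu);	/* Waits only for my_srcu readers. */
	kfree(old_p);
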
All: lockdep-checked RCU-protected pointer access::

	rcu_access_pointer
	rcu_dereference_raw
	RCU_LOCKDEP_WARN
	rcu_sleep_check
	RCU_NONIDLE

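For example, here is a hedged sketch of two of these accessors, assuming
a hypothetical RCU-protected global pointer gp and a hypothetical
remove_gp() helper::

	/* Pointer test only: no dereference, so no protection required. */
	if (rcu_access_pointer(gp))
		remove_gp();

	/* Complain via lockdep if this code runs outside an RCU reader. */
	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "my_lookup() called without rcu_read_lock()!");
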
See the comment headers in the source code (or the docbook generated
from them) for more information.

However, given that there are no fewer than four families of RCU APIs
in the Linux kernel, how do you choose which one to use?  The following
list can be helpful:

a.	Will readers need to block?  If so, you need SRCU.

b.	What about the -rt patchset?  If readers would need to block
	in a non-rt kernel, you need SRCU.  If readers would block
	in a -rt kernel, but not in a non-rt kernel, SRCU is not
	necessary.  (The -rt patchset turns spinlocks into sleeplocks,
	hence this distinction.)

c.	Do you need to treat NMI handlers, hardirq handlers,
	and code segments with preemption disabled (whether
	via preempt_disable(), local_irq_save(), local_bh_disable(),
	or some other mechanism) as if they were explicit RCU readers?
	If so, RCU-sched is the only choice that will work for you.

d.	Do you need RCU grace periods to complete even in the face
	of softirq monopolization of one or more of the CPUs?  For
	example, is your code subject to network-based denial-of-service
	attacks?  If so, you should disable softirq across your readers,
	for example, by using rcu_read_lock_bh(), as in the sketch
	following this list.

e.	Is your workload too update-intensive for normal use of
	RCU, but inappropriate for other synchronization mechanisms?
	If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
	named SLAB_DESTROY_BY_RCU).  But please be careful!

f.	Do you need read-side critical sections that are respected
	even though they are in the middle of the idle loop, during
	user-mode execution, or on an offlined CPU?  If so, SRCU is the
	only choice that will work for you.

g.	Otherwise, use RCU.

Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.
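
For example, a reader written for case (d) above might look like the
following minimal sketch, in which my_table and my_lookup() are
hypothetical::

	rcu_read_lock_bh();		/* Also disables softirq locally. */
	p = rcu_dereference_bh(my_table);
	if (p)
		my_lookup(p, key);
	rcu_read_unlock_bh();

	/* Updater: synchronize_rcu() waits for bh readers as well
	 * (see the bh table above). */
	rcu_assign_pointer(my_table, new_table);
	synchronize_rcu();
	kfree(old_table);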

.. _8_whatisRCU:

8.  ANSWERS TO QUICK QUIZZES
----------------------------

Quick Quiz #1:
		Why is this argument naive?  How could a deadlock
		occur when using this algorithm in a real-world Linux
		kernel?  [Referring to the lock-based "toy" RCU
		algorithm.]

Answer:
		Consider the following sequence of events:

		1.	CPU 0 acquires some unrelated lock, call it
			"problematic_lock", disabling irq via
			spin_lock_irqsave().

		2.	CPU 1 enters synchronize_rcu(), write-acquiring
			rcu_gp_mutex.

		3.	CPU 0 enters rcu_read_lock(), but must wait
			because CPU 1 holds rcu_gp_mutex.

		4.	CPU 1 is interrupted, and the irq handler
			attempts to acquire problematic_lock.

		The system is now deadlocked.

		One way to avoid this deadlock is to use an approach like
		that of CONFIG_PREEMPT_RT, where all normal spinlocks
		become blocking locks, and all irq handlers execute in
		the context of special tasks.  In this case, in step 4
		above, the irq handler would block, allowing CPU 1 to
		release rcu_gp_mutex, avoiding the deadlock.

		Even in the absence of deadlock, this RCU implementation
		allows latency to "bleed" from readers to other
		readers through synchronize_rcu().  To see this,
		consider task A in an RCU read-side critical section
		(thus read-holding rcu_gp_mutex), task B blocked
		attempting to write-acquire rcu_gp_mutex, and
		task C blocked in rcu_read_lock() attempting to
		read-acquire rcu_gp_mutex.  Task A's RCU read-side
		latency is holding up task C, albeit indirectly via
		task B.

		Realtime RCU implementations therefore use a counter-based
		approach where tasks in RCU read-side critical sections
		cannot be blocked by tasks executing synchronize_rcu().

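		This sequence can be sketched in code, assuming the
		rwlock-based "toy" implementation from Section 5 (the
		unrelated problematic_lock is hypothetical)::

			/* CPU 0, process context: */
			spin_lock_irqsave(&problematic_lock, flags); /* step 1 */
			rcu_read_lock();    /* step 3: waits on rcu_gp_mutex */

			/* CPU 1, process context: */
			synchronize_rcu();  /* step 2: write-holds rcu_gp_mutex */

			/* CPU 1, irq handler interrupting synchronize_rcu(): */
			spin_lock(&problematic_lock); /* step 4: CPU 0 holds it */

		CPU 0 cannot proceed until CPU 1 releases rcu_gp_mutex,
		and CPU 1 cannot release it until its irq handler first
		acquires problematic_lock, which CPU 0 holds with irqs
		disabled.
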
:ref:`Back to Quick Quiz #1 <quiz_1>`

Quick Quiz #2:
		Give an example where Classic RCU's read-side
		overhead is **negative**.

Answer:
		Imagine a single-CPU system with a non-CONFIG_PREEMPT
		kernel where a routing table is used by process-context
		code, but can be updated by irq-context code (for example,
		by an "ICMP REDIRECT" packet).  The usual way of handling
		this would be to have the process-context code disable
		interrupts while searching the routing table.  Use of
		RCU allows such interrupt-disabling to be dispensed with.
		Thus, without RCU, you pay the cost of disabling interrupts,
		and with RCU you don't.

		One can argue that the overhead of RCU in this
		case is negative with respect to the single-CPU
		interrupt-disabling approach.  Others might argue that
		the overhead of RCU is merely zero, and that replacing
		the positive overhead of the interrupt-disabling scheme
		with the zero-overhead RCU scheme does not constitute
		negative overhead.

		In real life, of course, things are more complex.  But
		even the theoretical possibility of negative overhead for
		a synchronization primitive is a bit unexpected.  ;-)

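		The comparison can be made concrete with a minimal
		sketch, in which route_table and route_lookup() are
		hypothetical::

			/* Without RCU: every search pays for irq disable/enable. */
			local_irq_save(flags);
			route = route_lookup(route_table, addr);
			/* ... use route ... */
			local_irq_restore(flags);

			/* With RCU on a single-CPU, non-CONFIG_PREEMPT kernel: */
			rcu_read_lock();    /* compiles to nothing here */
			route = route_lookup(rcu_dereference(route_table), addr);
			/* ... use route ... */
			rcu_read_unlock();  /* likewise compiles to nothing */
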
:ref:`Back to Quick Quiz #2 <quiz_2>`

Quick Quiz #3:
		If it is illegal to block in an RCU read-side
		critical section, what the heck do you do in
		PREEMPT_RT, where normal spinlocks can block???

Answer:
		Just as PREEMPT_RT permits preemption of spinlock
		critical sections, it permits preemption of RCU
		read-side critical sections.  It also permits
		spinlocks to block while in RCU read-side critical
		sections.

		Why the apparent inconsistency?  Because it is
		possible to use priority boosting to keep the RCU
		grace periods short if need be (for example, if running
		short of memory).  In contrast, if blocking waiting
		for (say) network reception, there is no way to know
		what should be boosted.  Especially given that the
		process we need to boost might well be a human being
		who just went out for a pizza or something.  And although
		a computer-operated cattle prod might arouse serious
		interest, it might also provoke serious objections.
		Besides, how does the computer know what pizza parlor
		the human being went to???

:ref:`Back to Quick Quiz #3 <quiz_3>`

ACKNOWLEDGEMENTS

My thanks to the people who helped make this human-readable, including
Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.


For more information, see http://www.rdrop.com/users/paulmck/RCU.