=================================
A Tour Through RCU's Requirements
=================================

Copyright IBM Corporation, 2015

Author: Paul E. McKenney

The initial version of this document appeared in the
`LWN <https://lwn.net/>`_ articles
`part 1 <https://lwn.net/Articles/652156/>`_,
`part 2 <https://lwn.net/Articles/652677/>`_, and
`part 3 <https://lwn.net/Articles/653326/>`_.

Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side
primitives can be exceedingly fast and scalable. In addition, updaters
can make useful forward progress concurrently with readers. However, all
this concurrency between RCU readers and updaters does raise the
question of exactly what RCU readers are doing, which in turn raises the
question of exactly what RCU's requirements are.

This document therefore summarizes RCU's requirements, and can be
thought of as an informal, high-level specification for RCU. It is
important to understand that RCU's specification is primarily empirical
in nature; in fact, I learned about many of these requirements the hard
way. This situation might cause some consternation; however, not only
has this learning process been a lot of fun, but it has also been a
great privilege to work with so many people willing to apply
technologies in interesting new ways.

All that aside, here are the categories of currently known RCU
requirements:

#. `Fundamental Requirements`_
#. `Fundamental Non-Requirements`_
#. `Parallelism Facts of Life`_
#. `Quality-of-Implementation Requirements`_
#. `Linux Kernel Complications`_
#. `Software-Engineering Requirements`_
#. `Other RCU Flavors`_
#. `Possible Future Changes`_

This is followed by a `summary <#Summary>`__; however, the answers to
each quick quiz immediately follow the quiz.
51*4882a593Smuzhiyun
52*4882a593SmuzhiyunFundamental Requirements
53*4882a593Smuzhiyun------------------------
54*4882a593Smuzhiyun
55*4882a593SmuzhiyunRCU's fundamental requirements are the closest thing RCU has to hard
56*4882a593Smuzhiyunmathematical requirements. These are:
57*4882a593Smuzhiyun
58*4882a593Smuzhiyun#. `Grace-Period Guarantee`_
59*4882a593Smuzhiyun#. `Publish/Subscribe Guarantee`_
60*4882a593Smuzhiyun#. `Memory-Barrier Guarantees`_
61*4882a593Smuzhiyun#. `RCU Primitives Guaranteed to Execute Unconditionally`_
62*4882a593Smuzhiyun#. `Guaranteed Read-to-Write Upgrade`_
63*4882a593Smuzhiyun
64*4882a593SmuzhiyunGrace-Period Guarantee
65*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~
66*4882a593Smuzhiyun
67*4882a593SmuzhiyunRCU's grace-period guarantee is unusual in being premeditated: Jack
68*4882a593SmuzhiyunSlingwine and I had this guarantee firmly in mind when we started work
69*4882a593Smuzhiyunon RCU (then called “rclock”) in the early 1990s. That said, the past
70*4882a593Smuzhiyuntwo decades of experience with RCU have produced a much more detailed
71*4882a593Smuzhiyununderstanding of this guarantee.
72*4882a593Smuzhiyun
73*4882a593SmuzhiyunRCU's grace-period guarantee allows updaters to wait for the completion
74*4882a593Smuzhiyunof all pre-existing RCU read-side critical sections. An RCU read-side
75*4882a593Smuzhiyuncritical section begins with the marker ``rcu_read_lock()`` and ends
76*4882a593Smuzhiyunwith the marker ``rcu_read_unlock()``. These markers may be nested, and
77*4882a593SmuzhiyunRCU treats a nested set as one big RCU read-side critical section.
78*4882a593SmuzhiyunProduction-quality implementations of ``rcu_read_lock()`` and
79*4882a593Smuzhiyun``rcu_read_unlock()`` are extremely lightweight, and in fact have
80*4882a593Smuzhiyunexactly zero overhead in Linux kernels built for production use with
81*4882a593Smuzhiyun``CONFIG_PREEMPT=n``.
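
To see how the overhead can be exactly zero, consider a simplified
sketch of the non-preemptible case (this is not the actual kernel
implementation, which also includes lockdep annotations and other
debugging hooks):

   ::

      /* Hedged sketch only: with CONFIG_PREEMPT=n, the preemption
       * calls below compile to nothing, so the markers emit no code. */
      static inline void rcu_read_lock(void)
      {
        preempt_disable();
      }

      static inline void rcu_read_unlock(void)
      {
        preempt_enable();
      }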

This guarantee allows ordering to be enforced with extremely low
overhead to readers, for example:

   ::

       1 int x, y;
       2
       3 void thread0(void)
       4 {
       5   rcu_read_lock();
       6   r1 = READ_ONCE(x);
       7   r2 = READ_ONCE(y);
       8   rcu_read_unlock();
       9 }
      10
      11 void thread1(void)
      12 {
      13   WRITE_ONCE(x, 1);
      14   synchronize_rcu();
      15   WRITE_ONCE(y, 1);
      16 }

Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
readers, any instance of ``thread0()`` that loads a value of zero from
``x`` must complete before ``thread1()`` stores to ``y``, so that
instance must also load a value of zero from ``y``. Similarly, any
instance of ``thread0()`` that loads a value of one from ``y`` must have
started after the ``synchronize_rcu()`` started, and must therefore also
load a value of one from ``x``. Therefore, the outcome:

   ::

      (r1 == 0 && r2 == 1)

cannot happen.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Wait a minute! You said that updaters can make useful forward         |
| progress concurrently with readers, but pre-existing readers will     |
| block ``synchronize_rcu()``!!!                                        |
| Just who are you trying to fool???                                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| First, if updaters do not wish to be blocked by readers, they can use |
| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later.     |
| Second, even when using ``synchronize_rcu()``, the other update-side  |
| code does run concurrently with readers, whether pre-existing or not. |
+-----------------------------------------------------------------------+
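
For example, an updater that must not block can use ``call_rcu()`` to
defer the reclamation until after a subsequent grace period. A minimal
sketch, assuming a hypothetical ``struct foo`` with an embedded
``rcu_head`` and a hypothetical ``free_foo_cb()`` callback:

   ::

      struct foo {
        int a;
        struct rcu_head rh;
      };

      static void free_foo_cb(struct rcu_head *rhp)
      {
        struct foo *fp = container_of(rhp, struct foo, rh);

        kfree(fp); /* Runs only after a grace period has elapsed. */
      }

      /* Updater: instead of "synchronize_rcu(); kfree(fp);": */
      call_rcu(&fp->rh, free_foo_cb);

When, as here, the callback does nothing but invoke ``kfree()``, the
``kfree_rcu(fp, rh)`` shorthand avoids writing the callback entirely.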

The earlier ``thread0()`` and ``thread1()`` scenario resembles one of
the first uses of RCU in
`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
distributed lock manager's transition into a state suitable for handling
recovery from node failure, more or less as follows:

   ::

       1 #define STATE_NORMAL        0
       2 #define STATE_WANT_RECOVERY 1
       3 #define STATE_RECOVERING    2
       4 #define STATE_WANT_NORMAL   3
       5
       6 int state = STATE_NORMAL;
       7
       8 void do_something_dlm(void)
       9 {
      10   int state_snap;
      11
      12   rcu_read_lock();
      13   state_snap = READ_ONCE(state);
      14   if (state_snap == STATE_NORMAL)
      15     do_something();
      16   else
      17     do_something_carefully();
      18   rcu_read_unlock();
      19 }
      20
      21 void start_recovery(void)
      22 {
      23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
      24   synchronize_rcu();
      25   WRITE_ONCE(state, STATE_RECOVERING);
      26   recovery();
      27   WRITE_ONCE(state, STATE_WANT_NORMAL);
      28   synchronize_rcu();
      29   WRITE_ONCE(state, STATE_NORMAL);
      30 }

The RCU read-side critical section in ``do_something_dlm()`` works with
the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
``do_something()`` never runs concurrently with ``recovery()``, but with
little or no synchronization overhead in ``do_something_dlm()``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why is the ``synchronize_rcu()`` on line 28 needed?                   |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Without that extra grace period, memory reordering could result in    |
| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
| the last bits of ``recovery()``.                                      |
+-----------------------------------------------------------------------+

In order to avoid fatal problems such as deadlocks, an RCU read-side
critical section must not contain calls to ``synchronize_rcu()``.
Similarly, an RCU read-side critical section must not contain anything
that waits, directly or indirectly, on completion of an invocation of
``synchronize_rcu()``.
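
The following hypothetical function is a minimal sketch of this
forbidden pattern: the ``synchronize_rcu()`` cannot return until the
enclosing RCU read-side critical section completes, but that critical
section cannot complete until ``synchronize_rcu()`` returns.

   ::

      void buggy_reader(void)
      {
        rcu_read_lock();
        do_something();
        synchronize_rcu(); /* BUG: waits on itself. Deadlock! */
        do_something_else();
        rcu_read_unlock();
      }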

Although RCU's grace-period guarantee is useful in and of itself, with
`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
be good to be able to use RCU to coordinate read-side access to linked
data structures. For this, the grace-period guarantee is not sufficient,
as can be seen in function ``add_gp_buggy()`` below. We will look at the
reader's code later, but in the meantime, just think of the reader as
locklessly picking up the ``gp`` pointer, and, if the value loaded is
non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.

   ::

       1 bool add_gp_buggy(int a, int b)
       2 {
       3   p = kmalloc(sizeof(*p), GFP_KERNEL);
       4   if (!p)
       5     return -ENOMEM;
       6   spin_lock(&gp_lock);
       7   if (rcu_access_pointer(gp)) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   p->a = a;
      12   p->b = b;
      13   gp = p; /* ORDERING BUG */
      14   spin_unlock(&gp_lock);
      15   return true;
      16 }

The problem is that both the compiler and weakly ordered CPUs are within
their rights to reorder this code as follows:

   ::

       1 bool add_gp_buggy_optimized(int a, int b)
       2 {
       3   p = kmalloc(sizeof(*p), GFP_KERNEL);
       4   if (!p)
       5     return -ENOMEM;
       6   spin_lock(&gp_lock);
       7   if (rcu_access_pointer(gp)) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   gp = p; /* ORDERING BUG */
      12   p->a = a;
      13   p->b = b;
      14   spin_unlock(&gp_lock);
      15   return true;
      16 }

If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized``
executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
And this is but one of many ways in which compiler and hardware
optimizations could cause trouble. Therefore, we clearly need some way
to prevent the compiler and the CPU from reordering in this manner,
which brings us to the publish-subscribe guarantee discussed in the next
section.

Publish/Subscribe Guarantee
~~~~~~~~~~~~~~~~~~~~~~~~~~~

RCU's publish-subscribe guarantee allows data to be inserted into a
linked data structure without disrupting RCU readers. The updater uses
``rcu_assign_pointer()`` to insert the new data, and readers use
``rcu_dereference()`` to access data, whether new or old. The following
shows an example of insertion:

   ::

       1 bool add_gp(int a, int b)
       2 {
       3   p = kmalloc(sizeof(*p), GFP_KERNEL);
       4   if (!p)
       5     return -ENOMEM;
       6   spin_lock(&gp_lock);
       7   if (rcu_access_pointer(gp)) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   p->a = a;
      12   p->b = b;
      13   rcu_assign_pointer(gp, p);
      14   spin_unlock(&gp_lock);
      15   return true;
      16 }

The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
simple assignment statement, but also guarantees that its assignment
will happen after the two assignments on lines 11 and 12, similar to the
C11 ``memory_order_release`` store operation. It also prevents any
number of “interesting” compiler optimizations, for example, the use of
``gp`` as a scratch location immediately preceding the assignment.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But ``rcu_assign_pointer()`` does nothing to prevent the two          |
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
| also cause problems?                                                  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, it cannot. The readers cannot see either of these two fields      |
| until the assignment to ``gp``, by which time both fields are fully   |
| initialized. So reordering the assignments to ``p->a`` and ``p->b``   |
| cannot possibly cause any problems.                                   |
+-----------------------------------------------------------------------+

It is tempting to assume that the reader need not do anything special to
control its accesses to the RCU-protected data, as shown in
``do_something_gp_buggy()`` below:

   ::

       1 bool do_something_gp_buggy(void)
       2 {
       3   rcu_read_lock();
       4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
       5   if (p) {
       6     do_something(p->a, p->b);
       7     rcu_read_unlock();
       8     return true;
       9   }
      10   rcu_read_unlock();
      11   return false;
      12 }

However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler (to say nothing of
`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
can trip this code up. For but one example, if the compiler were short
of registers, it might choose to refetch from ``gp`` rather than keeping
a separate copy in ``p`` as follows:

   ::

       1 bool do_something_gp_buggy_optimized(void)
       2 {
       3   rcu_read_lock();
       4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
       5     do_something(gp->a, gp->b);
       6     rcu_read_unlock();
       7     return true;
       8   }
       9   rcu_read_unlock();
      10   return false;
      11 }

If this function ran concurrently with a series of updates that replaced
the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could
cause serious confusion. To prevent this (and much else besides),
``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:

   ::

       1 bool do_something_gp(void)
       2 {
       3   rcu_read_lock();
       4   p = rcu_dereference(gp);
       5   if (p) {
       6     do_something(p->a, p->b);
       7     rcu_read_unlock();
       8     return true;
       9   }
      10   rcu_read_unlock();
      11   return false;
      12 }

The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory
barriers in the Linux kernel. Should a `high-quality implementation of
C11 ``memory_order_consume``
[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
ever appear, then ``rcu_dereference()`` could be implemented as a
``memory_order_consume`` load. Regardless of the exact implementation, a
pointer fetched by ``rcu_dereference()`` may not be used outside of the
outermost RCU read-side critical section containing that
``rcu_dereference()``, unless protection of the corresponding data
element has been passed from RCU to some other synchronization
mechanism, most commonly locking or `reference
counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.
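
For example, a reader can hand off protection from RCU to a reference
count before exiting its critical section, after which the element may
legitimately be used outside of that section. A hedged sketch, assuming
a hypothetical ``struct foo`` with an atomic ``->refcnt`` field that is
zero once the element is headed for the freelist:

   ::

      struct foo *get_gp_reference(void)
      {
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p && !atomic_inc_not_zero(&p->refcnt))
          p = NULL; /* Element already being freed: do not use. */
        rcu_read_unlock();
        return p;   /* Protection now comes from the reference count. */
      }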

In short, updaters use ``rcu_assign_pointer()`` and readers use
``rcu_dereference()``, and these two RCU API elements work together to
ensure that readers have a consistent view of newly added data elements.

Of course, it is also necessary to remove elements from RCU-protected
data structures, for example, using the following process:

#. Remove the data element from the enclosing structure.
#. Wait for all pre-existing RCU read-side critical sections to complete
   (because only pre-existing readers can possibly have a reference to
   the newly removed data element).
#. At this point, only the updater has a reference to the newly removed
   data element, so it can safely reclaim the data element, for example,
   by passing it to ``kfree()``.

This process is implemented by ``remove_gp_synchronous()``:

   ::

       1 bool remove_gp_synchronous(void)
       2 {
       3   struct foo *p;
       4
       5   spin_lock(&gp_lock);
       6   p = rcu_access_pointer(gp);
       7   if (!p) {
       8     spin_unlock(&gp_lock);
       9     return false;
      10   }
      11   rcu_assign_pointer(gp, NULL);
      12   spin_unlock(&gp_lock);
      13   synchronize_rcu();
      14   kfree(p);
      15   return true;
      16 }

This function is straightforward, with line 13 waiting for a grace
period before line 14 frees the old data element. This waiting ensures
that readers will reach line 7 of ``do_something_gp()`` before the data
element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
line 6 is similar to ``rcu_dereference()``, except that:

#. The value returned by ``rcu_access_pointer()`` cannot be
   dereferenced. If you want to access the value pointed to as well as
   the pointer itself, use ``rcu_dereference()`` instead of
   ``rcu_access_pointer()``.
#. The call to ``rcu_access_pointer()`` need not be protected. In
   contrast, ``rcu_dereference()`` must either be within an RCU
   read-side critical section or in a code segment where the pointer
   cannot change, for example, in code protected by the corresponding
   update-side lock.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``,    |
| what destructive optimizations might the compiler make use of?        |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Let's start with what happens to ``do_something_gp()`` if it fails to |
| use ``rcu_dereference()``. It could reuse a value formerly fetched    |
| from this same pointer. It could also fetch the pointer from ``gp``   |
| in a byte-at-a-time manner, resulting in *load tearing*, in turn      |
| yielding a bytewise mash-up of two distinct pointer values. It might  |
| even use value-speculation optimizations, where it makes a wrong      |
| guess, but by the time it gets around to checking the value, an       |
| update has changed the pointer to match the wrong guess. Too bad      |
| about any dereferences that returned pre-initialization garbage in    |
| the meantime!                                                         |
| For ``remove_gp_synchronous()``, as long as all modifications to      |
| ``gp`` are carried out while holding ``gp_lock``, the above           |
| optimizations are harmless. However, ``sparse`` will complain if you  |
| define ``gp`` with ``__rcu`` and then access it without using either  |
| ``rcu_access_pointer()`` or ``rcu_dereference()``.                    |
+-----------------------------------------------------------------------+
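
For reference, the following hedged sketch shows how ``gp`` might be
declared so that ``sparse`` can carry out this checking:

   ::

      /* The __rcu annotation marks gp as RCU-protected, causing
       * sparse to complain about any access that bypasses
       * rcu_dereference(), rcu_access_pointer(), and
       * rcu_assign_pointer(). */
      struct foo __rcu *gp;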

In short, RCU's publish-subscribe guarantee is provided by the
combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This
guarantee allows data elements to be safely added to RCU-protected
linked data structures without disrupting RCU readers. This guarantee
can be used in combination with the grace-period guarantee to also allow
data elements to be removed from RCU-protected linked data structures,
again without disrupting RCU readers.

This guarantee was only partially premeditated. DYNIX/ptx used an
explicit memory barrier for publication, but had nothing resembling
``rcu_dereference()`` for subscription, nor did it have anything
resembling the dependency-ordering barrier that was later subsumed
into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The
need for these operations made itself known quite suddenly at a
late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company. It took the Alpha architects a
good hour to convince me that any sort of barrier would ever be needed,
and it then took me a good *two* hours to convince them that their
documentation did not make this point clear. More recent work with the C
and C++ standards committees has provided much education on tricks and
traps from the compiler. In short, compilers were much less tricky in
the early 1990s, but in 2015, don't even think about omitting
``rcu_dereference()``!

Memory-Barrier Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~

The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
systems with more than one CPU:

#. Each CPU that has an RCU read-side critical section that begins
   before ``synchronize_rcu()`` starts is guaranteed to execute a full
   memory barrier between the time that the RCU read-side critical
   section ends and the time that ``synchronize_rcu()`` returns. Without
   this guarantee, a pre-existing RCU read-side critical section might
   hold a reference to the newly removed ``struct foo`` after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
#. Each CPU that has an RCU read-side critical section that ends after
   ``synchronize_rcu()`` returns is guaranteed to execute a full memory
   barrier between the time that ``synchronize_rcu()`` begins and the
   time that the RCU read-side critical section begins. Without this
   guarantee, a later RCU read-side critical section running after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run
   ``do_something_gp()`` and find the newly deleted ``struct foo``.
#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
   then that CPU is guaranteed to execute a full memory barrier sometime
   during the execution of ``synchronize_rcu()``. This guarantee ensures
   that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
   does execute after the removal on line 11.
#. If the task invoking ``synchronize_rcu()`` migrates among a group of
   CPUs during that invocation, then each of the CPUs in that group is
   guaranteed to execute a full memory barrier sometime during the
   execution of ``synchronize_rcu()``. This guarantee also ensures that
   the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
   execute after the removal on line 11, even in the case where the
   thread executing the ``synchronize_rcu()`` migrates in the meantime.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that multiple CPUs can start RCU read-side critical sections at |
| any time without any ordering whatsoever, how can RCU possibly tell   |
| whether or not a given RCU read-side critical section starts before a |
| given instance of ``synchronize_rcu()``?                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| If RCU cannot tell whether or not a given RCU read-side critical      |
| section starts before a given instance of ``synchronize_rcu()``, then |
| it must assume that the RCU read-side critical section started first. |
| In other words, a given instance of ``synchronize_rcu()`` can avoid   |
| waiting on a given RCU read-side critical section only if it can      |
| prove that ``synchronize_rcu()`` started first.                       |
| A related question is “When ``rcu_read_lock()`` doesn't generate any  |
| code, why does it matter how it relates to a grace period?” The       |
| answer is that it is not the relationship of ``rcu_read_lock()``      |
| itself that is important, but rather the relationship of the code     |
| within the enclosed RCU read-side critical section to the code        |
| preceding and following the grace period. If we take this viewpoint,  |
| then a given RCU read-side critical section begins before a given     |
| grace period when some access preceding the grace period observes the |
| effect of some access within the critical section, in which case none |
| of the accesses within the critical section may observe the effects   |
| of any access following the grace period.                             |
|                                                                       |
| As of late 2016, mathematical models of RCU take this viewpoint, for  |
| example, see slides 62 and 63 of the `2016 LinuxCon                   |
| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
| 6.10.04c.LCE.pdf>`__                                                  |
| presentation.                                                         |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| The first and second guarantees require unbelievably strict ordering! |
| Are all these memory barriers *really* required?                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Yes, they really are required. To see why the first guarantee is      |
| required, consider the following sequence of events:                  |
|                                                                       |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``do_something_with(q->a);``                                |
|    ``/* No smp_mb(), so might happen after kfree(). */``              |
| #. CPU 1: ``rcu_read_unlock()``                                       |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
|                                                                       |
| Therefore, there absolutely must be a full memory barrier between the |
| end of the RCU read-side critical section and the end of the grace    |
| period.                                                               |
|                                                                       |
| The sequence of events demonstrating the necessity of the second rule |
| is roughly similar:                                                   |
|                                                                       |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp);``                                |
|    ``/* Might return p if no memory barrier. */``                     |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */``                  |
| #. CPU 1: ``rcu_read_unlock()``                                       |
|                                                                       |
| And similarly, without a memory barrier between the beginning of the  |
| grace period and the beginning of the RCU read-side critical section, |
| CPU 1 might end up accessing the freelist.                            |
|                                                                       |
| The “as if” rule of course applies, so that any implementation that   |
| acts as if the appropriate memory barriers were in place is a correct |
| implementation. That said, it is much easier to fool yourself into    |
| believing that you have adhered to the as-if rule than it is to       |
| actually adhere to it!                                                |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code in some kernel builds. This means that the         |
| compiler might arbitrarily rearrange consecutive RCU read-side        |
| critical sections. Given such rearrangement, if a given RCU read-side |
| critical section is done, how can you be sure that all prior RCU      |
| read-side critical sections are done? Won't the compiler              |
| rearrangements make that impossible to determine?                     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code, RCU infers quiescent states only at special       |
| locations, for example, within the scheduler. Because calls to        |
| ``schedule()`` had better prevent calling-code accesses to shared     |
| variables from being rearranged across the call to ``schedule()``, if |
| RCU detects the end of a given RCU read-side critical section, it     |
| will necessarily detect the end of all prior RCU read-side critical   |
| sections, no matter how aggressively the compiler scrambles the code. |
| Again, this all assumes that the compiler cannot scramble code across |
| calls to the scheduler, out of interrupt handlers, into the idle      |
| loop, into user-mode code, and so on. But if your kernel build allows |
| that sort of scrambling, you have broken far more than just RCU!      |
+-----------------------------------------------------------------------+

Note that these memory-barrier requirements do not replace the
fundamental RCU requirement that a grace period wait for all
pre-existing readers. On the contrary, the memory barriers called out in
this section must operate in such a way as to *enforce* this fundamental
requirement. Of course, different implementations enforce this
requirement in different ways, but enforce it they must.

RCU Primitives Guaranteed to Execute Unconditionally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The common-case RCU primitives are unconditional. They are invoked, they
do their job, and they return, with no possibility of error, and no need
to retry. This is a key RCU design philosophy.

However, this philosophy is pragmatic rather than pigheaded. If someone
comes up with a good justification for a particular conditional RCU
primitive, it might well be implemented and added. After all, this
guarantee was reverse-engineered, not premeditated. The unconditional
nature of the RCU primitives was initially an accident of
implementation, and later experience with other synchronization
primitives that do provide conditional variants caused me to elevate
this accident to a guarantee. Therefore, the justification for adding a
conditional primitive to RCU would need to be based on detailed and
compelling use cases.

Guaranteed Read-to-Write Upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As far as RCU is concerned, it is always possible to carry out an update
within an RCU read-side critical section. For example, that RCU
read-side critical section might search for a given data element, and
then might acquire the update-side spinlock in order to update that
element, all while remaining in that RCU read-side critical section. Of
course, it is necessary to exit the RCU read-side critical section
before invoking ``synchronize_rcu()``; however, this inconvenience can
be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
members described later in this document.
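
A hedged sketch of this read-to-write upgrade, assuming a hypothetical
``struct foo`` with ``key``, ``data``, and ``list`` fields, linked into
an RCU-protected list headed by ``foo_head`` and protected on the
update side by ``foo_lock``:

   ::

      void upgrade_and_update_foo(int key, int newval)
      {
        struct foo *p;

        rcu_read_lock(); /* Protects the lockless search. */
        list_for_each_entry_rcu(p, &foo_head, list) {
          if (p->key == key) {
            spin_lock(&foo_lock);  /* Upgrade to write. */
            WRITE_ONCE(p->data, newval);
            spin_unlock(&foo_lock);
            break;
          }
        }
        rcu_read_unlock();
      }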

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But how does the upgrade-to-write operation exclude other readers?    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| It doesn't, just like normal RCU updates, which also do not exclude   |
| RCU readers.                                                          |
+-----------------------------------------------------------------------+

This guarantee allows lookup code to be shared between read-side and
update-side code, and was premeditated, appearing in the earliest
DYNIX/ptx RCU documentation.

Fundamental Non-Requirements
----------------------------

RCU provides extremely lightweight readers, and its read-side
guarantees, though quite useful, are correspondingly lightweight. It is
therefore all too easy to assume that RCU is guaranteeing more than it
really is. Of course, the list of things that RCU does not guarantee is
infinitely long; however, the following sections list a few
non-guarantees that have caused confusion. Except where otherwise noted,
these non-guarantees were premeditated.

#. `Readers Impose Minimal Ordering`_
#. `Readers Do Not Exclude Updaters`_
#. `Updaters Only Wait For Old Readers`_
#. `Grace Periods Don't Partition Read-Side Critical Sections`_
#. `Read-Side Critical Sections Don't Partition Grace Periods`_

Readers Impose Minimal Ordering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reader-side markers such as ``rcu_read_lock()`` and
``rcu_read_unlock()`` provide absolutely no ordering guarantees except
through their interaction with the grace-period APIs such as
``synchronize_rcu()``. To see this, consider the following pair of
threads:

   ::

      void thread0(void)
      {
        rcu_read_lock();
        WRITE_ONCE(x, 1);
        rcu_read_unlock();
        rcu_read_lock();
        WRITE_ONCE(y, 1);
        rcu_read_unlock();
      }

      void thread1(void)
      {
        rcu_read_lock();
        r1 = READ_ONCE(y);
        rcu_read_unlock();
        rcu_read_lock();
        r2 = READ_ONCE(x);
        rcu_read_unlock();
      }

After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
possible to have

   ::

      (r1 == 1 && r2 == 0)

(that is, ``y`` appears to have been assigned before ``x``), which would
not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
much in the way of ordering properties. But they do not, so the CPU is
within its rights to do significant reordering. This is by design: Any
significant ordering constraints would slow down these fast-path APIs.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Can't the compiler also reorder this code?                            |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()``        |
| prevent the compiler from reordering in this particular case.         |
+-----------------------------------------------------------------------+
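
As a simplified sketch (the kernel's actual definitions handle
additional cases), these volatile casts look something like the
following:

   ::

      /* Hedged sketch: each use compiles to exactly one access, and
       * the compiler must not reorder volatile accesses with respect
       * to one another. The CPU remains free to reorder them. */
      #define READ_ONCE(x) \
        (*(const volatile typeof(x) *)&(x))
      #define WRITE_ONCE(x, val) \
        do { *(volatile typeof(x) *)&(x) = (val); } while (0)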

Readers Do Not Exclude Updaters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` excludes updates.
All they do is prevent grace periods from ending. The following
example illustrates this:

   ::

      void thread0(void)
      {
        rcu_read_lock();
        r1 = READ_ONCE(y);
        if (r1) {
          do_something_with_nonzero_x();
          r2 = READ_ONCE(x);
          WARN_ON(!r2); /* BUG!!! */
        }
        rcu_read_unlock();
      }

      void thread1(void)
      {
        spin_lock(&my_lock);
        WRITE_ONCE(x, 1);
        WRITE_ONCE(y, 1);
        spin_unlock(&my_lock);
      }

If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
the fact is that ``rcu_read_lock()`` does not exclude much of anything
aside from subsequent grace periods, of which ``thread1()`` has none, so
the ``WARN_ON()`` can and does fire.

Updaters Only Wait For Old Readers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It might be tempting to assume that after ``synchronize_rcu()``
completes, there are no readers executing. This temptation must be
avoided because new readers can start immediately after
``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
obligation to wait for these new readers.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Suppose ``synchronize_rcu()`` did wait until *all* readers had        |
| completed instead of waiting only on pre-existing readers. For how    |
| long would the updater be able to rely on there being no readers?     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| For no time at all. Even if ``synchronize_rcu()`` were to wait until  |
| all readers had completed, a new reader might start immediately after |
| ``synchronize_rcu()`` completed. Therefore, the code following        |
| ``synchronize_rcu()`` can *never* rely on there being no readers.     |
+-----------------------------------------------------------------------+
807*4882a593Smuzhiyun
808*4882a593SmuzhiyunGrace Periods Don't Partition Read-Side Critical Sections
809*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
810*4882a593Smuzhiyun
811*4882a593SmuzhiyunIt is tempting to assume that if any part of one RCU read-side critical
812*4882a593Smuzhiyunsection precedes a given grace period, and if any part of another RCU
813*4882a593Smuzhiyunread-side critical section follows that same grace period, then all of
814*4882a593Smuzhiyunthe first RCU read-side critical section must precede all of the second.
815*4882a593SmuzhiyunHowever, this just isn't the case: A single grace period does not
816*4882a593Smuzhiyunpartition the set of RCU read-side critical sections. An example of this
817*4882a593Smuzhiyunsituation can be illustrated as follows, where ``x``, ``y``, and ``z``
818*4882a593Smuzhiyunare initially all zero:
819*4882a593Smuzhiyun
820*4882a593Smuzhiyun   ::
821*4882a593Smuzhiyun
822*4882a593Smuzhiyun       1 void thread0(void)
823*4882a593Smuzhiyun       2 {
824*4882a593Smuzhiyun       3   rcu_read_lock();
825*4882a593Smuzhiyun       4   WRITE_ONCE(a, 1);
826*4882a593Smuzhiyun       5   WRITE_ONCE(b, 1);
827*4882a593Smuzhiyun       6   rcu_read_unlock();
828*4882a593Smuzhiyun       7 }
829*4882a593Smuzhiyun       8
830*4882a593Smuzhiyun       9 void thread1(void)
831*4882a593Smuzhiyun      10 {
832*4882a593Smuzhiyun      11   r1 = READ_ONCE(a);
833*4882a593Smuzhiyun      12   synchronize_rcu();
834*4882a593Smuzhiyun      13   WRITE_ONCE(c, 1);
835*4882a593Smuzhiyun      14 }
836*4882a593Smuzhiyun      15
837*4882a593Smuzhiyun      16 void thread2(void)
838*4882a593Smuzhiyun      17 {
839*4882a593Smuzhiyun      18   rcu_read_lock();
840*4882a593Smuzhiyun      19   r2 = READ_ONCE(b);
841*4882a593Smuzhiyun      20   r3 = READ_ONCE(c);
842*4882a593Smuzhiyun      21   rcu_read_unlock();
843*4882a593Smuzhiyun      22 }
844*4882a593Smuzhiyun
845*4882a593SmuzhiyunIt turns out that the outcome:
846*4882a593Smuzhiyun
847*4882a593Smuzhiyun   ::
848*4882a593Smuzhiyun
849*4882a593Smuzhiyun      (r1 == 1 && r2 == 0 && r3 == 1)
850*4882a593Smuzhiyun
is entirely possible. The following figure shows how this can happen,
852*4882a593Smuzhiyunwith each circled ``QS`` indicating the point at which RCU recorded a
853*4882a593Smuzhiyun*quiescent state* for each thread, that is, a state in which RCU knows
854*4882a593Smuzhiyunthat the thread cannot be in the midst of an RCU read-side critical
855*4882a593Smuzhiyunsection that started before the current grace period:
856*4882a593Smuzhiyun
857*4882a593Smuzhiyun.. kernel-figure:: GPpartitionReaders1.svg
858*4882a593Smuzhiyun
859*4882a593SmuzhiyunIf it is necessary to partition RCU read-side critical sections in this
manner, then two grace periods are required, where the first grace
861*4882a593Smuzhiyunperiod is known to end before the second grace period starts:
862*4882a593Smuzhiyun
863*4882a593Smuzhiyun   ::
864*4882a593Smuzhiyun
865*4882a593Smuzhiyun       1 void thread0(void)
866*4882a593Smuzhiyun       2 {
867*4882a593Smuzhiyun       3   rcu_read_lock();
868*4882a593Smuzhiyun       4   WRITE_ONCE(a, 1);
869*4882a593Smuzhiyun       5   WRITE_ONCE(b, 1);
870*4882a593Smuzhiyun       6   rcu_read_unlock();
871*4882a593Smuzhiyun       7 }
872*4882a593Smuzhiyun       8
873*4882a593Smuzhiyun       9 void thread1(void)
874*4882a593Smuzhiyun      10 {
875*4882a593Smuzhiyun      11   r1 = READ_ONCE(a);
876*4882a593Smuzhiyun      12   synchronize_rcu();
877*4882a593Smuzhiyun      13   WRITE_ONCE(c, 1);
878*4882a593Smuzhiyun      14 }
879*4882a593Smuzhiyun      15
880*4882a593Smuzhiyun      16 void thread2(void)
881*4882a593Smuzhiyun      17 {
882*4882a593Smuzhiyun      18   r2 = READ_ONCE(c);
883*4882a593Smuzhiyun      19   synchronize_rcu();
884*4882a593Smuzhiyun      20   WRITE_ONCE(d, 1);
885*4882a593Smuzhiyun      21 }
886*4882a593Smuzhiyun      22
887*4882a593Smuzhiyun      23 void thread3(void)
888*4882a593Smuzhiyun      24 {
889*4882a593Smuzhiyun      25   rcu_read_lock();
890*4882a593Smuzhiyun      26   r3 = READ_ONCE(b);
891*4882a593Smuzhiyun      27   r4 = READ_ONCE(d);
892*4882a593Smuzhiyun      28   rcu_read_unlock();
893*4882a593Smuzhiyun      29 }
894*4882a593Smuzhiyun
895*4882a593SmuzhiyunHere, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen
896*4882a593Smuzhiyunbefore the end of ``thread1()``'s grace period. If in addition
897*4882a593Smuzhiyun``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after
898*4882a593Smuzhiyunthe beginning of ``thread2()``'s grace period. If it is also the case
899*4882a593Smuzhiyunthat ``(r2 == 1)``, then the end of ``thread1()``'s grace period must
precede the beginning of ``thread2()``'s grace period. This means that
901*4882a593Smuzhiyunthe two RCU read-side critical sections cannot overlap, guaranteeing
902*4882a593Smuzhiyunthat ``(r3 == 1)``. As a result, the outcome:
903*4882a593Smuzhiyun
904*4882a593Smuzhiyun   ::
905*4882a593Smuzhiyun
906*4882a593Smuzhiyun      (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
907*4882a593Smuzhiyun
908*4882a593Smuzhiyuncannot happen.
909*4882a593Smuzhiyun
910*4882a593SmuzhiyunThis non-requirement was also non-premeditated, but became apparent when
911*4882a593Smuzhiyunstudying RCU's interaction with memory ordering.
912*4882a593Smuzhiyun
913*4882a593SmuzhiyunRead-Side Critical Sections Don't Partition Grace Periods
914*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
915*4882a593Smuzhiyun
916*4882a593SmuzhiyunIt is also tempting to assume that if an RCU read-side critical section
917*4882a593Smuzhiyunhappens between a pair of grace periods, then those grace periods cannot
918*4882a593Smuzhiyunoverlap. However, this temptation leads nowhere good, as can be
919*4882a593Smuzhiyunillustrated by the following, with all variables initially zero:
920*4882a593Smuzhiyun
921*4882a593Smuzhiyun   ::
922*4882a593Smuzhiyun
923*4882a593Smuzhiyun       1 void thread0(void)
924*4882a593Smuzhiyun       2 {
925*4882a593Smuzhiyun       3   rcu_read_lock();
926*4882a593Smuzhiyun       4   WRITE_ONCE(a, 1);
927*4882a593Smuzhiyun       5   WRITE_ONCE(b, 1);
928*4882a593Smuzhiyun       6   rcu_read_unlock();
929*4882a593Smuzhiyun       7 }
930*4882a593Smuzhiyun       8
931*4882a593Smuzhiyun       9 void thread1(void)
932*4882a593Smuzhiyun      10 {
933*4882a593Smuzhiyun      11   r1 = READ_ONCE(a);
934*4882a593Smuzhiyun      12   synchronize_rcu();
935*4882a593Smuzhiyun      13   WRITE_ONCE(c, 1);
936*4882a593Smuzhiyun      14 }
937*4882a593Smuzhiyun      15
938*4882a593Smuzhiyun      16 void thread2(void)
939*4882a593Smuzhiyun      17 {
940*4882a593Smuzhiyun      18   rcu_read_lock();
941*4882a593Smuzhiyun      19   WRITE_ONCE(d, 1);
942*4882a593Smuzhiyun      20   r2 = READ_ONCE(c);
943*4882a593Smuzhiyun      21   rcu_read_unlock();
944*4882a593Smuzhiyun      22 }
945*4882a593Smuzhiyun      23
946*4882a593Smuzhiyun      24 void thread3(void)
947*4882a593Smuzhiyun      25 {
948*4882a593Smuzhiyun      26   r3 = READ_ONCE(d);
949*4882a593Smuzhiyun      27   synchronize_rcu();
950*4882a593Smuzhiyun      28   WRITE_ONCE(e, 1);
951*4882a593Smuzhiyun      29 }
952*4882a593Smuzhiyun      30
953*4882a593Smuzhiyun      31 void thread4(void)
954*4882a593Smuzhiyun      32 {
955*4882a593Smuzhiyun      33   rcu_read_lock();
956*4882a593Smuzhiyun      34   r4 = READ_ONCE(b);
957*4882a593Smuzhiyun      35   r5 = READ_ONCE(e);
958*4882a593Smuzhiyun      36   rcu_read_unlock();
959*4882a593Smuzhiyun      37 }
960*4882a593Smuzhiyun
961*4882a593SmuzhiyunIn this case, the outcome:
962*4882a593Smuzhiyun
963*4882a593Smuzhiyun   ::
964*4882a593Smuzhiyun
965*4882a593Smuzhiyun      (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
966*4882a593Smuzhiyun
967*4882a593Smuzhiyunis entirely possible, as illustrated below:
968*4882a593Smuzhiyun
969*4882a593Smuzhiyun.. kernel-figure:: ReadersPartitionGP1.svg
970*4882a593Smuzhiyun
971*4882a593SmuzhiyunAgain, an RCU read-side critical section can overlap almost all of a
972*4882a593Smuzhiyungiven grace period, just so long as it does not overlap the entire grace
973*4882a593Smuzhiyunperiod. As a result, an RCU read-side critical section cannot partition
974*4882a593Smuzhiyuna pair of RCU grace periods.
975*4882a593Smuzhiyun
976*4882a593Smuzhiyun+-----------------------------------------------------------------------+
977*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
978*4882a593Smuzhiyun+-----------------------------------------------------------------------+
979*4882a593Smuzhiyun| How long a sequence of grace periods, each separated by an RCU        |
980*4882a593Smuzhiyun| read-side critical section, would be required to partition the RCU    |
981*4882a593Smuzhiyun| read-side critical sections at the beginning and end of the chain?    |
982*4882a593Smuzhiyun+-----------------------------------------------------------------------+
983*4882a593Smuzhiyun| **Answer**:                                                           |
984*4882a593Smuzhiyun+-----------------------------------------------------------------------+
985*4882a593Smuzhiyun| In theory, an infinite number. In practice, an unknown number that is |
986*4882a593Smuzhiyun| sensitive to both implementation details and timing considerations.   |
987*4882a593Smuzhiyun| Therefore, even in practice, RCU users must abide by the theoretical  |
988*4882a593Smuzhiyun| rather than the practical answer.                                     |
989*4882a593Smuzhiyun+-----------------------------------------------------------------------+
990*4882a593Smuzhiyun
991*4882a593SmuzhiyunParallelism Facts of Life
992*4882a593Smuzhiyun-------------------------
993*4882a593Smuzhiyun
994*4882a593SmuzhiyunThese parallelism facts of life are by no means specific to RCU, but the
995*4882a593SmuzhiyunRCU implementation must abide by them. They therefore bear repeating:
996*4882a593Smuzhiyun
997*4882a593Smuzhiyun#. Any CPU or task may be delayed at any time, and any attempts to avoid
998*4882a593Smuzhiyun   these delays by disabling preemption, interrupts, or whatever are
999*4882a593Smuzhiyun   completely futile. This is most obvious in preemptible user-level
1000*4882a593Smuzhiyun   environments and in virtualized environments (where a given guest
1001*4882a593Smuzhiyun   OS's VCPUs can be preempted at any time by the underlying
1002*4882a593Smuzhiyun   hypervisor), but can also happen in bare-metal environments due to
1003*4882a593Smuzhiyun   ECC errors, NMIs, and other hardware events. Although a delay of more
1004*4882a593Smuzhiyun   than about 20 seconds can result in splats, the RCU implementation is
1005*4882a593Smuzhiyun   obligated to use algorithms that can tolerate extremely long delays,
1006*4882a593Smuzhiyun   but where “extremely long” is not long enough to allow wrap-around
1007*4882a593Smuzhiyun   when incrementing a 64-bit counter.
1008*4882a593Smuzhiyun#. Both the compiler and the CPU can reorder memory accesses. Where it
1009*4882a593Smuzhiyun   matters, RCU must use compiler directives and memory-barrier
1010*4882a593Smuzhiyun   instructions to preserve ordering.
1011*4882a593Smuzhiyun#. Conflicting writes to memory locations in any given cache line will
1012*4882a593Smuzhiyun   result in expensive cache misses. Greater numbers of concurrent
1013*4882a593Smuzhiyun   writes and more-frequent concurrent writes will result in more
1014*4882a593Smuzhiyun   dramatic slowdowns. RCU is therefore obligated to use algorithms that
1015*4882a593Smuzhiyun   have sufficient locality to avoid significant performance and
1016*4882a593Smuzhiyun   scalability problems.
1017*4882a593Smuzhiyun#. As a rough rule of thumb, only one CPU's worth of processing may be
1018*4882a593Smuzhiyun   carried out under the protection of any given exclusive lock. RCU
1019*4882a593Smuzhiyun   must therefore use scalable locking designs.
1020*4882a593Smuzhiyun#. Counters are finite, especially on 32-bit systems. RCU's use of
1021*4882a593Smuzhiyun   counters must therefore tolerate counter wrap, or be designed such
1022*4882a593Smuzhiyun   that counter wrap would take way more time than a single system is
1023*4882a593Smuzhiyun   likely to run. An uptime of ten years is quite possible, a runtime of
1024*4882a593Smuzhiyun   a century much less so. As an example of the latter, RCU's
1025*4882a593Smuzhiyun   dyntick-idle nesting counter allows 54 bits for interrupt nesting
1026*4882a593Smuzhiyun   level (this counter is 64 bits even on a 32-bit system). Overflowing
1027*4882a593Smuzhiyun   this counter requires 2\ :sup:`54` half-interrupts on a given CPU
1028*4882a593Smuzhiyun   without that CPU ever going idle. If a half-interrupt happened every
1029*4882a593Smuzhiyun   microsecond, it would take 570 years of runtime to overflow this
   counter, which is currently believed to be an acceptably long time
   (see the wrap-tolerant comparison sketched at the end of this
   section).
1031*4882a593Smuzhiyun#. Linux systems can have thousands of CPUs running a single Linux
1032*4882a593Smuzhiyun   kernel in a single shared-memory environment. RCU must therefore pay
1033*4882a593Smuzhiyun   close attention to high-end scalability.
1034*4882a593Smuzhiyun
1035*4882a593SmuzhiyunThis last parallelism fact of life means that RCU must pay special
1036*4882a593Smuzhiyunattention to the preceding facts of life. The idea that Linux might
1037*4882a593Smuzhiyunscale to systems with thousands of CPUs would have been met with some
1038*4882a593Smuzhiyunskepticism in the 1990s, but these requirements would have otherwise
1039*4882a593Smuzhiyunhave been unsurprising, even in the early 1990s.
1040*4882a593Smuzhiyun
1041*4882a593SmuzhiyunQuality-of-Implementation Requirements
1042*4882a593Smuzhiyun--------------------------------------
1043*4882a593Smuzhiyun
1044*4882a593SmuzhiyunThese sections list quality-of-implementation requirements. Although an
1045*4882a593SmuzhiyunRCU implementation that ignores these requirements could still be used,
1046*4882a593Smuzhiyunit would likely be subject to limitations that would make it
1047*4882a593Smuzhiyuninappropriate for industrial-strength production use. Classes of
1048*4882a593Smuzhiyunquality-of-implementation requirements are as follows:
1049*4882a593Smuzhiyun
1050*4882a593Smuzhiyun#. `Specialization`_
1051*4882a593Smuzhiyun#. `Performance and Scalability`_
1052*4882a593Smuzhiyun#. `Forward Progress`_
1053*4882a593Smuzhiyun#. `Composability`_
1054*4882a593Smuzhiyun#. `Corner Cases`_
1055*4882a593Smuzhiyun
These classes are covered in the following sections.
1057*4882a593Smuzhiyun
1058*4882a593SmuzhiyunSpecialization
1059*4882a593Smuzhiyun~~~~~~~~~~~~~~
1060*4882a593Smuzhiyun
1061*4882a593SmuzhiyunRCU is and always has been intended primarily for read-mostly
1062*4882a593Smuzhiyunsituations, which means that RCU's read-side primitives are optimized,
1063*4882a593Smuzhiyunoften at the expense of its update-side primitives. Experience thus far
1064*4882a593Smuzhiyunis captured by the following list of situations:
1065*4882a593Smuzhiyun
1066*4882a593Smuzhiyun#. Read-mostly data, where stale and inconsistent data is not a problem:
1067*4882a593Smuzhiyun   RCU works great!
1068*4882a593Smuzhiyun#. Read-mostly data, where data must be consistent: RCU works well.
1069*4882a593Smuzhiyun#. Read-write data, where data must be consistent: RCU *might* work OK.
1070*4882a593Smuzhiyun   Or not.
1071*4882a593Smuzhiyun#. Write-mostly data, where data must be consistent: RCU is very
1072*4882a593Smuzhiyun   unlikely to be the right tool for the job, with the following
1073*4882a593Smuzhiyun   exceptions, where RCU can provide:
1074*4882a593Smuzhiyun
1075*4882a593Smuzhiyun   a. Existence guarantees for update-friendly mechanisms.
1076*4882a593Smuzhiyun   b. Wait-free read-side primitives for real-time use.
1077*4882a593Smuzhiyun
1078*4882a593SmuzhiyunThis focus on read-mostly situations means that RCU must interoperate
1079*4882a593Smuzhiyunwith other synchronization primitives. For example, the ``add_gp()`` and
1080*4882a593Smuzhiyun``remove_gp_synchronous()`` examples discussed earlier use RCU to
1081*4882a593Smuzhiyunprotect readers and locking to coordinate updaters. However, the need
1082*4882a593Smuzhiyunextends much farther, requiring that a variety of synchronization
1083*4882a593Smuzhiyunprimitives be legal within RCU read-side critical sections, including
1084*4882a593Smuzhiyunspinlocks, sequence locks, atomic operations, reference counters, and
1085*4882a593Smuzhiyunmemory barriers.
1086*4882a593Smuzhiyun
1087*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1088*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
1089*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1090*4882a593Smuzhiyun| What about sleeping locks?                                            |
1091*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1092*4882a593Smuzhiyun| **Answer**:                                                           |
1093*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1094*4882a593Smuzhiyun| These are forbidden within Linux-kernel RCU read-side critical        |
1095*4882a593Smuzhiyun| sections because it is not legal to place a quiescent state (in this  |
1096*4882a593Smuzhiyun| case, voluntary context switch) within an RCU read-side critical      |
1097*4882a593Smuzhiyun| section. However, sleeping locks may be used within userspace RCU     |
1098*4882a593Smuzhiyun| read-side critical sections, and also within Linux-kernel sleepable   |
1099*4882a593Smuzhiyun| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In     |
| addition, the -rt patchset turns spinlocks into sleeping locks so     |
| that the corresponding critical sections can be preempted, which also |
| means that these sleeplockified spinlocks (but not other sleeping     |
| locks!) may be acquired within -rt-Linux-kernel RCU read-side         |
| critical sections.                                                    |
| Note that it *is* legal for a normal RCU read-side critical section   |
| to conditionally acquire a sleeping lock (as in                       |
| ``mutex_trylock()``), but only as long as it does not loop            |
| indefinitely attempting to conditionally acquire that sleeping lock.  |
1109*4882a593Smuzhiyun| The key point is that things like ``mutex_trylock()`` either return   |
1110*4882a593Smuzhiyun| with the mutex held, or return an error indication if the mutex was   |
1111*4882a593Smuzhiyun| not immediately available. Either way, ``mutex_trylock()`` returns    |
1112*4882a593Smuzhiyun| immediately without sleeping.                                         |
1113*4882a593Smuzhiyun+-----------------------------------------------------------------------+
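
For example, the following sketch (in which the ``->mtx`` field and the
``do_update()`` helper are hypothetical) conditionally acquires a
sleeping lock from within an RCU read-side critical section without
ever sleeping:

   ::

      rcu_read_lock();
      p = rcu_dereference(gp);
      if (p && mutex_trylock(&p->mtx)) { /* Never sleeps. */
        do_update(p);
        mutex_unlock(&p->mtx);
      } /* On failure, give up rather than looping indefinitely. */
      rcu_read_unlock();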
1114*4882a593Smuzhiyun
It often comes as a surprise, but many algorithms do not require a
consistent view of data and can function quite well without one, with
network routing being the poster child. Internet routing algorithms take
1118*4882a593Smuzhiyunsignificant time to propagate updates, so that by the time an update
1119*4882a593Smuzhiyunarrives at a given system, that system has been sending network traffic
1120*4882a593Smuzhiyunthe wrong way for a considerable length of time. Having a few threads
1121*4882a593Smuzhiyuncontinue to send traffic the wrong way for a few more milliseconds is
1122*4882a593Smuzhiyunclearly not a problem: In the worst case, TCP retransmissions will
1123*4882a593Smuzhiyuneventually get the data where it needs to go. In general, when tracking
1124*4882a593Smuzhiyunthe state of the universe outside of the computer, some level of
1125*4882a593Smuzhiyuninconsistency must be tolerated due to speed-of-light delays if nothing
1126*4882a593Smuzhiyunelse.
1127*4882a593Smuzhiyun
1128*4882a593SmuzhiyunFurthermore, uncertainty about external state is inherent in many cases.
1129*4882a593SmuzhiyunFor example, a pair of veterinarians might use heartbeat to determine
1130*4882a593Smuzhiyunwhether or not a given cat was alive. But how long should they wait
1131*4882a593Smuzhiyunafter the last heartbeat to decide that the cat is in fact dead? Waiting
1132*4882a593Smuzhiyunless than 400 milliseconds makes no sense because this would mean that a
1133*4882a593Smuzhiyunrelaxed cat would be considered to cycle between death and life more
1134*4882a593Smuzhiyunthan 100 times per minute. Moreover, just as with human beings, a cat's
1135*4882a593Smuzhiyunheart might stop for some period of time, so the exact wait period is a
1136*4882a593Smuzhiyunjudgment call. One of our pair of veterinarians might wait 30 seconds
1137*4882a593Smuzhiyunbefore pronouncing the cat dead, while the other might insist on waiting
1138*4882a593Smuzhiyuna full minute. The two veterinarians would then disagree on the state of
1139*4882a593Smuzhiyunthe cat during the final 30 seconds of the minute following the last
1140*4882a593Smuzhiyunheartbeat.
1141*4882a593Smuzhiyun
1142*4882a593SmuzhiyunInterestingly enough, this same situation applies to hardware. When push
1143*4882a593Smuzhiyuncomes to shove, how do we tell whether or not some external server has
1144*4882a593Smuzhiyunfailed? We send messages to it periodically, and declare it failed if we
1145*4882a593Smuzhiyundon't receive a response within a given period of time. Policy decisions
1146*4882a593Smuzhiyuncan usually tolerate short periods of inconsistency. The policy was
1147*4882a593Smuzhiyundecided some time ago, and is only now being put into effect, so a few
1148*4882a593Smuzhiyunmilliseconds of delay is normally inconsequential.
1149*4882a593Smuzhiyun
1150*4882a593SmuzhiyunHowever, there are algorithms that absolutely must see consistent data.
For example, the translation from a user-level SystemV semaphore ID
1152*4882a593Smuzhiyunto the corresponding in-kernel data structure is protected by RCU, but
1153*4882a593Smuzhiyunit is absolutely forbidden to update a semaphore that has just been
1154*4882a593Smuzhiyunremoved. In the Linux kernel, this need for consistency is accommodated
1155*4882a593Smuzhiyunby acquiring spinlocks located in the in-kernel data structure from
within the RCU read-side critical section, as sketched below. Many
other techniques may be used, and
1158*4882a593Smuzhiyunare in fact used within the Linux kernel.
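
The sketch promised above follows, loosely modeled on the
SystemV-semaphore case. The ``struct sem_obj``, its fields, and the
``sem_lookup()`` helper are all illustrative: the reader uses RCU only
for the lookup, then acquires a per-object spinlock and rechecks a
deletion flag before carrying out a fully consistent update.

   ::

      struct sem_obj {
        spinlock_t lock;
        bool deleted;           /* Set under ->lock before RCU removal. */
        /* ... */
      };

      int update_sem(int id)
      {
        struct sem_obj *p;

        rcu_read_lock();
        p = sem_lookup(id);     /* RCU-protected lookup. */
        if (!p) {
          rcu_read_unlock();
          return -EINVAL;
        }
        spin_lock(&p->lock);
        if (p->deleted) {       /* Recheck under the object's lock. */
          spin_unlock(&p->lock);
          rcu_read_unlock();
          return -EINVAL;
        }
        /* ... update the semaphore, fully consistently ... */
        spin_unlock(&p->lock);
        rcu_read_unlock();
        return 0;
      }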
1159*4882a593Smuzhiyun
1160*4882a593SmuzhiyunIn short, RCU is not required to maintain consistency, and other
1161*4882a593Smuzhiyunmechanisms may be used in concert with RCU when consistency is required.
1162*4882a593SmuzhiyunRCU's specialization allows it to do its job extremely well, and its
1163*4882a593Smuzhiyunability to interoperate with other synchronization mechanisms allows the
1164*4882a593Smuzhiyunright mix of synchronization tools to be used for a given job.
1165*4882a593Smuzhiyun
1166*4882a593SmuzhiyunPerformance and Scalability
1167*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~
1168*4882a593Smuzhiyun
1169*4882a593SmuzhiyunEnergy efficiency is a critical component of performance today, and
1170*4882a593SmuzhiyunLinux-kernel RCU implementations must therefore avoid unnecessarily
1171*4882a593Smuzhiyunawakening idle CPUs. I cannot claim that this requirement was
1172*4882a593Smuzhiyunpremeditated. In fact, I learned of it during a telephone conversation
1173*4882a593Smuzhiyunin which I was given “frank and open” feedback on the importance of
1174*4882a593Smuzhiyunenergy efficiency in battery-powered systems and on specific
1175*4882a593Smuzhiyunenergy-efficiency shortcomings of the Linux-kernel RCU implementation.
1176*4882a593SmuzhiyunIn my experience, the battery-powered embedded community will consider
1177*4882a593Smuzhiyunany unnecessary wakeups to be extremely unfriendly acts. So much so that
1178*4882a593Smuzhiyunmere Linux-kernel-mailing-list posts are insufficient to vent their ire.
1179*4882a593Smuzhiyun
Memory consumption is not particularly important in most situations,
1181*4882a593Smuzhiyunand has become decreasingly so as memory sizes have expanded and memory
1182*4882a593Smuzhiyuncosts have plummeted. However, as I learned from Matt Mackall's
1183*4882a593Smuzhiyun`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
1184*4882a593Smuzhiyunfootprint is critically important on single-CPU systems with
1185*4882a593Smuzhiyunnon-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
1186*4882a593SmuzhiyunRCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__
1187*4882a593Smuzhiyunwas born. Josh Triplett has since taken over the small-memory banner
1188*4882a593Smuzhiyunwith his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__
1189*4882a593Smuzhiyunproject, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional
1190*4882a593Smuzhiyunfor those kernels not needing it.
1191*4882a593Smuzhiyun
1192*4882a593SmuzhiyunThe remaining performance requirements are, for the most part,
1193*4882a593Smuzhiyununsurprising. For example, in keeping with RCU's read-side
1194*4882a593Smuzhiyunspecialization, ``rcu_dereference()`` should have negligible overhead
1195*4882a593Smuzhiyun(for example, suppression of a few minor compiler optimizations).
1196*4882a593SmuzhiyunSimilarly, in non-preemptible environments, ``rcu_read_lock()`` and
1197*4882a593Smuzhiyun``rcu_read_unlock()`` should have exactly zero overhead.
1198*4882a593Smuzhiyun
1199*4882a593SmuzhiyunIn preemptible environments, in the case where the RCU read-side
1200*4882a593Smuzhiyuncritical section was not preempted (as will be the case for the
1201*4882a593Smuzhiyunhighest-priority real-time process), ``rcu_read_lock()`` and
1202*4882a593Smuzhiyun``rcu_read_unlock()`` should have minimal overhead. In particular, they
1203*4882a593Smuzhiyunshould not contain atomic read-modify-write operations, memory-barrier
1204*4882a593Smuzhiyuninstructions, preemption disabling, interrupt disabling, or backwards
1205*4882a593Smuzhiyunbranches. However, in the case where the RCU read-side critical section
1206*4882a593Smuzhiyunwas preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable
1207*4882a593Smuzhiyuninterrupts. This is why it is better to nest an RCU read-side critical
1208*4882a593Smuzhiyunsection within a preempt-disable region than vice versa, at least in
1209*4882a593Smuzhiyuncases where that critical section is short enough to avoid unduly
1210*4882a593Smuzhiyundegrading real-time latencies.
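
In other words, the preferred nesting is as sketched below, assuming
that the enclosed critical section is short (the
``do_something_brief()`` helper is hypothetical):

   ::

      preempt_disable();
      rcu_read_lock();     /* This reader cannot be preempted, so... */
      do_something_brief();
      rcu_read_unlock();   /* ...the unlock fast path is guaranteed. */
      preempt_enable();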
1211*4882a593Smuzhiyun
1212*4882a593SmuzhiyunThe ``synchronize_rcu()`` grace-period-wait primitive is optimized for
1213*4882a593Smuzhiyunthroughput. It may therefore incur several milliseconds of latency in
1214*4882a593Smuzhiyunaddition to the duration of the longest RCU read-side critical section.
1215*4882a593SmuzhiyunOn the other hand, multiple concurrent invocations of
1216*4882a593Smuzhiyun``synchronize_rcu()`` are required to use batching optimizations so that
1217*4882a593Smuzhiyunthey can be satisfied by a single underlying grace-period-wait
1218*4882a593Smuzhiyunoperation. For example, in the Linux kernel, it is not unusual for a
1219*4882a593Smuzhiyunsingle grace-period-wait operation to serve more than `1,000 separate
1220*4882a593Smuzhiyuninvocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__
1221*4882a593Smuzhiyunof ``synchronize_rcu()``, thus amortizing the per-invocation overhead
1222*4882a593Smuzhiyundown to nearly zero. However, the grace-period optimization is also
1223*4882a593Smuzhiyunrequired to avoid measurable degradation of real-time scheduling and
1224*4882a593Smuzhiyuninterrupt latencies.
1225*4882a593Smuzhiyun
1226*4882a593SmuzhiyunIn some cases, the multi-millisecond ``synchronize_rcu()`` latencies are
1227*4882a593Smuzhiyununacceptable. In these cases, ``synchronize_rcu_expedited()`` may be
1228*4882a593Smuzhiyunused instead, reducing the grace-period latency down to a few tens of
1229*4882a593Smuzhiyunmicroseconds on small systems, at least in cases where the RCU read-side
1230*4882a593Smuzhiyuncritical sections are short. There are currently no special latency
1231*4882a593Smuzhiyunrequirements for ``synchronize_rcu_expedited()`` on large systems, but,
1232*4882a593Smuzhiyunconsistent with the empirical nature of the RCU specification, that is
1233*4882a593Smuzhiyunsubject to change. However, there most definitely are scalability
1234*4882a593Smuzhiyunrequirements: A storm of ``synchronize_rcu_expedited()`` invocations on
1235*4882a593Smuzhiyun4096 CPUs should at least make reasonable forward progress. In return
1236*4882a593Smuzhiyunfor its shorter latencies, ``synchronize_rcu_expedited()`` is permitted
1237*4882a593Smuzhiyunto impose modest degradation of real-time latency on non-idle online
1238*4882a593SmuzhiyunCPUs. Here, “modest” means roughly the same latency degradation as a
1239*4882a593Smuzhiyunscheduling-clock interrupt.
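
For example, a latency-sensitive variant of the earlier
``remove_gp_synchronous()`` might simply substitute the expedited
primitive, as in the following sketch, which again reuses ``gp`` and
``gp_lock`` from the earlier examples:

   ::

      bool remove_gp_expedited(void)
      {
        struct foo *p;

        spin_lock(&gp_lock);
        p = rcu_access_pointer(gp);
        if (!p) {
          spin_unlock(&gp_lock);
          return false;
        }
        rcu_assign_pointer(gp, NULL);
        spin_unlock(&gp_lock);
        synchronize_rcu_expedited(); /* Tens of microseconds, with luck. */
        kfree(p);
        return true;
      }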
1240*4882a593Smuzhiyun
1241*4882a593SmuzhiyunThere are a number of situations where even
1242*4882a593Smuzhiyun``synchronize_rcu_expedited()``'s reduced grace-period latency is
1243*4882a593Smuzhiyununacceptable. In these situations, the asynchronous ``call_rcu()`` can
1244*4882a593Smuzhiyunbe used in place of ``synchronize_rcu()`` as follows:
1245*4882a593Smuzhiyun
1246*4882a593Smuzhiyun   ::
1247*4882a593Smuzhiyun
1248*4882a593Smuzhiyun       1 struct foo {
1249*4882a593Smuzhiyun       2   int a;
1250*4882a593Smuzhiyun       3   int b;
1251*4882a593Smuzhiyun       4   struct rcu_head rh;
1252*4882a593Smuzhiyun       5 };
1253*4882a593Smuzhiyun       6
1254*4882a593Smuzhiyun       7 static void remove_gp_cb(struct rcu_head *rhp)
1255*4882a593Smuzhiyun       8 {
1256*4882a593Smuzhiyun       9   struct foo *p = container_of(rhp, struct foo, rh);
1257*4882a593Smuzhiyun      10
1258*4882a593Smuzhiyun      11   kfree(p);
1259*4882a593Smuzhiyun      12 }
1260*4882a593Smuzhiyun      13
1261*4882a593Smuzhiyun      14 bool remove_gp_asynchronous(void)
1262*4882a593Smuzhiyun      15 {
1263*4882a593Smuzhiyun      16   struct foo *p;
1264*4882a593Smuzhiyun      17
1265*4882a593Smuzhiyun      18   spin_lock(&gp_lock);
1266*4882a593Smuzhiyun      19   p = rcu_access_pointer(gp);
1267*4882a593Smuzhiyun      20   if (!p) {
1268*4882a593Smuzhiyun      21     spin_unlock(&gp_lock);
1269*4882a593Smuzhiyun      22     return false;
1270*4882a593Smuzhiyun      23   }
1271*4882a593Smuzhiyun      24   rcu_assign_pointer(gp, NULL);
1272*4882a593Smuzhiyun      25   call_rcu(&p->rh, remove_gp_cb);
1273*4882a593Smuzhiyun      26   spin_unlock(&gp_lock);
1274*4882a593Smuzhiyun      27   return true;
1275*4882a593Smuzhiyun      28 }
1276*4882a593Smuzhiyun
1277*4882a593SmuzhiyunA definition of ``struct foo`` is finally needed, and appears on
1278*4882a593Smuzhiyunlines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()``
1279*4882a593Smuzhiyunon line 25, and will be invoked after the end of a subsequent grace
1280*4882a593Smuzhiyunperiod. This gets the same effect as ``remove_gp_synchronous()``, but
1281*4882a593Smuzhiyunwithout forcing the updater to wait for a grace period to elapse. The
1282*4882a593Smuzhiyun``call_rcu()`` function may be used in a number of situations where
1283*4882a593Smuzhiyunneither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would
1284*4882a593Smuzhiyunbe legal, including within preempt-disable code, ``local_bh_disable()``
1285*4882a593Smuzhiyuncode, interrupt-disable code, and interrupt handlers. However, even
1286*4882a593Smuzhiyun``call_rcu()`` is illegal within NMI handlers and from idle and offline
1287*4882a593SmuzhiyunCPUs. The callback function (``remove_gp_cb()`` in this case) will be
executed within a softirq (software interrupt) environment within the
1289*4882a593SmuzhiyunLinux kernel, either within a real softirq handler or under the
1290*4882a593Smuzhiyunprotection of ``local_bh_disable()``. In both the Linux kernel and in
1291*4882a593Smuzhiyunuserspace, it is bad practice to write an RCU callback function that
1292*4882a593Smuzhiyuntakes too long. Long-running operations should be relegated to separate
1293*4882a593Smuzhiyunthreads or (in the Linux kernel) workqueues.
1294*4882a593Smuzhiyun
1295*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1296*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
1297*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1298*4882a593Smuzhiyun| Why does line 19 use ``rcu_access_pointer()``? After all,             |
1299*4882a593Smuzhiyun| ``call_rcu()`` on line 25 stores into the structure, which would      |
1300*4882a593Smuzhiyun| interact badly with concurrent insertions. Doesn't this mean that     |
1301*4882a593Smuzhiyun| ``rcu_dereference()`` is required?                                    |
1302*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1303*4882a593Smuzhiyun| **Answer**:                                                           |
1304*4882a593Smuzhiyun+-----------------------------------------------------------------------+
| Presumably the ``gp_lock`` acquired on line 18 excludes any changes,  |
| including any insertions that ``rcu_dereference()`` would protect     |
| against. Therefore, any insertions will be delayed until after        |
| ``gp_lock`` is released on line 26, which in turn means that          |
1309*4882a593Smuzhiyun| ``rcu_access_pointer()`` suffices.                                    |
1310*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1311*4882a593Smuzhiyun
1312*4882a593SmuzhiyunHowever, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on
1313*4882a593Smuzhiyunthe data element. This is a common idiom, and is supported by
1314*4882a593Smuzhiyun``kfree_rcu()``, which allows “fire and forget” operation as shown
1315*4882a593Smuzhiyunbelow:
1316*4882a593Smuzhiyun
1317*4882a593Smuzhiyun   ::
1318*4882a593Smuzhiyun
1319*4882a593Smuzhiyun       1 struct foo {
1320*4882a593Smuzhiyun       2   int a;
1321*4882a593Smuzhiyun       3   int b;
1322*4882a593Smuzhiyun       4   struct rcu_head rh;
1323*4882a593Smuzhiyun       5 };
1324*4882a593Smuzhiyun       6
1325*4882a593Smuzhiyun       7 bool remove_gp_faf(void)
1326*4882a593Smuzhiyun       8 {
1327*4882a593Smuzhiyun       9   struct foo *p;
1328*4882a593Smuzhiyun      10
1329*4882a593Smuzhiyun      11   spin_lock(&gp_lock);
1330*4882a593Smuzhiyun      12   p = rcu_dereference(gp);
1331*4882a593Smuzhiyun      13   if (!p) {
1332*4882a593Smuzhiyun      14     spin_unlock(&gp_lock);
1333*4882a593Smuzhiyun      15     return false;
1334*4882a593Smuzhiyun      16   }
1335*4882a593Smuzhiyun      17   rcu_assign_pointer(gp, NULL);
1336*4882a593Smuzhiyun      18   kfree_rcu(p, rh);
1337*4882a593Smuzhiyun      19   spin_unlock(&gp_lock);
1338*4882a593Smuzhiyun      20   return true;
1339*4882a593Smuzhiyun      21 }
1340*4882a593Smuzhiyun
1341*4882a593SmuzhiyunNote that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and
1342*4882a593Smuzhiyunproceeds, without any need to pay any further attention to the
1343*4882a593Smuzhiyunsubsequent grace period and ``kfree()``. It is permissible to invoke
1344*4882a593Smuzhiyun``kfree_rcu()`` from the same environments as for ``call_rcu()``.
1345*4882a593SmuzhiyunInterestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()``
1346*4882a593Smuzhiyunand ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the
1347*4882a593Smuzhiyunfact that RCU was not heavily used within DYNIX/ptx, so the very few
1348*4882a593Smuzhiyunplaces that needed something like ``synchronize_rcu()`` simply
1349*4882a593Smuzhiyunopen-coded it.
1350*4882a593Smuzhiyun
1351*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1352*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
1353*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1354*4882a593Smuzhiyun| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()``        |
1355*4882a593Smuzhiyun| allowed updaters to avoid being blocked by readers. But how can that  |
1356*4882a593Smuzhiyun| be correct, given that the invocation of the callback and the freeing |
1357*4882a593Smuzhiyun| of the memory (respectively) must still wait for a grace period to    |
1358*4882a593Smuzhiyun| elapse?                                                               |
1359*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1360*4882a593Smuzhiyun| **Answer**:                                                           |
1361*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1362*4882a593Smuzhiyun| We could define things this way, but keep in mind that this sort of   |
1363*4882a593Smuzhiyun| definition would say that updates in garbage-collected languages      |
1364*4882a593Smuzhiyun| cannot complete until the next time the garbage collector runs, which |
1365*4882a593Smuzhiyun| does not seem at all reasonable. The key point is that in most cases, |
1366*4882a593Smuzhiyun| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed |
1367*4882a593Smuzhiyun| to the next update as soon as it has invoked ``call_rcu()`` or        |
1368*4882a593Smuzhiyun| ``kfree_rcu()``, without having to wait for a subsequent grace        |
1369*4882a593Smuzhiyun| period.                                                               |
1370*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1371*4882a593Smuzhiyun
1372*4882a593SmuzhiyunBut what if the updater must wait for the completion of code to be
1373*4882a593Smuzhiyunexecuted after the end of the grace period, but has other tasks that can
1374*4882a593Smuzhiyunbe carried out in the meantime? The polling-style
1375*4882a593Smuzhiyun``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions
1376*4882a593Smuzhiyunmay be used for this purpose, as shown below:
1377*4882a593Smuzhiyun
1378*4882a593Smuzhiyun   ::
1379*4882a593Smuzhiyun
1380*4882a593Smuzhiyun       1 bool remove_gp_poll(void)
1381*4882a593Smuzhiyun       2 {
1382*4882a593Smuzhiyun       3   struct foo *p;
1383*4882a593Smuzhiyun       4   unsigned long s;
1384*4882a593Smuzhiyun       5
1385*4882a593Smuzhiyun       6   spin_lock(&gp_lock);
1386*4882a593Smuzhiyun       7   p = rcu_access_pointer(gp);
1387*4882a593Smuzhiyun       8   if (!p) {
1388*4882a593Smuzhiyun       9     spin_unlock(&gp_lock);
1389*4882a593Smuzhiyun      10     return false;
1390*4882a593Smuzhiyun      11   }
1391*4882a593Smuzhiyun      12   rcu_assign_pointer(gp, NULL);
1392*4882a593Smuzhiyun      13   spin_unlock(&gp_lock);
1393*4882a593Smuzhiyun      14   s = get_state_synchronize_rcu();
1394*4882a593Smuzhiyun      15   do_something_while_waiting();
1395*4882a593Smuzhiyun      16   cond_synchronize_rcu(s);
1396*4882a593Smuzhiyun      17   kfree(p);
1397*4882a593Smuzhiyun      18   return true;
1398*4882a593Smuzhiyun      19 }
1399*4882a593Smuzhiyun
1400*4882a593SmuzhiyunOn line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU,
1401*4882a593Smuzhiyunthen line 15 carries out other tasks, and finally, line 16 returns
1402*4882a593Smuzhiyunimmediately if a grace period has elapsed in the meantime, but otherwise
waits as required. The need for ``get_state_synchronize_rcu()`` and
1404*4882a593Smuzhiyun``cond_synchronize_rcu()`` has appeared quite recently, so it is too
1405*4882a593Smuzhiyunearly to tell whether they will stand the test of time.
1406*4882a593Smuzhiyun
1407*4882a593SmuzhiyunRCU thus provides a range of tools to allow updaters to strike the
required tradeoff between latency, flexibility, and CPU overhead.
1409*4882a593Smuzhiyun
1410*4882a593SmuzhiyunForward Progress
1411*4882a593Smuzhiyun~~~~~~~~~~~~~~~~
1412*4882a593Smuzhiyun
1413*4882a593SmuzhiyunIn theory, delaying grace-period completion and callback invocation is
1414*4882a593Smuzhiyunharmless. In practice, not only are memory sizes finite but also
1415*4882a593Smuzhiyuncallbacks sometimes do wakeups, and sufficiently deferred wakeups can be
1416*4882a593Smuzhiyundifficult to distinguish from system hangs. Therefore, RCU must provide
1417*4882a593Smuzhiyuna number of mechanisms to promote forward progress.
1418*4882a593Smuzhiyun
1419*4882a593SmuzhiyunThese mechanisms are not foolproof, nor can they be. For one simple
1420*4882a593Smuzhiyunexample, an infinite loop in an RCU read-side critical section must by
1421*4882a593Smuzhiyundefinition prevent later grace periods from ever completing. For a more
1422*4882a593Smuzhiyuninvolved example, consider a 64-CPU system built with
1423*4882a593Smuzhiyun``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
1424*4882a593SmuzhiyunCPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even
1425*4882a593Smuzhiyunif these tight loops also contain calls to ``cond_resched()`` (thus
1426*4882a593Smuzhiyunallowing grace periods to complete), CPU 0 simply will not be able to
1427*4882a593Smuzhiyuninvoke callbacks as fast as the other 63 CPUs can register them, at
1428*4882a593Smuzhiyunleast not until the system runs out of memory. In both of these
1429*4882a593Smuzhiyunexamples, the Spiderman principle applies: With great power comes great
1430*4882a593Smuzhiyunresponsibility. However, short of this level of abuse, RCU is required
1431*4882a593Smuzhiyunto ensure timely completion of grace periods and timely invocation of
1432*4882a593Smuzhiyuncallbacks.
1433*4882a593Smuzhiyun
1434*4882a593SmuzhiyunRCU takes the following steps to encourage timely completion of grace
1435*4882a593Smuzhiyunperiods:
1436*4882a593Smuzhiyun
1437*4882a593Smuzhiyun#. If a grace period fails to complete within 100 milliseconds, RCU
1438*4882a593Smuzhiyun   causes future invocations of ``cond_resched()`` on the holdout CPUs
1439*4882a593Smuzhiyun   to provide an RCU quiescent state. RCU also causes those CPUs'
1440*4882a593Smuzhiyun   ``need_resched()`` invocations to return ``true``, but only after the
   corresponding CPU's next scheduling-clock interrupt.
1442*4882a593Smuzhiyun#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run
1443*4882a593Smuzhiyun   indefinitely in the kernel without scheduling-clock interrupts, which
   defeats the above ``need_resched()`` stratagem. RCU will therefore
1445*4882a593Smuzhiyun   invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out
1446*4882a593Smuzhiyun   after 109 milliseconds.
1447*4882a593Smuzhiyun#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
1448*4882a593Smuzhiyun   has been preempted within an RCU read-side critical section is
1449*4882a593Smuzhiyun   holding out for more than 500 milliseconds, RCU will resort to
1450*4882a593Smuzhiyun   priority boosting.
1451*4882a593Smuzhiyun#. If a CPU is still holding out 10 seconds into the grace period, RCU
1452*4882a593Smuzhiyun   will invoke ``resched_cpu()`` on it regardless of its ``nohz_full``
1453*4882a593Smuzhiyun   state.
1454*4882a593Smuzhiyun
1455*4882a593SmuzhiyunThe above values are defaults for systems running with ``HZ=1000``. They
1456*4882a593Smuzhiyunwill vary as the value of ``HZ`` varies, and can also be changed using
1457*4882a593Smuzhiyunthe relevant Kconfig options and kernel boot parameters. RCU currently
1458*4882a593Smuzhiyundoes not do much sanity checking of these parameters, so please use
1459*4882a593Smuzhiyuncaution when changing them. Note that these forward-progress measures
1460*4882a593Smuzhiyunare provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
1461*4882a593SmuzhiyunRCU <#Tasks%20RCU>`__.
1462*4882a593Smuzhiyun
1463*4882a593SmuzhiyunRCU takes the following steps in ``call_rcu()`` to encourage timely
1464*4882a593Smuzhiyuninvocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
1465*4882a593Smuzhiyun10,000 callbacks, or has 10,000 more callbacks than it had the last time
1466*4882a593Smuzhiyunencouragement was provided:
1467*4882a593Smuzhiyun
1468*4882a593Smuzhiyun#. Starts a grace period, if one is not already in progress.
1469*4882a593Smuzhiyun#. Forces immediate checking for quiescent states, rather than waiting
1470*4882a593Smuzhiyun   for three milliseconds to have elapsed since the beginning of the
1471*4882a593Smuzhiyun   grace period.
1472*4882a593Smuzhiyun#. Immediately tags the CPU's callbacks with their grace period
1473*4882a593Smuzhiyun   completion numbers, rather than waiting for the ``RCU_SOFTIRQ``
1474*4882a593Smuzhiyun   handler to get around to it.
1475*4882a593Smuzhiyun#. Lifts callback-execution batch limits, which speeds up callback
1476*4882a593Smuzhiyun   invocation at the expense of degrading realtime response.
1477*4882a593Smuzhiyun
1478*4882a593SmuzhiyunAgain, these are default values when running at ``HZ=1000``, and can be
1479*4882a593Smuzhiyunoverridden. Again, these forward-progress measures are provided only for
1480*4882a593SmuzhiyunRCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
1481*4882a593SmuzhiyunRCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
1482*4882a593Smuzhiyunprogress for ``rcu_nocbs`` CPUs is much less well-developed, in part
1483*4882a593Smuzhiyunbecause workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke
1484*4882a593Smuzhiyun``call_rcu()`` relatively infrequently. If workloads emerge that need
1485*4882a593Smuzhiyunboth ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then
1486*4882a593Smuzhiyunadditional forward-progress work will be required.
1487*4882a593Smuzhiyun
1488*4882a593SmuzhiyunComposability
1489*4882a593Smuzhiyun~~~~~~~~~~~~~
1490*4882a593Smuzhiyun
1491*4882a593SmuzhiyunComposability has received much attention in recent years, perhaps in
1492*4882a593Smuzhiyunpart due to the collision of multicore hardware with object-oriented
1493*4882a593Smuzhiyuntechniques designed in single-threaded environments for single-threaded
1494*4882a593Smuzhiyunuse. And in theory, RCU read-side critical sections may be composed, and
1495*4882a593Smuzhiyunin fact may be nested arbitrarily deeply. In practice, as with all
1496*4882a593Smuzhiyunreal-world implementations of composable constructs, there are
1497*4882a593Smuzhiyunlimitations.
1498*4882a593Smuzhiyun
1499*4882a593SmuzhiyunImplementations of RCU for which ``rcu_read_lock()`` and
1500*4882a593Smuzhiyun``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when
1501*4882a593Smuzhiyun``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there
1502*4882a593Smuzhiyunis no overhead. Except that if all these instances of
1503*4882a593Smuzhiyun``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the
1504*4882a593Smuzhiyuncompiler, compilation will eventually fail due to exhausting memory,
1505*4882a593Smuzhiyunmass storage, or user patience, whichever comes first. If the nesting is
1506*4882a593Smuzhiyunnot visible to the compiler, as is the case with mutually recursive
1507*4882a593Smuzhiyunfunctions each in its own translation unit, stack overflow will result.
1508*4882a593SmuzhiyunIf the nesting takes the form of loops, perhaps in the guise of tail
1509*4882a593Smuzhiyunrecursion, either the control variable will overflow or (in the Linux
1510*4882a593Smuzhiyunkernel) you will get an RCU CPU stall warning. Nevertheless, this class
1511*4882a593Smuzhiyunof RCU implementations is one of the most composable constructs in
1512*4882a593Smuzhiyunexistence.
1513*4882a593Smuzhiyun
1514*4882a593SmuzhiyunRCU implementations that explicitly track nesting depth are limited by
1515*4882a593Smuzhiyunthe nesting-depth counter. For example, the Linux kernel's preemptible
1516*4882a593SmuzhiyunRCU limits nesting to ``INT_MAX``. This should suffice for almost all
1517*4882a593Smuzhiyunpractical purposes. That said, a consecutive pair of RCU read-side
1518*4882a593Smuzhiyuncritical sections between which there is an operation that waits for a
1519*4882a593Smuzhiyungrace period cannot be enclosed in another RCU read-side critical
1520*4882a593Smuzhiyunsection. This is because it is not legal to wait for a grace period
1521*4882a593Smuzhiyunwithin an RCU read-side critical section: To do so would result either
1522*4882a593Smuzhiyunin deadlock or in RCU implicitly splitting the enclosing RCU read-side
1523*4882a593Smuzhiyuncritical section, neither of which is conducive to a long-lived and
1524*4882a593Smuzhiyunprosperous kernel.
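
The contrast can be sketched as follows, where the second fragment
illustrates the deadlock just described and must never appear in real
code:

   ::

      rcu_read_lock();
      rcu_read_lock();    /* Legal: read-side critical sections nest. */
      rcu_read_unlock();
      rcu_read_unlock();

      rcu_read_lock();
      synchronize_rcu();  /* BUG: grace-period wait within a reader. */
      rcu_read_unlock();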
1525*4882a593Smuzhiyun
1526*4882a593SmuzhiyunIt is worth noting that RCU is not alone in limiting composability. For
1527*4882a593Smuzhiyunexample, many transactional-memory implementations prohibit composing a
1528*4882a593Smuzhiyunpair of transactions separated by an irrevocable operation (for example,
1529*4882a593Smuzhiyuna network receive operation). For another example, lock-based critical
1530*4882a593Smuzhiyunsections can be composed surprisingly freely, but only if deadlock is
1531*4882a593Smuzhiyunavoided.
1532*4882a593Smuzhiyun
1533*4882a593SmuzhiyunIn short, although RCU read-side critical sections are highly
1534*4882a593Smuzhiyuncomposable, care is required in some situations, just as is the case for
1535*4882a593Smuzhiyunany other composable synchronization mechanism.
1536*4882a593Smuzhiyun
1537*4882a593SmuzhiyunCorner Cases
1538*4882a593Smuzhiyun~~~~~~~~~~~~
1539*4882a593Smuzhiyun
1540*4882a593SmuzhiyunA given RCU workload might have an endless and intense stream of RCU
1541*4882a593Smuzhiyunread-side critical sections, perhaps even so intense that there was
1542*4882a593Smuzhiyunnever a point in time during which there was not at least one RCU
1543*4882a593Smuzhiyunread-side critical section in flight. RCU cannot allow this situation to
1544*4882a593Smuzhiyunblock grace periods: As long as all the RCU read-side critical sections
1545*4882a593Smuzhiyunare finite, grace periods must also be finite.
1546*4882a593Smuzhiyun
1547*4882a593SmuzhiyunThat said, preemptible RCU implementations could potentially result in
1548*4882a593SmuzhiyunRCU read-side critical sections being preempted for long durations,
1549*4882a593Smuzhiyunwhich has the effect of creating a long-duration RCU read-side critical
1550*4882a593Smuzhiyunsection. This situation can arise only in heavily loaded systems, but
1551*4882a593Smuzhiyunsystems using real-time priorities are of course more vulnerable.
1552*4882a593SmuzhiyunTherefore, RCU priority boosting is provided to help deal with this
1553*4882a593Smuzhiyuncase. That said, the exact requirements on RCU priority boosting will
1554*4882a593Smuzhiyunlikely evolve as more experience accumulates.
1555*4882a593Smuzhiyun
1556*4882a593SmuzhiyunOther workloads might have very high update rates. Although one can
1557*4882a593Smuzhiyunargue that such workloads should instead use something other than RCU,
1558*4882a593Smuzhiyunthe fact remains that RCU must handle such workloads gracefully. This
1559*4882a593Smuzhiyunrequirement is another factor driving batching of grace periods, but it
1560*4882a593Smuzhiyunis also the driving force behind the checks for large numbers of queued
1561*4882a593SmuzhiyunRCU callbacks in the ``call_rcu()`` code path. Finally, high update
1562*4882a593Smuzhiyunrates should not delay RCU read-side critical sections, although some
1563*4882a593Smuzhiyunsmall read-side delays can occur when using
1564*4882a593Smuzhiyun``synchronize_rcu_expedited()``, courtesy of this function's use of
1565*4882a593Smuzhiyun``smp_call_function_single()``.
1566*4882a593Smuzhiyun
1567*4882a593SmuzhiyunAlthough all three of these corner cases were understood in the early
1568*4882a593Smuzhiyun1990s, a simple user-level test consisting of ``close(open(path))`` in a
1569*4882a593Smuzhiyuntight loop in the early 2000s suddenly provided a much deeper
1570*4882a593Smuzhiyunappreciation of the high-update-rate corner case. This test also
1571*4882a593Smuzhiyunmotivated addition of some RCU code to react to high update rates, for
1572*4882a593Smuzhiyunexample, if a given CPU finds itself with more than 10,000 RCU callbacks
1573*4882a593Smuzhiyunqueued, it will cause RCU to take evasive action by more aggressively
1574*4882a593Smuzhiyunstarting grace periods and more aggressively forcing completion of
1575*4882a593Smuzhiyungrace-period processing. This evasive action causes the grace period to
1576*4882a593Smuzhiyuncomplete more quickly, but at the cost of restricting RCU's batching
1577*4882a593Smuzhiyunoptimizations, thus increasing the CPU overhead incurred by that grace
1578*4882a593Smuzhiyunperiod.
1579*4882a593Smuzhiyun
1580*4882a593SmuzhiyunSoftware-Engineering Requirements
1581*4882a593Smuzhiyun---------------------------------
1582*4882a593Smuzhiyun
1583*4882a593SmuzhiyunBetween Murphy's Law and “To err is human”, it is necessary to guard
1584*4882a593Smuzhiyunagainst mishaps and misuse:
1585*4882a593Smuzhiyun
1586*4882a593Smuzhiyun#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere
1587*4882a593Smuzhiyun   that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will
1588*4882a593Smuzhiyun   splat if ``rcu_dereference()`` is used outside of an RCU read-side
1589*4882a593Smuzhiyun   critical section. Update-side code can use
1590*4882a593Smuzhiyun   ``rcu_dereference_protected()``, which takes a `lockdep
1591*4882a593Smuzhiyun   expression <https://lwn.net/Articles/371986/>`__ to indicate what is
1592*4882a593Smuzhiyun   providing the protection. If the indicated protection is not
   provided, a lockdep splat is emitted (see the sketch following this
   list).
1594*4882a593Smuzhiyun   Code shared between readers and updaters can use
1595*4882a593Smuzhiyun   ``rcu_dereference_check()``, which also takes a lockdep expression,
1596*4882a593Smuzhiyun   and emits a lockdep splat if neither ``rcu_read_lock()`` nor the
1597*4882a593Smuzhiyun   indicated protection is in place. In addition,
1598*4882a593Smuzhiyun   ``rcu_dereference_raw()`` is used in those (hopefully rare) cases
1599*4882a593Smuzhiyun   where the required protection cannot be easily described. Finally,
1600*4882a593Smuzhiyun   ``rcu_read_lock_held()`` is provided to allow a function to verify
1601*4882a593Smuzhiyun   that it has been invoked within an RCU read-side critical section. I
1602*4882a593Smuzhiyun   was made aware of this set of requirements shortly after Thomas
1603*4882a593Smuzhiyun   Gleixner audited a number of RCU uses.
1604*4882a593Smuzhiyun#. A given function might wish to check for RCU-related preconditions
1605*4882a593Smuzhiyun   upon entry, before using any other RCU API. The
   ``rcu_lockdep_assert()`` macro does this job, asserting the expression
   in
1607*4882a593Smuzhiyun   kernels having lockdep enabled and doing nothing otherwise.
1608*4882a593Smuzhiyun#. It is also easy to forget to use ``rcu_assign_pointer()`` and
1609*4882a593Smuzhiyun   ``rcu_dereference()``, perhaps (incorrectly) substituting a simple
1610*4882a593Smuzhiyun   assignment. To catch this sort of error, a given RCU-protected
1611*4882a593Smuzhiyun   pointer may be tagged with ``__rcu``, after which sparse will
1612*4882a593Smuzhiyun   complain about simple-assignment accesses to that pointer. Arnd
1613*4882a593Smuzhiyun   Bergmann made me aware of this requirement, and also supplied the
1614*4882a593Smuzhiyun   needed `patch series <https://lwn.net/Articles/376011/>`__.
1615*4882a593Smuzhiyun#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if
1616*4882a593Smuzhiyun   a data element is passed to ``call_rcu()`` twice in a row, without a
1617*4882a593Smuzhiyun   grace period in between. (This error is similar to a double free.)
1618*4882a593Smuzhiyun   The corresponding ``rcu_head`` structures that are dynamically
1619*4882a593Smuzhiyun   allocated are automatically tracked, but ``rcu_head`` structures
1620*4882a593Smuzhiyun   allocated on the stack must be initialized with
1621*4882a593Smuzhiyun   ``init_rcu_head_on_stack()`` and cleaned up with
1622*4882a593Smuzhiyun   ``destroy_rcu_head_on_stack()``. Similarly, statically allocated
1623*4882a593Smuzhiyun   non-stack ``rcu_head`` structures must be initialized with
1624*4882a593Smuzhiyun   ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``.
1625*4882a593Smuzhiyun   Mathieu Desnoyers made me aware of this requirement, and also
1626*4882a593Smuzhiyun   supplied the needed
1627*4882a593Smuzhiyun   `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__.
1628*4882a593Smuzhiyun#. An infinite loop in an RCU read-side critical section will eventually
1629*4882a593Smuzhiyun   trigger an RCU CPU stall warning splat, with the duration of
1630*4882a593Smuzhiyun   “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT``
1631*4882a593Smuzhiyun   ``Kconfig`` option, or, alternatively, by the
1632*4882a593Smuzhiyun   ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU
1633*4882a593Smuzhiyun   is not obligated to produce this splat unless there is a grace period
1634*4882a593Smuzhiyun   waiting on that particular RCU read-side critical section.
1635*4882a593Smuzhiyun
1636*4882a593Smuzhiyun   Some extreme workloads might intentionally delay RCU grace periods,
1637*4882a593Smuzhiyun   and systems running those workloads can be booted with
1638*4882a593Smuzhiyun   ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This
1639*4882a593Smuzhiyun   kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU
1640*4882a593Smuzhiyun   stall warnings are counter-productive during sysrq dumps and during
1641*4882a593Smuzhiyun   panics. RCU therefore supplies the ``rcu_sysrq_start()`` and
1642*4882a593Smuzhiyun   ``rcu_sysrq_end()`` API members to be called before and after long
1643*4882a593Smuzhiyun   sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is
1644*4882a593Smuzhiyun   automatically invoked at the beginning of a panic to suppress further
1645*4882a593Smuzhiyun   RCU CPU stall warnings.
1646*4882a593Smuzhiyun
1647*4882a593Smuzhiyun   This requirement made itself known in the early 1990s, pretty much
1648*4882a593Smuzhiyun   the first time that it was necessary to debug a CPU stall. That said,
1649*4882a593Smuzhiyun   the initial implementation in DYNIX/ptx was quite generic in
1650*4882a593Smuzhiyun   comparison with that of Linux.
1651*4882a593Smuzhiyun
1652*4882a593Smuzhiyun#. Although it would be very good to detect pointers leaking out of RCU
1653*4882a593Smuzhiyun   read-side critical sections, there is currently no good way of doing
1654*4882a593Smuzhiyun   this. One complication is the need to distinguish between pointers
1655*4882a593Smuzhiyun   leaking and pointers that have been handed off from RCU to some other
1656*4882a593Smuzhiyun   synchronization mechanism, for example, reference counting.
1657*4882a593Smuzhiyun#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
1658*4882a593Smuzhiyun   is provided via event tracing.
1659*4882a593Smuzhiyun#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
1660*4882a593Smuzhiyun   to create typical linked data structures can be surprisingly
1661*4882a593Smuzhiyun   error-prone. Therefore, RCU-protected `linked
1662*4882a593Smuzhiyun   lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
1663*4882a593Smuzhiyun   more recently, RCU-protected `hash
1664*4882a593Smuzhiyun   tables <https://lwn.net/Articles/612100/>`__ are available. Many
1665*4882a593Smuzhiyun   other special-purpose RCU-protected data structures are available in
1666*4882a593Smuzhiyun   the Linux kernel and the userspace RCU library.
1667*4882a593Smuzhiyun#. Some linked structures are created at compile time, but still require
1668*4882a593Smuzhiyun   ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves
1669*4882a593Smuzhiyun   this purpose.
#. It is not necessary to use ``rcu_assign_pointer()`` when creating
   linked structures that are to be published via a single external
   pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task
   and also for assigning ``NULL`` pointers at runtime. (These
   pointer-initialization APIs also appear in the sketch following this
   list.)
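
To tie several of these APIs together, here is a minimal sketch of
publishing and unpublishing an RCU-protected pointer. The ``foo``
structure, the ``gp`` and ``foo_holder`` variables, and all of the
functions are hypothetical; the sketch merely illustrates ``__rcu``
tagging, ``RCU_POINTER_INITIALIZER()``, ``rcu_assign_pointer()``,
``RCU_INIT_POINTER()``, and an ``rcu_lockdep_assert()`` precondition
check (which more recent kernels spell ``RCU_LOCKDEP_WARN()``, with the
sense of the condition inverted):

  ::

       #include <linux/rcupdate.h>
       #include <linux/slab.h>

       struct foo {
               int a;
               struct rcu_head rh;
       };

       /* sparse will complain about plain C-language accesses to gp. */
       static struct foo __rcu *gp;

       /* Compile-time initialization of an RCU-protected pointer member. */
       static struct foo default_foo = { .a = 42 };
       static struct {
               struct foo __rcu *p;
       } foo_holder = {
               RCU_POINTER_INITIALIZER(p, &default_foo),
       };

       /* A reader that verifies its RCU precondition upon entry.  (Newer
        * kernels instead use RCU_LOCKDEP_WARN(!rcu_read_lock_held(), ...).) */
       static int foo_get_a(void)
       {
               struct foo *p;

               rcu_lockdep_assert(rcu_read_lock_held(),
                                  "foo_get_a() needs rcu_read_lock()");
               p = rcu_dereference(gp);
               return p ? p->a : -1;
       }

       /* Publication of a newly initialized structure requires
        * rcu_assign_pointer() in order to supply the needed ordering. */
       static int foo_publish(int a)
       {
               struct foo *p = kzalloc(sizeof(*p), GFP_KERNEL);

               if (!p)
                       return -ENOMEM;
               p->a = a;
               rcu_assign_pointer(gp, p);
               return 0;
       }

       /* Assigning NULL requires no ordering, so RCU_INIT_POINTER()
        * suffices.  The "1" unconditionally asserts update-side mutual
        * exclusion, which a real implementation would instead prove to
        * lockdep, for example, via lockdep_is_held(). */
       static void foo_unpublish(void)
       {
               struct foo *p = rcu_dereference_protected(gp, 1);

               RCU_INIT_POINTER(gp, NULL);
               if (p)
                       kfree_rcu(p, rh);
       }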
1674*4882a593Smuzhiyun
This is not a hard-and-fast list: RCU's diagnostic capabilities will
1676*4882a593Smuzhiyuncontinue to be guided by the number and type of usage bugs found in
1677*4882a593Smuzhiyunreal-world RCU usage.
1678*4882a593Smuzhiyun
1679*4882a593SmuzhiyunLinux Kernel Complications
1680*4882a593Smuzhiyun--------------------------
1681*4882a593Smuzhiyun
1682*4882a593SmuzhiyunThe Linux kernel provides an interesting environment for all kinds of
1683*4882a593Smuzhiyunsoftware, including RCU. Some of the relevant points of interest are as
1684*4882a593Smuzhiyunfollows:
1685*4882a593Smuzhiyun
1686*4882a593Smuzhiyun#. `Configuration`_
1687*4882a593Smuzhiyun#. `Firmware Interface`_
1688*4882a593Smuzhiyun#. `Early Boot`_
1689*4882a593Smuzhiyun#. `Interrupts and NMIs`_
1690*4882a593Smuzhiyun#. `Loadable Modules`_
1691*4882a593Smuzhiyun#. `Hotplug CPU`_
1692*4882a593Smuzhiyun#. `Scheduler and RCU`_
1693*4882a593Smuzhiyun#. `Tracing and RCU`_
1694*4882a593Smuzhiyun#. `Accesses to User Memory and RCU`_
1695*4882a593Smuzhiyun#. `Energy Efficiency`_
1696*4882a593Smuzhiyun#. `Scheduling-Clock Interrupts and RCU`_
1697*4882a593Smuzhiyun#. `Memory Efficiency`_
1698*4882a593Smuzhiyun#. `Performance, Scalability, Response Time, and Reliability`_
1699*4882a593Smuzhiyun
1700*4882a593SmuzhiyunThis list is probably incomplete, but it does give a feel for the most
1701*4882a593Smuzhiyunnotable Linux-kernel complications. Each of the following sections
1702*4882a593Smuzhiyuncovers one of the above topics.
1703*4882a593Smuzhiyun
1704*4882a593SmuzhiyunConfiguration
1705*4882a593Smuzhiyun~~~~~~~~~~~~~
1706*4882a593Smuzhiyun
1707*4882a593SmuzhiyunRCU's goal is automatic configuration, so that almost nobody needs to
1708*4882a593Smuzhiyunworry about RCU's ``Kconfig`` options. And for almost all users, RCU
1709*4882a593Smuzhiyundoes in fact work well “out of the box.”
1710*4882a593Smuzhiyun
1711*4882a593SmuzhiyunHowever, there are specialized use cases that are handled by kernel boot
1712*4882a593Smuzhiyunparameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
system will explicitly ask users about new ``Kconfig`` options, which
requires that almost all of them be hidden behind a
``CONFIG_RCU_EXPERT`` ``Kconfig`` option.
1716*4882a593Smuzhiyun
1717*4882a593SmuzhiyunThis all should be quite obvious, but the fact remains that Linus
1718*4882a593SmuzhiyunTorvalds recently had to
1719*4882a593Smuzhiyun`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__
1720*4882a593Smuzhiyunme of this requirement.
1721*4882a593Smuzhiyun
1722*4882a593SmuzhiyunFirmware Interface
1723*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~
1724*4882a593Smuzhiyun
In many cases, the kernel obtains information about the system from the
1726*4882a593Smuzhiyunfirmware, and sometimes things are lost in translation. Or the
1727*4882a593Smuzhiyuntranslation is accurate, but the original message is bogus.
1728*4882a593Smuzhiyun
1729*4882a593SmuzhiyunFor example, some systems' firmware overreports the number of CPUs,
1730*4882a593Smuzhiyunsometimes by a large factor. If RCU naively believed the firmware, as it
1731*4882a593Smuzhiyunused to do, it would create too many per-CPU kthreads. Although the
resulting system would still run correctly, the extra kthreads needlessly
1733*4882a593Smuzhiyunconsume memory and can cause confusion when they show up in ``ps``
1734*4882a593Smuzhiyunlistings.
1735*4882a593Smuzhiyun
1736*4882a593SmuzhiyunRCU must therefore wait for a given CPU to actually come online before
1737*4882a593Smuzhiyunit can allow itself to believe that the CPU actually exists. The
1738*4882a593Smuzhiyunresulting “ghost CPUs” (which are never going to come online) cause a
1739*4882a593Smuzhiyunnumber of `interesting
1740*4882a593Smuzhiyuncomplications <https://paulmck.livejournal.com/37494.html>`__.
1741*4882a593Smuzhiyun
1742*4882a593SmuzhiyunEarly Boot
1743*4882a593Smuzhiyun~~~~~~~~~~
1744*4882a593Smuzhiyun
1745*4882a593SmuzhiyunThe Linux kernel's boot sequence is an interesting process, and RCU is
1746*4882a593Smuzhiyunused early, even before ``rcu_init()`` is invoked. In fact, a number of
1747*4882a593SmuzhiyunRCU's primitives can be used as soon as the initial task's
1748*4882a593Smuzhiyun``task_struct`` is available and the boot CPU's per-CPU variables are
1749*4882a593Smuzhiyunset up. The read-side primitives (``rcu_read_lock()``,
1750*4882a593Smuzhiyun``rcu_read_unlock()``, ``rcu_dereference()``, and
1751*4882a593Smuzhiyun``rcu_access_pointer()``) will operate normally very early on, as will
1752*4882a593Smuzhiyun``rcu_assign_pointer()``.
1753*4882a593Smuzhiyun
1754*4882a593SmuzhiyunAlthough ``call_rcu()`` may be invoked at any time during boot,
1755*4882a593Smuzhiyuncallbacks are not guaranteed to be invoked until after all of RCU's
1756*4882a593Smuzhiyunkthreads have been spawned, which occurs at ``early_initcall()`` time.
1757*4882a593SmuzhiyunThis delay in callback invocation is due to the fact that RCU does not
1758*4882a593Smuzhiyuninvoke callbacks until it is fully initialized, and this full
1759*4882a593Smuzhiyuninitialization cannot occur until after the scheduler has initialized
1760*4882a593Smuzhiyunitself to the point where RCU can spawn and run its kthreads. In theory,
it would be possible to invoke callbacks earlier; however, this is not a
1762*4882a593Smuzhiyunpanacea because there would be severe restrictions on what operations
1763*4882a593Smuzhiyunthose callbacks could invoke.
1764*4882a593Smuzhiyun
1765*4882a593SmuzhiyunPerhaps surprisingly, ``synchronize_rcu()`` and
``synchronize_rcu_expedited()`` will operate normally during very early
boot, the reason being that there is only one CPU and preemption is
disabled. This means that the call to ``synchronize_rcu()`` (or friends)
1769*4882a593Smuzhiyunitself is a quiescent state and thus a grace period, so the early-boot
1770*4882a593Smuzhiyunimplementation can be a no-op.
1771*4882a593Smuzhiyun
1772*4882a593SmuzhiyunHowever, once the scheduler has spawned its first kthread, this early
1773*4882a593Smuzhiyunboot trick fails for ``synchronize_rcu()`` (as well as for
1774*4882a593Smuzhiyun``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The
1775*4882a593Smuzhiyunreason is that an RCU read-side critical section might be preempted,
1776*4882a593Smuzhiyunwhich means that a subsequent ``synchronize_rcu()`` really does have to
1777*4882a593Smuzhiyunwait for something, as opposed to simply returning immediately.
1778*4882a593SmuzhiyunUnfortunately, ``synchronize_rcu()`` can't do this until all of its
kthreads are spawned, which doesn't happen until some time during boot,
at ``early_initcall()`` time. But this is no excuse: RCU is nevertheless
1781*4882a593Smuzhiyunrequired to correctly handle synchronous grace periods during this time
1782*4882a593Smuzhiyunperiod. Once all of its kthreads are up and running, RCU starts running
1783*4882a593Smuzhiyunnormally.
1784*4882a593Smuzhiyun
1785*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1786*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
1787*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1788*4882a593Smuzhiyun| How can RCU possibly handle grace periods before all of its kthreads  |
1789*4882a593Smuzhiyun| have been spawned???                                                  |
1790*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1791*4882a593Smuzhiyun| **Answer**:                                                           |
1792*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1793*4882a593Smuzhiyun| Very carefully!                                                       |
1794*4882a593Smuzhiyun| During the “dead zone” between the time that the scheduler spawns the |
1795*4882a593Smuzhiyun| first task and the time that all of RCU's kthreads have been spawned, |
1796*4882a593Smuzhiyun| all synchronous grace periods are handled by the expedited            |
1797*4882a593Smuzhiyun| grace-period mechanism. At runtime, this expedited mechanism relies   |
1798*4882a593Smuzhiyun| on workqueues, but during the dead zone the requesting task itself    |
1799*4882a593Smuzhiyun| drives the desired expedited grace period. Because dead-zone          |
1800*4882a593Smuzhiyun| execution takes place within task context, everything works. Once the |
1801*4882a593Smuzhiyun| dead zone ends, expedited grace periods go back to using workqueues,  |
1802*4882a593Smuzhiyun| as is required to avoid problems that would otherwise occur when a    |
1803*4882a593Smuzhiyun| user task received a POSIX signal while driving an expedited grace    |
1804*4882a593Smuzhiyun| period.                                                               |
1805*4882a593Smuzhiyun|                                                                       |
1806*4882a593Smuzhiyun| And yes, this does mean that it is unhelpful to send POSIX signals to |
1807*4882a593Smuzhiyun| random tasks between the time that the scheduler spawns its first     |
1808*4882a593Smuzhiyun| kthread and the time that RCU's kthreads have all been spawned. If    |
1809*4882a593Smuzhiyun| there ever turns out to be a good reason for sending POSIX signals    |
1810*4882a593Smuzhiyun| during that time, appropriate adjustments will be made. (If it turns  |
1811*4882a593Smuzhiyun| out that POSIX signals are sent during this time for no good reason,  |
1812*4882a593Smuzhiyun| other adjustments will be made, appropriate or otherwise.)            |
1813*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1814*4882a593Smuzhiyun
1815*4882a593SmuzhiyunI learned of these boot-time requirements as a result of a series of
1816*4882a593Smuzhiyunsystem hangs.
1817*4882a593Smuzhiyun
1818*4882a593SmuzhiyunInterrupts and NMIs
1819*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~
1820*4882a593Smuzhiyun
1821*4882a593SmuzhiyunThe Linux kernel has interrupts, and RCU read-side critical sections are
1822*4882a593Smuzhiyunlegal within interrupt handlers and within interrupt-disabled regions of
1823*4882a593Smuzhiyuncode, as are invocations of ``call_rcu()``.
1824*4882a593Smuzhiyun
1825*4882a593SmuzhiyunSome Linux-kernel architectures can enter an interrupt handler from
1826*4882a593Smuzhiyunnon-idle process context, and then just never leave it, instead
1827*4882a593Smuzhiyunstealthily transitioning back to process context. This trick is
1828*4882a593Smuzhiyunsometimes used to invoke system calls from inside the kernel. These
1829*4882a593Smuzhiyun“half-interrupts” mean that RCU has to be very careful about how it
1830*4882a593Smuzhiyuncounts interrupt nesting levels. I learned of this requirement the hard
1831*4882a593Smuzhiyunway during a rewrite of RCU's dyntick-idle code.
1832*4882a593Smuzhiyun
1833*4882a593SmuzhiyunThe Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
1834*4882a593Smuzhiyuncritical sections are legal within NMI handlers. Thankfully, RCU
1835*4882a593Smuzhiyunupdate-side primitives, including ``call_rcu()``, are prohibited within
1836*4882a593SmuzhiyunNMI handlers.
1837*4882a593Smuzhiyun
1838*4882a593SmuzhiyunThe name notwithstanding, some Linux-kernel architectures can have
1839*4882a593Smuzhiyunnested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised
1840*4882a593Smuzhiyunme <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
1841*4882a593Smuzhiyunwith this requirement; he also kindly surprised me with `an
1842*4882a593Smuzhiyunalgorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__
1843*4882a593Smuzhiyunthat meets this requirement.
1844*4882a593Smuzhiyun
1845*4882a593SmuzhiyunFurthermore, NMI handlers can be interrupted by what appear to RCU to be
1846*4882a593Smuzhiyunnormal interrupts. One way that this can happen is for code that
1847*4882a593Smuzhiyundirectly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called
1848*4882a593Smuzhiyunfrom an NMI handler. This astonishing fact of life prompted the current
1849*4882a593Smuzhiyuncode structure, which has ``rcu_irq_enter()`` invoking
1850*4882a593Smuzhiyun``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``.
1851*4882a593SmuzhiyunAnd yes, I also learned of this requirement the hard way.
1852*4882a593Smuzhiyun
1853*4882a593SmuzhiyunLoadable Modules
1854*4882a593Smuzhiyun~~~~~~~~~~~~~~~~
1855*4882a593Smuzhiyun
1856*4882a593SmuzhiyunThe Linux kernel has loadable modules, and these modules can also be
1857*4882a593Smuzhiyununloaded. After a given module has been unloaded, any attempt to call
1858*4882a593Smuzhiyunone of its functions results in a segmentation fault. The module-unload
1859*4882a593Smuzhiyunfunctions must therefore cancel any delayed calls to loadable-module
1860*4882a593Smuzhiyunfunctions, for example, any outstanding ``mod_timer()`` must be dealt
1861*4882a593Smuzhiyunwith via ``del_timer_sync()`` or similar.
1862*4882a593Smuzhiyun
1863*4882a593SmuzhiyunUnfortunately, there is no way to cancel an RCU callback; once you
1864*4882a593Smuzhiyuninvoke ``call_rcu()``, the callback function is eventually going to be
1865*4882a593Smuzhiyuninvoked, unless the system goes down first. Because it is normally
1866*4882a593Smuzhiyunconsidered socially irresponsible to crash the system in response to a
1867*4882a593Smuzhiyunmodule unload request, we need some other way to deal with in-flight RCU
1868*4882a593Smuzhiyuncallbacks.
1869*4882a593Smuzhiyun
1870*4882a593SmuzhiyunRCU therefore provides ``rcu_barrier()``, which waits until all
1871*4882a593Smuzhiyunin-flight RCU callbacks have been invoked. If a module uses
1872*4882a593Smuzhiyun``call_rcu()``, its exit function should therefore prevent any future
1873*4882a593Smuzhiyuninvocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory,
1874*4882a593Smuzhiyunthe underlying module-unload code could invoke ``rcu_barrier()``
1875*4882a593Smuzhiyununconditionally, but in practice this would incur unacceptable
1876*4882a593Smuzhiyunlatencies.
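
For example, a module-exit function following this pattern might look
as follows, where ``foo_stop_new_callbacks()`` is a hypothetical helper
that prevents any further posting of that module's callbacks:

  ::

       #include <linux/module.h>
       #include <linux/rcupdate.h>

       static void __exit foo_exit(void)
       {
               /* First, prevent any new invocations of call_rcu()... */
               foo_stop_new_callbacks();       /* hypothetical helper */

               /* ...then wait for all previously posted callbacks to be
                * invoked, after which the module's functions can no
                * longer be called from RCU callbacks. */
               rcu_barrier();
       }
       module_exit(foo_exit);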
1877*4882a593Smuzhiyun
1878*4882a593SmuzhiyunNikita Danilov noted this requirement for an analogous
1879*4882a593Smuzhiyunfilesystem-unmount situation, and Dipankar Sarma incorporated
1880*4882a593Smuzhiyun``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module
1881*4882a593Smuzhiyununloading became apparent later.
1882*4882a593Smuzhiyun
1883*4882a593Smuzhiyun.. important::
1884*4882a593Smuzhiyun
1885*4882a593Smuzhiyun   The ``rcu_barrier()`` function is not, repeat,
1886*4882a593Smuzhiyun   *not*, obligated to wait for a grace period. It is instead only required
1887*4882a593Smuzhiyun   to wait for RCU callbacks that have already been posted. Therefore, if
1888*4882a593Smuzhiyun   there are no RCU callbacks posted anywhere in the system,
1889*4882a593Smuzhiyun   ``rcu_barrier()`` is within its rights to return immediately. Even if
1890*4882a593Smuzhiyun   there are callbacks posted, ``rcu_barrier()`` does not necessarily need
1891*4882a593Smuzhiyun   to wait for a grace period.
1892*4882a593Smuzhiyun
1893*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1894*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
1895*4882a593Smuzhiyun+-----------------------------------------------------------------------+
| Wait a minute! Each RCU callback must wait for a grace period to      |
1897*4882a593Smuzhiyun| complete, and ``rcu_barrier()`` must wait for each pre-existing       |
1898*4882a593Smuzhiyun| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to   |
1899*4882a593Smuzhiyun| wait for a full grace period if there is even one callback posted     |
1900*4882a593Smuzhiyun| anywhere in the system?                                               |
1901*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1902*4882a593Smuzhiyun| **Answer**:                                                           |
1903*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1904*4882a593Smuzhiyun| Absolutely not!!!                                                     |
| Yes, each RCU callback must wait for a grace period to complete, but  |
1906*4882a593Smuzhiyun| it might well be partly (or even completely) finished waiting by the  |
1907*4882a593Smuzhiyun| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()``    |
1908*4882a593Smuzhiyun| need only wait for the remaining portion of the grace period to       |
1909*4882a593Smuzhiyun| elapse. So even if there are quite a few callbacks posted,            |
1910*4882a593Smuzhiyun| ``rcu_barrier()`` might well return quite quickly.                    |
1911*4882a593Smuzhiyun|                                                                       |
1912*4882a593Smuzhiyun| So if you need to wait for a grace period as well as for all          |
1913*4882a593Smuzhiyun| pre-existing callbacks, you will need to invoke both                  |
1914*4882a593Smuzhiyun| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, |
1915*4882a593Smuzhiyun| you can always use workqueues to invoke them concurrently.            |
1916*4882a593Smuzhiyun+-----------------------------------------------------------------------+
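
A sequential sketch of that last point, waiting both for a full grace
period and for all pre-existing callbacks (as noted in the answer, the
two calls may instead be invoked concurrently via workqueues):

  ::

       synchronize_rcu();      /* Wait for pre-existing readers. */
       rcu_barrier();          /* Wait for previously posted callbacks. */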
1917*4882a593Smuzhiyun
1918*4882a593SmuzhiyunHotplug CPU
1919*4882a593Smuzhiyun~~~~~~~~~~~
1920*4882a593Smuzhiyun
1921*4882a593SmuzhiyunThe Linux kernel supports CPU hotplug, which means that CPUs can come
1922*4882a593Smuzhiyunand go. It is of course illegal to use any RCU API member from an
1923*4882a593Smuzhiyunoffline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
1924*4882a593Smuzhiyuncritical sections. This requirement was present from day one in
1925*4882a593SmuzhiyunDYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
1926*4882a593Smuzhiyunimplementation is “interesting.”
1927*4882a593Smuzhiyun
1928*4882a593SmuzhiyunThe Linux-kernel CPU-hotplug implementation has notifiers that are used
1929*4882a593Smuzhiyunto allow the various kernel subsystems (including RCU) to respond
1930*4882a593Smuzhiyunappropriately to a given CPU-hotplug operation. Most RCU operations may
1931*4882a593Smuzhiyunbe invoked from CPU-hotplug notifiers, including even synchronous
1932*4882a593Smuzhiyungrace-period operations such as ``synchronize_rcu()`` and
1933*4882a593Smuzhiyun``synchronize_rcu_expedited()``.
1934*4882a593Smuzhiyun
However, all-callback-wait operations such as ``rcu_barrier()`` are
not supported, due to the fact that there are phases of CPU-hotplug
operations where the outgoing CPU's callbacks will not be invoked until
after the CPU-hotplug operation ends, which could result in deadlock.
Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations during its
execution, which results in another type of deadlock when invoked from a
CPU-hotplug notifier.
1942*4882a593Smuzhiyun
1943*4882a593SmuzhiyunScheduler and RCU
1944*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~
1945*4882a593Smuzhiyun
1946*4882a593SmuzhiyunRCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
1947*4882a593Smuzhiyunaccumulation by these kthreads. This requirement was no surprise, but
1948*4882a593SmuzhiyunRCU's violation of it when running context-switch-heavy workloads when
1949*4882a593Smuzhiyunbuilt with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
1950*4882a593Smuzhiyun[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
1951*4882a593SmuzhiyunRCU has made good progress towards meeting this requirement, even for
1952*4882a593Smuzhiyuncontext-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
1953*4882a593Smuzhiyunroom for further improvement.
1954*4882a593Smuzhiyun
There is no longer any prohibition against holding any of the
scheduler's runqueue or priority-inheritance spinlocks across an
1957*4882a593Smuzhiyun``rcu_read_unlock()``, even if interrupts and preemption were enabled
1958*4882a593Smuzhiyunsomewhere within the corresponding RCU read-side critical section.
1959*4882a593SmuzhiyunTherefore, it is now perfectly legal to execute ``rcu_read_lock()``
1960*4882a593Smuzhiyunwith preemption enabled, acquire one of the scheduler locks, and hold
1961*4882a593Smuzhiyunthat lock across the matching ``rcu_read_unlock()``.
1962*4882a593Smuzhiyun
1963*4882a593SmuzhiyunSimilarly, the RCU flavor consolidation has removed the need for negative
1964*4882a593Smuzhiyunnesting.  The fact that interrupt-disabled regions of code act as RCU
1965*4882a593Smuzhiyunread-side critical sections implicitly avoids earlier issues that used
to result in destructive recursion via interrupt handlers' use of RCU.
1967*4882a593Smuzhiyun
1968*4882a593SmuzhiyunTracing and RCU
1969*4882a593Smuzhiyun~~~~~~~~~~~~~~~
1970*4882a593Smuzhiyun
1971*4882a593SmuzhiyunIt is possible to use tracing on RCU code, but tracing itself uses RCU.
1972*4882a593SmuzhiyunFor this reason, ``rcu_dereference_raw_check()`` is provided for use
1973*4882a593Smuzhiyunby tracing, which avoids the destructive recursion that could otherwise
1974*4882a593Smuzhiyunensue. This API is also used by virtualization in some architectures,
1975*4882a593Smuzhiyunwhere RCU readers execute in environments in which tracing cannot be
1976*4882a593Smuzhiyunused. The tracing folks both located the requirement and provided the
1977*4882a593Smuzhiyunneeded fix, so this surprise requirement was relatively painless.
1978*4882a593Smuzhiyun
1979*4882a593SmuzhiyunAccesses to User Memory and RCU
1980*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1981*4882a593Smuzhiyun
1982*4882a593SmuzhiyunThe kernel needs to access user-space memory, for example, to access data
1983*4882a593Smuzhiyunreferenced by system-call parameters.  The ``get_user()`` macro does this job.
1984*4882a593Smuzhiyun
1985*4882a593SmuzhiyunHowever, user-space memory might well be paged out, which means that
1986*4882a593Smuzhiyun``get_user()`` might well page-fault and thus block while waiting for the
1987*4882a593Smuzhiyunresulting I/O to complete.  It would be a very bad thing for the compiler to
1988*4882a593Smuzhiyunreorder a ``get_user()`` invocation into an RCU read-side critical section.
1989*4882a593Smuzhiyun
1990*4882a593SmuzhiyunFor example, suppose that the source code looked like this:
1991*4882a593Smuzhiyun
1992*4882a593Smuzhiyun  ::
1993*4882a593Smuzhiyun
1994*4882a593Smuzhiyun       1 rcu_read_lock();
1995*4882a593Smuzhiyun       2 p = rcu_dereference(gp);
1996*4882a593Smuzhiyun       3 v = p->value;
1997*4882a593Smuzhiyun       4 rcu_read_unlock();
1998*4882a593Smuzhiyun       5 get_user(user_v, user_p);
1999*4882a593Smuzhiyun       6 do_something_with(v, user_v);
2000*4882a593Smuzhiyun
2001*4882a593SmuzhiyunThe compiler must not be permitted to transform this source code into
2002*4882a593Smuzhiyunthe following:
2003*4882a593Smuzhiyun
2004*4882a593Smuzhiyun  ::
2005*4882a593Smuzhiyun
2006*4882a593Smuzhiyun       1 rcu_read_lock();
2007*4882a593Smuzhiyun       2 p = rcu_dereference(gp);
2008*4882a593Smuzhiyun       3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
2009*4882a593Smuzhiyun       4 v = p->value;
2010*4882a593Smuzhiyun       5 rcu_read_unlock();
2011*4882a593Smuzhiyun       6 do_something_with(v, user_v);
2012*4882a593Smuzhiyun
2013*4882a593SmuzhiyunIf the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel
2014*4882a593Smuzhiyunbuild, and if ``get_user()`` did page fault, the result would be a quiescent
2015*4882a593Smuzhiyunstate in the middle of an RCU read-side critical section.  This misplaced
2016*4882a593Smuzhiyunquiescent state could result in line 4 being a use-after-free access,
2017*4882a593Smuzhiyunwhich could be bad for your kernel's actuarial statistics.  Similar examples
2018*4882a593Smuzhiyuncan be constructed with the call to ``get_user()`` preceding the
2019*4882a593Smuzhiyun``rcu_read_lock()``.
2020*4882a593Smuzhiyun
2021*4882a593SmuzhiyunUnfortunately, ``get_user()`` doesn't have any particular ordering properties,
2022*4882a593Smuzhiyunand in some architectures the underlying ``asm`` isn't even marked
``volatile``.  And even if it were marked ``volatile``, the above access to
2024*4882a593Smuzhiyun``p->value`` is not volatile, so the compiler would not have any reason to keep
2025*4882a593Smuzhiyunthose two accesses in order.
2026*4882a593Smuzhiyun
2027*4882a593SmuzhiyunTherefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
2028*4882a593Smuzhiyun``rcu_read_unlock()`` must act as compiler barriers, at least for outermost
2029*4882a593Smuzhiyuninstances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set
2030*4882a593Smuzhiyunof RCU read-side critical sections.
2031*4882a593Smuzhiyun
2032*4882a593SmuzhiyunEnergy Efficiency
2033*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~
2034*4882a593Smuzhiyun
2035*4882a593SmuzhiyunInterrupting idle CPUs is considered socially unacceptable, especially
2036*4882a593Smuzhiyunby people with battery-powered embedded systems. RCU therefore conserves
2037*4882a593Smuzhiyunenergy by detecting which CPUs are idle, including tracking CPUs that
2038*4882a593Smuzhiyunhave been interrupted from idle. This is a large part of the
2039*4882a593Smuzhiyunenergy-efficiency requirement, so I learned of this via an irate phone
2040*4882a593Smuzhiyuncall.
2041*4882a593Smuzhiyun
2042*4882a593SmuzhiyunBecause RCU avoids interrupting idle CPUs, it is illegal to execute an
2043*4882a593SmuzhiyunRCU read-side critical section on an idle CPU. (Kernels built with
2044*4882a593Smuzhiyun``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()``
macro and ``_rcuidle`` event tracing are provided to work around this
2046*4882a593Smuzhiyunrestriction. In addition, ``rcu_is_watching()`` may be used to test
2047*4882a593Smuzhiyunwhether or not it is currently legal to run RCU read-side critical
2048*4882a593Smuzhiyunsections on this CPU. I learned of the need for diagnostics on the one
2049*4882a593Smuzhiyunhand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.
2050*4882a593SmuzhiyunSteven Rostedt supplied ``_rcuidle`` event tracing, which is used quite
2051*4882a593Smuzhiyunheavily in the idle loop. However, there are some restrictions on the
code placed within ``RCU_NONIDLE()`` (a legal-use sketch follows the
list):
2053*4882a593Smuzhiyun
2054*4882a593Smuzhiyun#. Blocking is prohibited. In practice, this is not a serious
2055*4882a593Smuzhiyun   restriction given that idle tasks are prohibited from blocking to
2056*4882a593Smuzhiyun   begin with.
#. Although ``RCU_NONIDLE()`` invocations may nest, they cannot nest
   indefinitely deeply. However, given that they can be nested on the
2059*4882a593Smuzhiyun   order of a million deep, even on 32-bit systems, this should not be a
2060*4882a593Smuzhiyun   serious restriction. This nesting limit would probably be reached
2061*4882a593Smuzhiyun   long after the compiler OOMed or the stack overflowed.
2062*4882a593Smuzhiyun#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that
2063*4882a593Smuzhiyun   same ``RCU_NONIDLE()``. For example, the following is grossly
2064*4882a593Smuzhiyun   illegal:
2065*4882a593Smuzhiyun
2066*4882a593Smuzhiyun      ::
2067*4882a593Smuzhiyun
2068*4882a593Smuzhiyun	  1     RCU_NONIDLE({
2069*4882a593Smuzhiyun	  2       do_something();
2070*4882a593Smuzhiyun	  3       goto bad_idea;  /* BUG!!! */
2071*4882a593Smuzhiyun	  4       do_something_else();});
2072*4882a593Smuzhiyun	  5   bad_idea:
2073*4882a593Smuzhiyun
2074*4882a593Smuzhiyun
2075*4882a593Smuzhiyun   It is just as illegal to transfer control into the middle of
2076*4882a593Smuzhiyun   ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in
2077*4882a593Smuzhiyun   as long as you also transferred out, but in practice you could also
2078*4882a593Smuzhiyun   expect to get sharply worded review comments.
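
By way of contrast, the following sketch shows a legal use, in which
control enters and leaves strictly via ``RCU_NONIDLE()``'s argument,
and in which ``do_something_with_rcu()`` is a hypothetical function
containing an RCU read-side critical section:

  ::

       /* From the idle loop: make RCU pay attention to this otherwise-
        * idle CPU for the duration of the enclosed statement. */
       RCU_NONIDLE(do_something_with_rcu());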
2079*4882a593Smuzhiyun
It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU
running in userspace. RCU must therefore track ``nohz_full`` userspace
execution. In particular, RCU must be able to sample state at two points
in time, and be able to determine whether some other CPU spent any of
the intervening time idle and/or executing in userspace.
2085*4882a593Smuzhiyun
2086*4882a593SmuzhiyunThese energy-efficiency requirements have proven quite difficult to
understand and to meet; for example, there have been more than five
2088*4882a593Smuzhiyunclean-sheet rewrites of RCU's energy-efficiency code, the last of which
2089*4882a593Smuzhiyunwas finally able to demonstrate `real energy savings running on real
2090*4882a593Smuzhiyunhardware
2091*4882a593Smuzhiyun[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__.
As noted earlier, I learned of many of these requirements via angry
phone calls: flaming me on the Linux-kernel mailing list was apparently
not sufficient for users to fully vent their ire at RCU's
energy-efficiency bugs!
2095*4882a593Smuzhiyun
2096*4882a593SmuzhiyunScheduling-Clock Interrupts and RCU
2097*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2098*4882a593Smuzhiyun
2099*4882a593SmuzhiyunThe kernel transitions between in-kernel non-idle execution, userspace
2100*4882a593Smuzhiyunexecution, and the idle loop. Depending on kernel configuration, RCU
2101*4882a593Smuzhiyunhandles these states differently:
2102*4882a593Smuzhiyun
2103*4882a593Smuzhiyun+-----------------+------------------+------------------+-----------------+
2104*4882a593Smuzhiyun| ``HZ`` Kconfig  | In-Kernel        | Usermode         | Idle            |
2105*4882a593Smuzhiyun+=================+==================+==================+=================+
2106*4882a593Smuzhiyun| ``HZ_PERIODIC`` | Can rely on      | Can rely on      | Can rely on     |
2107*4882a593Smuzhiyun|                 | scheduling-clock | scheduling-clock | RCU's           |
2108*4882a593Smuzhiyun|                 | interrupt.       | interrupt and    | dyntick-idle    |
2109*4882a593Smuzhiyun|                 |                  | its detection    | detection.      |
2110*4882a593Smuzhiyun|                 |                  | of interrupt     |                 |
2111*4882a593Smuzhiyun|                 |                  | from usermode.   |                 |
2112*4882a593Smuzhiyun+-----------------+------------------+------------------+-----------------+
2113*4882a593Smuzhiyun| ``NO_HZ_IDLE``  | Can rely on      | Can rely on      | Can rely on     |
2114*4882a593Smuzhiyun|                 | scheduling-clock | scheduling-clock | RCU's           |
2115*4882a593Smuzhiyun|                 | interrupt.       | interrupt and    | dyntick-idle    |
2116*4882a593Smuzhiyun|                 |                  | its detection    | detection.      |
2117*4882a593Smuzhiyun|                 |                  | of interrupt     |                 |
2118*4882a593Smuzhiyun|                 |                  | from usermode.   |                 |
2119*4882a593Smuzhiyun+-----------------+------------------+------------------+-----------------+
2120*4882a593Smuzhiyun| ``NO_HZ_FULL``  | Can only         | Can rely on      | Can rely on     |
2121*4882a593Smuzhiyun|                 | sometimes rely   | RCU's            | RCU's           |
2122*4882a593Smuzhiyun|                 | on               | dyntick-idle     | dyntick-idle    |
2123*4882a593Smuzhiyun|                 | scheduling-clock | detection.       | detection.      |
2124*4882a593Smuzhiyun|                 | interrupt. In    |                  |                 |
2125*4882a593Smuzhiyun|                 | other cases, it  |                  |                 |
2126*4882a593Smuzhiyun|                 | is necessary to  |                  |                 |
2127*4882a593Smuzhiyun|                 | bound kernel     |                  |                 |
2128*4882a593Smuzhiyun|                 | execution times  |                  |                 |
2129*4882a593Smuzhiyun|                 | and/or use       |                  |                 |
2130*4882a593Smuzhiyun|                 | IPIs.            |                  |                 |
2131*4882a593Smuzhiyun+-----------------+------------------+------------------+-----------------+
2132*4882a593Smuzhiyun
2133*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2134*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
2135*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2136*4882a593Smuzhiyun| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the              |
2137*4882a593Smuzhiyun| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and             |
2138*4882a593Smuzhiyun| ``NO_HZ_IDLE`` do?                                                    |
2139*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2140*4882a593Smuzhiyun| **Answer**:                                                           |
2141*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2142*4882a593Smuzhiyun| Because, as a performance optimization, ``NO_HZ_FULL`` does not       |
2143*4882a593Smuzhiyun| necessarily re-enable the scheduling-clock interrupt on entry to each |
2144*4882a593Smuzhiyun| and every system call.                                                |
2145*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2146*4882a593Smuzhiyun
2147*4882a593SmuzhiyunHowever, RCU must be reliably informed as to whether any given CPU is
2148*4882a593Smuzhiyuncurrently in the idle loop, and, for ``NO_HZ_FULL``, also whether that
2149*4882a593SmuzhiyunCPU is executing in usermode, as discussed
2150*4882a593Smuzhiyun`earlier <#Energy%20Efficiency>`__. It also requires that the
2151*4882a593Smuzhiyunscheduling-clock interrupt be enabled when RCU needs it to be:
2152*4882a593Smuzhiyun
2153*4882a593Smuzhiyun#. If a CPU is either idle or executing in usermode, and RCU believes it
2154*4882a593Smuzhiyun   is non-idle, the scheduling-clock tick had better be running.
2155*4882a593Smuzhiyun   Otherwise, you will get RCU CPU stall warnings. Or at best, very long
2156*4882a593Smuzhiyun   (11-second) grace periods, with a pointless IPI waking the CPU from
2157*4882a593Smuzhiyun   time to time.
2158*4882a593Smuzhiyun#. If a CPU is in a portion of the kernel that executes RCU read-side
2159*4882a593Smuzhiyun   critical sections, and RCU believes this CPU to be idle, you will get
2160*4882a593Smuzhiyun   random memory corruption. **DON'T DO THIS!!!**
2161*4882a593Smuzhiyun   This is one reason to test with lockdep, which will complain about
2162*4882a593Smuzhiyun   this sort of thing.
2163*4882a593Smuzhiyun#. If a CPU is in a portion of the kernel that is absolutely positively
2164*4882a593Smuzhiyun   no-joking guaranteed to never execute any RCU read-side critical
2165*4882a593Smuzhiyun   sections, and RCU believes this CPU to be idle, no problem. This
2166*4882a593Smuzhiyun   sort of thing is used by some architectures for light-weight
2167*4882a593Smuzhiyun   exception handlers, which can then avoid the overhead of
2168*4882a593Smuzhiyun   ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and
2169*4882a593Smuzhiyun   exit, respectively. Some go further and avoid the entireties of
2170*4882a593Smuzhiyun   ``irq_enter()`` and ``irq_exit()``.
2171*4882a593Smuzhiyun   Just make very sure you are running some of your tests with
2172*4882a593Smuzhiyun   ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in
2173*4882a593Smuzhiyun   fact joking about not doing RCU read-side critical sections.
2174*4882a593Smuzhiyun#. If a CPU is executing in the kernel with the scheduling-clock
2175*4882a593Smuzhiyun   interrupt disabled and RCU believes this CPU to be non-idle, and if
2176*4882a593Smuzhiyun   the CPU goes idle (from an RCU perspective) every few jiffies, no
2177*4882a593Smuzhiyun   problem. It is usually OK for there to be the occasional gap between
2178*4882a593Smuzhiyun   idle periods of up to a second or so.
2179*4882a593Smuzhiyun   If the gap grows too long, you get RCU CPU stall warnings.
2180*4882a593Smuzhiyun#. If a CPU is either idle or executing in usermode, and RCU believes it
2181*4882a593Smuzhiyun   to be idle, of course no problem.
2182*4882a593Smuzhiyun#. If a CPU is executing in the kernel, the kernel code path is passing
2183*4882a593Smuzhiyun   through quiescent states at a reasonable frequency (preferably about
2184*4882a593Smuzhiyun   once per few jiffies, but the occasional excursion to a second or so
2185*4882a593Smuzhiyun   is usually OK) and the scheduling-clock interrupt is enabled, of
2186*4882a593Smuzhiyun   course no problem.
2187*4882a593Smuzhiyun   If the gap between a successive pair of quiescent states grows too
2188*4882a593Smuzhiyun   long, you get RCU CPU stall warnings.
2189*4882a593Smuzhiyun
2190*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2191*4882a593Smuzhiyun| **Quick Quiz**:                                                       |
2192*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2193*4882a593Smuzhiyun| But what if my driver has a hardware interrupt handler that can run   |
| for many seconds? I cannot invoke ``schedule()`` from a hardware      |
2195*4882a593Smuzhiyun| interrupt handler, after all!                                         |
2196*4882a593Smuzhiyun+-----------------------------------------------------------------------+
2197*4882a593Smuzhiyun| **Answer**:                                                           |
2198*4882a593Smuzhiyun+-----------------------------------------------------------------------+
| One approach is to do ``rcu_irq_exit(); rcu_irq_enter();`` every so   |
2200*4882a593Smuzhiyun| often. But given that long-running interrupt handlers can cause other |
2201*4882a593Smuzhiyun| problems, not least for response time, shouldn't you work to keep     |
2202*4882a593Smuzhiyun| your interrupt handler's runtime within reasonable bounds?            |
2203*4882a593Smuzhiyun+-----------------------------------------------------------------------+
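
A sketch of the quiz's suggested workaround, in which
``do_a_chunk_of_work()`` is a hypothetical function that performs a
bounded amount of the handler's processing:

  ::

       #include <linux/interrupt.h>

       static irqreturn_t foo_irq_handler(int irq, void *dev)
       {
               while (do_a_chunk_of_work(dev)) {       /* hypothetical */
                       /* Momentarily tell RCU that this CPU has left its
                        * interrupt handler, providing a quiescent state. */
                       rcu_irq_exit();
                       rcu_irq_enter();
               }
               return IRQ_HANDLED;
       }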
2204*4882a593Smuzhiyun
2205*4882a593SmuzhiyunBut as long as RCU is properly informed of kernel state transitions
2206*4882a593Smuzhiyunbetween in-kernel execution, usermode execution, and idle, and as long
2207*4882a593Smuzhiyunas the scheduling-clock interrupt is enabled when RCU needs it to be,
2208*4882a593Smuzhiyunyou can rest assured that the bugs you encounter will be in some other
2209*4882a593Smuzhiyunpart of RCU or some other part of the kernel!
2210*4882a593Smuzhiyun
2211*4882a593SmuzhiyunMemory Efficiency
2212*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~
2213*4882a593Smuzhiyun
2214*4882a593SmuzhiyunAlthough small-memory non-realtime systems can simply use Tiny RCU, code
2215*4882a593Smuzhiyunsize is only one aspect of memory efficiency. Another aspect is the size
2216*4882a593Smuzhiyunof the ``rcu_head`` structure used by ``call_rcu()`` and
2217*4882a593Smuzhiyun``kfree_rcu()``. Although this structure contains nothing more than a
2218*4882a593Smuzhiyunpair of pointers, it does appear in many RCU-protected data structures,
2219*4882a593Smuzhiyunincluding some that are size critical. The ``page`` structure is a case
2220*4882a593Smuzhiyunin point, as evidenced by the many occurrences of the ``union`` keyword
2221*4882a593Smuzhiyunwithin that structure.
2222*4882a593Smuzhiyun
2223*4882a593SmuzhiyunThis need for memory efficiency is one reason that RCU uses hand-crafted
2224*4882a593Smuzhiyunsingly linked lists to track the ``rcu_head`` structures that are
2225*4882a593Smuzhiyunwaiting for a grace period to elapse. It is also the reason why
2226*4882a593Smuzhiyun``rcu_head`` structures do not contain debug information, such as fields
2227*4882a593Smuzhiyuntracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that
2228*4882a593Smuzhiyunposted them. Although this information might appear in debug-only kernel
2229*4882a593Smuzhiyunbuilds at some point, in the meantime, the ``->func`` field will often
2230*4882a593Smuzhiyunprovide the needed debug information.
2231*4882a593Smuzhiyun
2232*4882a593SmuzhiyunHowever, in some cases, the need for memory efficiency leads to even
2233*4882a593Smuzhiyunmore extreme measures. Returning to the ``page`` structure, the
2234*4882a593Smuzhiyun``rcu_head`` field shares storage with a great many other structures
2235*4882a593Smuzhiyunthat are used at various points in the corresponding page's lifetime. In
2236*4882a593Smuzhiyunorder to correctly resolve certain `race
2237*4882a593Smuzhiyunconditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__,
2238*4882a593Smuzhiyunthe Linux kernel's memory-management subsystem needs a particular bit to
2239*4882a593Smuzhiyunremain zero during all phases of grace-period processing, and that bit
2240*4882a593Smuzhiyunhappens to map to the bottom bit of the ``rcu_head`` structure's
2241*4882a593Smuzhiyun``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
2242*4882a593Smuzhiyunused to post the callback, as opposed to ``kfree_rcu()`` or some future
2243*4882a593Smuzhiyun“lazy” variant of ``call_rcu()`` that might one day be created for
2244*4882a593Smuzhiyunenergy-efficiency purposes.
2245*4882a593Smuzhiyun
2246*4882a593SmuzhiyunThat said, there are limits. RCU requires that the ``rcu_head``
2247*4882a593Smuzhiyunstructure be aligned to a two-byte boundary, and passing a misaligned
2248*4882a593Smuzhiyun``rcu_head`` structure to one of the ``call_rcu()`` family of functions
2249*4882a593Smuzhiyunwill result in a splat. It is therefore necessary to exercise caution
2250*4882a593Smuzhiyunwhen packing structures containing fields of type ``rcu_head``. Why not
2251*4882a593Smuzhiyuna four-byte or even eight-byte alignment requirement? Because the m68k
2252*4882a593Smuzhiyunarchitecture provides only two-byte alignment, and thus acts as
2253*4882a593Smuzhiyunalignment's least common denominator.
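
For example, the following (hypothetical) packed structure could place
its ``rcu_head`` field on an odd boundary, in which case passing
``&bad->rh`` to ``call_rcu()`` would splat:

  ::

       struct bad_example {
               char c;
               struct rcu_head rh;     /* Possibly odd-aligned: BAD!!! */
       } __packed;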
2254*4882a593Smuzhiyun
2255*4882a593SmuzhiyunThe reason for reserving the bottom bit of pointers to ``rcu_head``
2256*4882a593Smuzhiyunstructures is to leave the door open to “lazy” callbacks whose
2257*4882a593Smuzhiyuninvocations can safely be deferred. Deferring invocation could
2258*4882a593Smuzhiyunpotentially have energy-efficiency benefits, but only if the rate of
2259*4882a593Smuzhiyunnon-lazy callbacks decreases significantly for some important workload.
2260*4882a593SmuzhiyunIn the meantime, reserving the bottom bit keeps this option open in case
2261*4882a593Smuzhiyunit one day becomes useful.
2262*4882a593Smuzhiyun
2263*4882a593SmuzhiyunPerformance, Scalability, Response Time, and Reliability
2264*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2265*4882a593Smuzhiyun
2266*4882a593SmuzhiyunExpanding on the `earlier
2267*4882a593Smuzhiyundiscussion <#Performance%20and%20Scalability>`__, RCU is used heavily by
2268*4882a593Smuzhiyunhot code paths in performance-critical portions of the Linux kernel's
2269*4882a593Smuzhiyunnetworking, security, virtualization, and scheduling code paths. RCU
2270*4882a593Smuzhiyunmust therefore use efficient implementations, especially in its
2271*4882a593Smuzhiyunread-side primitives. To that end, it would be good if preemptible RCU's
implementation of ``rcu_read_lock()`` could be inlined; however, doing
2273*4882a593Smuzhiyunthis requires resolving ``#include`` issues with the ``task_struct``
2274*4882a593Smuzhiyunstructure.
2275*4882a593Smuzhiyun
2276*4882a593SmuzhiyunThe Linux kernel supports hardware configurations with up to 4096 CPUs,
2277*4882a593Smuzhiyunwhich means that RCU must be extremely scalable. Algorithms that involve
2278*4882a593Smuzhiyunfrequent acquisitions of global locks or frequent atomic operations on
2279*4882a593Smuzhiyunglobal variables simply cannot be tolerated within the RCU
2280*4882a593Smuzhiyunimplementation. RCU therefore makes heavy use of a combining tree based
2281*4882a593Smuzhiyunon the ``rcu_node`` structure. RCU is required to tolerate all CPUs
2282*4882a593Smuzhiyuncontinuously invoking any combination of RCU's runtime primitives with
2283*4882a593Smuzhiyunminimal per-operation overhead. In fact, in many cases, increasing load
2284*4882a593Smuzhiyunmust *decrease* the per-operation overhead, witness the batching
2285*4882a593Smuzhiyunoptimizations for ``synchronize_rcu()``, ``call_rcu()``,
2286*4882a593Smuzhiyun``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general
2287*4882a593Smuzhiyunrule, RCU must cheerfully accept whatever the rest of the Linux kernel
2288*4882a593Smuzhiyundecides to throw at it.
2289*4882a593Smuzhiyun
2290*4882a593SmuzhiyunThe Linux kernel is used for real-time workloads, especially in
2291*4882a593Smuzhiyunconjunction with the `-rt
2292*4882a593Smuzhiyunpatchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The
2293*4882a593Smuzhiyunreal-time-latency response requirements are such that the traditional
2294*4882a593Smuzhiyunapproach of disabling preemption across RCU read-side critical sections
2295*4882a593Smuzhiyunis inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
2296*4882a593Smuzhiyunan RCU implementation that allows RCU read-side critical sections to be
2297*4882a593Smuzhiyunpreempted. This requirement made its presence known after users made it
2298*4882a593Smuzhiyunclear that an earlier `real-time
2299*4882a593Smuzhiyunpatch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in
2300*4882a593Smuzhiyunconjunction with some `RCU
2301*4882a593Smuzhiyunissues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__
2302*4882a593Smuzhiyunencountered by a very early version of the -rt patchset.
2303*4882a593Smuzhiyun
2304*4882a593SmuzhiyunIn addition, RCU must make do with a sub-100-microsecond real-time
2305*4882a593Smuzhiyunlatency budget. In fact, on smaller systems with the -rt patchset, the
2306*4882a593SmuzhiyunLinux kernel provides sub-20-microsecond real-time latencies for the
2307*4882a593Smuzhiyunwhole kernel, including RCU. RCU's scalability and latency must
2308*4882a593Smuzhiyuntherefore be sufficient for these sorts of configurations. To my
2309*4882a593Smuzhiyunsurprise, the sub-100-microsecond real-time latency budget `applies to
2310*4882a593Smuzhiyuneven the largest systems
2311*4882a593Smuzhiyun[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__,
2312*4882a593Smuzhiyunup to and including systems with 4096 CPUs. This real-time requirement
2313*4882a593Smuzhiyunmotivated the grace-period kthread, which also simplified handling of a
2314*4882a593Smuzhiyunnumber of race conditions.
2315*4882a593Smuzhiyun
2316*4882a593SmuzhiyunRCU must avoid degrading real-time response for CPU-bound threads,
2317*4882a593Smuzhiyunwhether executing in usermode (which is one use case for
2318*4882a593Smuzhiyun``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
2319*4882a593Smuzhiyunthe kernel must execute ``cond_resched()`` at least once per few tens of
2320*4882a593Smuzhiyunmilliseconds in order to avoid receiving an IPI from RCU.
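
For example, a CPU-bound loop in the kernel might be sketched as
follows, where ``process_item()`` is a hypothetical function performing
one unit of work:

  ::

       for (i = 0; i < nr_items; i++) {
               process_item(i);        /* hypothetical CPU-bound work */
               cond_resched();         /* Supply RCU with quiescent states. */
       }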
2321*4882a593Smuzhiyun
2322*4882a593SmuzhiyunFinally, RCU's status as a synchronization primitive means that any RCU
2323*4882a593Smuzhiyunfailure can result in arbitrary memory corruption that can be extremely
2324*4882a593Smuzhiyundifficult to debug. This means that RCU must be extremely reliable,
2325*4882a593Smuzhiyunwhich in practice also means that RCU must have an aggressive
2326*4882a593Smuzhiyunstress-test suite. This stress-test suite is called ``rcutorture``.
2327*4882a593Smuzhiyun
2328*4882a593SmuzhiyunAlthough the need for ``rcutorture`` was no surprise, the current
2329*4882a593Smuzhiyunimmense popularity of the Linux kernel is posing interesting—and perhaps
2330*4882a593Smuzhiyununprecedented—validation challenges. To see this, keep in mind that
2331*4882a593Smuzhiyunthere are well over one billion instances of the Linux kernel running
2332*4882a593Smuzhiyuntoday, given Android smartphones, Linux-powered televisions, and
2333*4882a593Smuzhiyunservers. This number can be expected to increase sharply with the advent
2334*4882a593Smuzhiyunof the celebrated Internet of Things.
2335*4882a593Smuzhiyun
2336*4882a593SmuzhiyunSuppose that RCU contains a race condition that manifests on average
2337*4882a593Smuzhiyunonce per million years of runtime. This bug will be occurring about
2338*4882a593Smuzhiyunthree times per *day* across the installed base. RCU could simply hide
2339*4882a593Smuzhiyunbehind hardware error rates, given that no one should really expect
2340*4882a593Smuzhiyuntheir smartphone to last for a million years. However, anyone taking too
2341*4882a593Smuzhiyunmuch comfort from this thought should consider the fact that in most
2342*4882a593Smuzhiyunjurisdictions, a successful multi-year test of a given mechanism, which
2343*4882a593Smuzhiyunmight include a Linux kernel, suffices for a number of types of
2344*4882a593Smuzhiyunsafety-critical certifications. In fact, rumor has it that the Linux
2345*4882a593Smuzhiyunkernel is already being used in production for safety-critical
2346*4882a593Smuzhiyunapplications. I don't know about you, but I would feel quite bad if a
2347*4882a593Smuzhiyunbug in RCU killed someone. Which might explain my recent focus on
2348*4882a593Smuzhiyunvalidation and verification.
2349*4882a593Smuzhiyun
2350*4882a593SmuzhiyunOther RCU Flavors
2351*4882a593Smuzhiyun-----------------
2352*4882a593Smuzhiyun
2353*4882a593SmuzhiyunOne of the more surprising things about RCU is that there are now no
2354*4882a593Smuzhiyunfewer than five *flavors*, or API families. In addition, the primary
2355*4882a593Smuzhiyunflavor that has been the sole focus up to this point has two different
2356*4882a593Smuzhiyunimplementations, non-preemptible and preemptible. The other four flavors
2357*4882a593Smuzhiyunare listed below, with requirements for each described in a separate
2358*4882a593Smuzhiyunsection.
2359*4882a593Smuzhiyun
2360*4882a593Smuzhiyun#. `Bottom-Half Flavor (Historical)`_
2361*4882a593Smuzhiyun#. `Sched Flavor (Historical)`_
2362*4882a593Smuzhiyun#. `Sleepable RCU`_
2363*4882a593Smuzhiyun#. `Tasks RCU`_
2364*4882a593Smuzhiyun
2365*4882a593SmuzhiyunBottom-Half Flavor (Historical)
2366*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2367*4882a593Smuzhiyun
2368*4882a593SmuzhiyunThe RCU-bh flavor of RCU has since been expressed in terms of the other
2369*4882a593SmuzhiyunRCU flavors as part of a consolidation of the three flavors into a
2370*4882a593Smuzhiyunsingle flavor. The read-side API remains, and continues to disable
2371*4882a593Smuzhiyunsoftirq and to be accounted for by lockdep. Much of the material in this
2372*4882a593Smuzhiyunsection is therefore strictly historical in nature.
2373*4882a593Smuzhiyun
2374*4882a593SmuzhiyunThe softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
2375*4882a593Smuzhiyunflavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
2376*4882a593Smuzhiyunflavor of RCU that could withstand the network-based denial-of-service
2377*4882a593Smuzhiyunattacks researched by Robert Olsson. These attacks placed so much
2378*4882a593Smuzhiyunnetworking load on the system that some of the CPUs never exited softirq
2379*4882a593Smuzhiyunexecution, which in turn prevented those CPUs from ever executing a
2380*4882a593Smuzhiyuncontext switch, which, in the RCU implementation of that time, prevented
2381*4882a593Smuzhiyungrace periods from ever ending. The result was an out-of-memory
2382*4882a593Smuzhiyuncondition and a system hang.
2383*4882a593Smuzhiyun
2384*4882a593SmuzhiyunThe solution was the creation of RCU-bh, which does
2385*4882a593Smuzhiyun``local_bh_disable()`` across its read-side critical sections, and which
2386*4882a593Smuzhiyunuses the transition from one type of softirq processing to another as a
2387*4882a593Smuzhiyunquiescent state in addition to context switch, idle, user mode, and
2388*4882a593Smuzhiyunoffline. This means that RCU-bh grace periods can complete even when
2389*4882a593Smuzhiyunsome of the CPUs execute in softirq indefinitely, thus allowing
2390*4882a593Smuzhiyunalgorithms based on RCU-bh to withstand network-based denial-of-service
2391*4882a593Smuzhiyunattacks.
2392*4882a593Smuzhiyun
2393*4882a593SmuzhiyunBecause ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
re-enable softirq handlers, any attempt to start a softirq handler
2395*4882a593Smuzhiyunduring the RCU-bh read-side critical section will be deferred. In this
2396*4882a593Smuzhiyuncase, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
2397*4882a593Smuzhiyuntake considerable time. One can of course argue that this softirq
2398*4882a593Smuzhiyunoverhead should be associated with the code following the RCU-bh
2399*4882a593Smuzhiyunread-side critical section rather than ``rcu_read_unlock_bh()``, but the
2400*4882a593Smuzhiyunfact is that most profiling tools cannot be expected to make this sort
2401*4882a593Smuzhiyunof fine distinction. For example, suppose that a three-millisecond-long
2402*4882a593SmuzhiyunRCU-bh read-side critical section executes during a time of heavy
2403*4882a593Smuzhiyunnetworking load. There will very likely be an attempt to invoke at least
2404*4882a593Smuzhiyunone softirq handler during that three milliseconds, but any such
2405*4882a593Smuzhiyuninvocation will be delayed until the time of the
2406*4882a593Smuzhiyun``rcu_read_unlock_bh()``. This can of course make it appear at first
2407*4882a593Smuzhiyunglance as if ``rcu_read_unlock_bh()`` was executing very slowly.

The `RCU-bh
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
``call_rcu_bh()``, ``rcu_barrier_bh()``, and
``rcu_read_lock_bh_held()``. However, the update-side APIs are now
simple wrappers for other RCU flavors, namely RCU-sched in
``CONFIG_PREEMPT=n`` kernels and RCU-preempt otherwise.

Sched Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~

The RCU-sched flavor of RCU has since been expressed in terms of the
other RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
preemption and to be accounted for by lockdep. Much of the material in
this section is therefore strictly historical in nature.

Before preemptible RCU, waiting for an RCU grace period had the side
effect of also waiting for all pre-existing interrupt and NMI handlers.
However, there are legitimate preemptible-RCU implementations that do
not have this property, given that any point in the code outside of an
RCU read-side critical section can be a quiescent state. Therefore,
*RCU-sched* was created, which follows “classic” RCU in that an
RCU-sched grace period waits for pre-existing interrupt and NMI
handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
RCU-sched APIs have identical implementations, while kernels built with
``CONFIG_PREEMPT=y`` provide a separate implementation for each.
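
For example, an updater might rely on this property to free an element
that pre-existing interrupt handlers might still be referencing, as in
the following sketch, in which ``remove_foo_from_list()`` and the
structure referenced by ``p`` are hypothetical:

   ::

       remove_foo_from_list(p);  /* Unlink p from update-side structures. */
       synchronize_sched();      /* Wait for readers and irq/NMI handlers. */
       kfree(p);                 /* No pre-existing handler can now see p. */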

Note well that in ``CONFIG_PREEMPT=y`` kernels,
``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and
re-enable preemption, respectively. This means that if there was a
preemption attempt during the RCU-sched read-side critical section,
``rcu_read_unlock_sched()`` will enter the scheduler, with all the
latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
this can make it look as if ``rcu_read_unlock_sched()`` was executing
very slowly. However, the highest-priority task won't be preempted, so
that task will enjoy low-overhead ``rcu_read_unlock_sched()``
invocations.

The `RCU-sched
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``,
``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``,
``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``,
``call_rcu_sched()``, ``rcu_barrier_sched()``, and
``rcu_read_lock_sched_held()``. However, anything that disables
preemption also marks an RCU-sched read-side critical section, including
``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
``local_irq_restore()``, and so on.
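
For example, the following sketch is a legitimate RCU-sched reader even
though it never mentions RCU by name; the pointer ``gp`` and the
function ``do_something_light()`` are hypothetical:

   ::

       struct foo *p;

       preempt_disable();              /* Begin RCU-sched reader. */
       p = rcu_dereference_sched(gp);
       if (p)
               do_something_light(p);  /* Must not sleep here. */
       preempt_enable();               /* End reader; may reschedule. */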

Sleepable RCU
~~~~~~~~~~~~~

For well over a decade, someone saying “I need to block within an RCU
read-side critical section” was a reliable indication that this someone
did not understand RCU. After all, if you are always blocking in an RCU
read-side critical section, you can probably afford to use a
higher-overhead synchronization mechanism. However, that changed with
the advent of the Linux kernel's notifiers, whose RCU read-side critical
sections almost never sleep, but sometimes need to. This resulted in the
introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or
*SRCU*.

SRCU allows different domains to be defined, with each such domain
defined by an instance of an ``srcu_struct`` structure. A pointer to
this structure must be passed in to each SRCU function, for example,
``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct``
structure. The key benefit of these domains is that a slow SRCU reader
in one domain does not delay an SRCU grace period in some other domain.
That said, one consequence of these domains is that read-side code must
pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for
example, as follows:

   ::

       int idx;

       idx = srcu_read_lock(&ss);
       do_something();
       srcu_read_unlock(&ss, idx);

As noted above, it is legal to block within SRCU read-side critical
sections; however, with great power comes great responsibility. If you
block forever in one of a given domain's SRCU read-side critical
sections, then that domain's grace periods will also be blocked forever.
Of course, one good way to block forever is to deadlock, which can
happen if any operation in a given domain's SRCU read-side critical
section can wait, either directly or indirectly, for that domain's grace
period to elapse. For example, this results in a self-deadlock:

   ::

       1 int idx;
       2
       3 idx = srcu_read_lock(&ss);
       4 do_something();
       5 synchronize_srcu(&ss);
       6 srcu_read_unlock(&ss, idx);

However, if line 5 acquired a mutex that was held across a
``synchronize_srcu()`` for domain ``ss``, deadlock would still be
possible, as the sketch below illustrates. Furthermore, if line 5
acquired a mutex that was held across a ``synchronize_srcu()`` for some
other domain ``ss1``, and if an ``ss1``-domain SRCU read-side critical
section acquired another mutex that was held across an ``ss``-domain
``synchronize_srcu()``, deadlock would again be possible. Such a
deadlock cycle could extend across an arbitrarily large number of
different SRCU domains. Again, with great power comes great
responsibility.
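
The single-mutex case might look as follows, where the mutex ``m`` and
the two thread functions are hypothetical:

   ::

       void thread_a(void)              /* ss-domain reader. */
       {
               int idx = srcu_read_lock(&ss);

               mutex_lock(&m);          /* Waits on thread_b()... */
               do_something();
               mutex_unlock(&m);
               srcu_read_unlock(&ss, idx);
       }

       void thread_b(void)              /* Updater holding m. */
       {
               mutex_lock(&m);
               synchronize_srcu(&ss);   /* ...which waits on thread_a(). */
               mutex_unlock(&m);
       }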

Unlike the other RCU flavors, SRCU read-side critical sections can run
on idle and even offline CPUs. This ability requires that
``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
which means that SRCU readers will run a bit slower than would RCU
readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
which, in combination with ``srcu_read_unlock()``, guarantees a full
memory barrier.
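
For example, the following sketch orders the reader's accesses before a
later store; the flag ``done`` is hypothetical:

   ::

       idx = srcu_read_lock(&ss);
       /* SRCU read-side critical section. */
       srcu_read_unlock(&ss, idx);
       smp_mb__after_srcu_read_unlock();  /* Full barrier with the unlock. */
       WRITE_ONCE(done, 1);               /* Ordered after reader's accesses. */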

Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
invoked from CPU-hotplug notifiers, because SRCU grace periods make use
of timers, and because timers can be temporarily “stranded” on the
outgoing CPU. This stranding of timers means that timers posted to the
outgoing CPU will not fire until late in the CPU-hotplug process. The
problem is that if a notifier is waiting on an SRCU grace period, that
grace period is waiting on a timer, and that timer is stranded on the
outgoing CPU, then the notifier will never be awakened, in other words,
deadlock has occurred. This same situation of course also prohibits
``srcu_barrier()`` from being invoked from CPU-hotplug notifiers.

SRCU also differs from other RCU flavors in that SRCU's expedited and
non-expedited grace periods are implemented by the same mechanism. This
means that in the current SRCU implementation, expediting a future grace
period has the side effect of expediting all prior grace periods that
have not yet completed. (But please note that this is a property of the
current implementation, not necessarily of future implementations.) In
addition, if SRCU has been idle for longer than the interval specified
by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
by default), and if a ``synchronize_srcu()`` invocation ends this idle
period, that invocation will be automatically expedited.

As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
locking bottleneck present in prior kernel versions. Although this will
allow users to put much heavier stress on ``call_srcu()``, it is
important to note that SRCU does not yet take any special steps to deal
with callback flooding. So if you are posting (say) 10,000 SRCU
callbacks per second per CPU, you are probably totally OK, but if you
intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
run some tests first. SRCU just might need a few adjustments to deal
with that sort of load. Of course, your mileage may vary based on the
speed of your CPUs and the size of your memory.

The `SRCU
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
``srcu_dereference()``, ``srcu_dereference_check()``,
``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
``init_srcu_struct()`` APIs for defining and initializing
``srcu_struct`` structures.
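
Putting several of these APIs together, here is one hedged sketch of
defining and using an SRCU domain; the structure ``foo``, the pointer
``gp``, the lock ``update_lock``, and ``do_something_sleepy()`` are all
hypothetical:

   ::

       #include <linux/srcu.h>

       DEFINE_SRCU(my_domain);          /* Statically defined domain. */
       static struct foo __rcu *gp;     /* SRCU-protected pointer. */

       void reader(void)
       {
               struct foo *p;
               int idx;

               idx = srcu_read_lock(&my_domain);
               p = srcu_dereference(gp, &my_domain);
               if (p)
                       do_something_sleepy(p);  /* Sleeping is legal. */
               srcu_read_unlock(&my_domain, idx);
       }

       void updater(struct foo *newp)   /* Caller holds update_lock. */
       {
               struct foo *oldp;

               oldp = rcu_dereference_protected(gp,
                               lockdep_is_held(&update_lock));
               rcu_assign_pointer(gp, newp);
               synchronize_srcu(&my_domain);  /* Wait for readers. */
               kfree(oldp);
       }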

Tasks RCU
~~~~~~~~~

Some forms of tracing use “trampolines” to handle the binary rewriting
required to install different types of probes. It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU. However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
not work to have these markers in the trampoline itself, because there
would need to be instructions following ``rcu_read_unlock()``. Although
``synchronize_rcu()`` would guarantee that execution reached the
``rcu_read_unlock()``, it would not be able to guarantee that execution
had completely left the trampoline. Worse yet, in some situations
the trampoline's protection must extend a few instructions *prior* to
execution reaching the trampoline. For example, these few instructions
might calculate the address of the trampoline, so that entering the
trampoline would be pre-ordained a surprisingly long time before
execution actually reached the trampoline itself.

The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
critical sections that are delimited by voluntary context switches, that
is, calls to ``schedule()``, ``cond_resched()``, and
``synchronize_rcu_tasks()``. In addition, transitions to and from
userspace execution also delimit tasks-RCU read-side critical sections.

The tasks-RCU API is quite compact, consisting only of
``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines
cannot be preempted, so these APIs map to ``call_rcu()``,
``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In
``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these
three APIs are therefore implemented by separate functions that check
for voluntary context switches.
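
For example, a tracer might use a pattern like the following sketch to
free an old trampoline only after no task can still be executing within
it; the structure ``tramp``, its fields, ``old_tramp``, and
``free_exec_memory()`` are hypothetical:

   ::

       struct tramp {
               void *insns;             /* Executable trampoline code. */
               struct rcu_head rh;
       };

       static void free_tramp_cb(struct rcu_head *rhp)
       {
               struct tramp *t = container_of(rhp, struct tramp, rh);

               free_exec_memory(t->insns);  /* Hypothetical helper. */
               kfree(t);
       }

       /* After unhooking old_tramp from all call sites: */
       call_rcu_tasks(&old_tramp->rh, free_tramp_cb);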

Possible Future Changes
-----------------------

One of the tricks that RCU uses to attain update-side scalability is to
increase grace-period latency with increasing numbers of CPUs. If this
becomes a serious problem, it will be necessary to rework the
grace-period state machine so as to avoid the need for the additional
latency.

RCU disables CPU hotplug in a few places, perhaps most notably in the
``rcu_barrier()`` operations. If there is a strong reason to use
``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
avoid disabling CPU hotplug. This would introduce some complexity, so
there had better be a *very* good reason.

The tradeoff between grace-period latency on the one hand and
interruptions of other CPUs on the other hand may need to be
re-examined. The desire is of course for zero grace-period latency as
well as zero interprocessor interrupts undertaken during an expedited
grace period operation. While this ideal is unlikely to be achievable,
it is quite possible that further improvements can be made.

The multiprocessor implementations of RCU use a combining tree that
groups CPUs so as to reduce lock contention and increase cache locality.
However, this combining tree does not spread its memory across NUMA
nodes nor does it align the CPU groups with hardware features such as
sockets or cores. Such spreading and alignment is currently believed to
be unnecessary because the hotpath read-side primitives do not access
the combining tree, nor does ``call_rcu()`` in the common case. If you
believe that your architecture needs such spreading and alignment, then
your architecture should also benefit from the
``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the
number of CPUs in a socket, NUMA node, or whatever. If the number of
CPUs is too large, use a fraction of the number of CPUs. If the number
of CPUs is a large prime number, well, that certainly is an
“interesting” architectural choice! More flexible arrangements might be
considered, but only if ``rcutree.rcu_fanout_leaf`` has proven
inadequate, and only if the inadequacy has been demonstrated by a
carefully run and realistic system-level workload.
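
For example, on a hypothetical system built from 16-CPU sockets, adding
the following to the kernel boot command line would make each leaf of
the combining tree correspond to one socket:

   ::

       rcutree.rcu_fanout_leaf=16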

Please note that arrangements that require RCU to remap CPU numbers will
require extremely good demonstration of need and full exploration of
alternatives.

RCU's various kthreads are reasonably recent additions. It is quite
likely that adjustments will be required to more gracefully handle
extreme loads. It might also be necessary to be able to relate CPU
utilization by RCU's kthreads and softirq handlers to the code that
instigated this CPU utilization. For example, RCU callback overhead
might be charged back to the originating ``call_rcu()`` instance, though
probably not in production kernels.

Additional work may be required to provide reasonable forward-progress
guarantees under heavy load for grace periods and for callback
invocation.

Summary
-------

This document has presented more than two decades' worth of RCU
requirements. Given that the requirements keep changing, this will not
be the last word on this subject, but at least it serves to get an
important subset of the requirements set forth.

Acknowledgments
---------------

I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg
Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy
Lutomirski for their help in rendering this article human readable, and
to Michelle Rankin for her support of this effort. Other contributions
are acknowledged in the Linux kernel's git archive.