======================================================
A Tour Through TREE_RCU's Grace-Period Memory Ordering
======================================================

August 8, 2017

This article was contributed by Paul E. McKenney

Introduction
============

This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.

What Is Tree RCU's Grace Period Memory Ordering Guarantee?
==========================================================

RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to not see the effects of any accesses following the
end of that grace period that are within RCU read-side critical sections.

Note well that RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
``synchronize_rcu()`` as ``smp_mb()`` on steroids.

RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).

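For example, here is a minimal sketch of this removal-then-free pattern,
assuming a hypothetical RCU-protected list of ``foo`` structures:

::

    struct foo {
            struct list_head list;
            int data;
    };

    void foo_delete(struct foo *fp)
    {
            /* Phase one: make the element unreachable to new readers. */
            list_del_rcu(&fp->list);

            /* Wait for all pre-existing readers to finish. */
            synchronize_rcu();

            /* Phase two: no reader can still hold a reference. */
            kfree(fp);
    }
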
The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.

Tree RCU Grace Period Memory Ordering Building Blocks
=====================================================

The workhorse for RCU's grace-period memory ordering is the
critical section for the ``rcu_node`` structure's
``->lock``. These critical sections use helper functions for lock
acquisition, including ``raw_spin_lock_rcu_node()``,
``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
``raw_spin_unlock_irq_rcu_node()``, and
``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
The key point is that the lock-acquisition functions, including
``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
immediately after successful acquisition of the lock.

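These helpers are thin wrappers; a sketch of the acquisition side,
closely following the definitions in kernel/rcu/rcu.h, looks like this:

::

    /* Sketch: acquire an rcu_node lock, then add full ordering. */
    #define raw_spin_lock_rcu_node(p)                                \
    do {                                                             \
            raw_spin_lock(&ACCESS_PRIVATE(p, lock));                 \
            smp_mb__after_unlock_lock();                             \
    } while (0)
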
Therefore, for any given ``rcu_node`` structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different ``rcu_node`` structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.

The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions::

    1 int x, y, z;
    2
    3 void task0(void)
    4 {
    5   raw_spin_lock_rcu_node(rnp);
    6   WRITE_ONCE(x, 1);
    7   r1 = READ_ONCE(y);
    8   raw_spin_unlock_rcu_node(rnp);
    9 }
   10
   11 void task1(void)
   12 {
   13   raw_spin_lock_rcu_node(rnp);
   14   WRITE_ONCE(y, 1);
   15   r2 = READ_ONCE(z);
   16   raw_spin_unlock_rcu_node(rnp);
   17 }
   18
   19 void task2(void)
   20 {
   21   WRITE_ONCE(z, 1);
   22   smp_mb();
   23   r3 = READ_ONCE(x);
   24 }
   25
   26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);

The ``WARN_ON()`` is evaluated at "the end of time",
after all changes have propagated throughout the system.
Without the ``smp_mb__after_unlock_lock()`` provided by the
acquisition functions, this ``WARN_ON()`` could trigger, for example,
on PowerPC.
The ``smp_mb__after_unlock_lock()`` invocations prevent this
``WARN_ON()`` from triggering.

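Note that ``smp_mb__after_unlock_lock()`` is defined to be a no-op on
architectures whose unlock-lock pairs already provide full ordering; at
the time of this writing, only PowerPC overrides it with a full barrier.
A sketch of the generic default:

::

    /* Sketch: generic fallback; PowerPC overrides this with smp_mb(). */
    #ifndef smp_mb__after_unlock_lock
    #define smp_mb__after_unlock_lock()     do { } while (0)
    #endif
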
This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
``atomic_add_return()`` read-modify-write atomic operation that
is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
The grace-period kthread invokes ``rcu_dynticks_snap()`` and
``rcu_dynticks_in_eqs_since()`` (both of which invoke
an ``atomic_add_return()`` of zero) to detect idle CPUs.

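A heavily simplified sketch of the idle-entry update, assuming a per-CPU
``->dynticks`` counter along the lines described above (the real
``rcu_dynticks_eqs_enter()`` performs additional bookkeeping):

::

    /* Simplified sketch of idle entry; not the exact mainline code. */
    static void example_eqs_enter(void)
    {
            struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

            /*
             * The value-returning atomic is fully ordered, so accesses
             * in prior RCU read-side critical sections are seen before
             * the counter indicates that this CPU is idle.
             */
            atomic_add_return(1, &rdp->dynticks);
    }
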
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about CPUs that remain offline for the entire grace period?  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Such CPUs will be offline at the beginning of the grace period, so    |
| the grace period won't expect quiescent states from them. Races       |
| between grace-period start and CPU-hotplug operations are mediated    |
| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described    |
| above.                                                                |
+-----------------------------------------------------------------------+

The approach must be extended to handle one final case, that of waking a
task blocked in ``synchronize_rcu()``. This task might be affinitied to
a CPU that is not yet aware that the grace period has ended, and thus
might not yet be subject to the grace period's memory ordering.
Therefore, there is an ``smp_mb()`` after the return from
``wait_for_completion()`` in the ``synchronize_rcu()`` code path.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What? Where??? I don't see any ``smp_mb()`` after the return from     |
| ``wait_for_completion()``!!!                                          |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| That would be because I spotted the need for that ``smp_mb()`` during |
| the creation of this documentation, and it is therefore unlikely to   |
| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter     |
| Zijlstra, and Jonathan Cameron for asking questions that sensitized   |
| me to the rather elaborate sequence of events that demonstrate the    |
| need for this memory barrier.                                         |
+-----------------------------------------------------------------------+

Tree RCU's grace-period memory-ordering guarantees rely most heavily on
the ``rcu_node`` structure's ``->lock`` field, so much so that it is
necessary to abbreviate this pattern in the diagrams in the next
section. For example, consider the ``rcu_prepare_for_idle()`` function
shown below, which is one of several functions that enforce ordering of
newly arrived RCU callbacks against future grace periods:

::

    1 static void rcu_prepare_for_idle(void)
    2 {
    3   bool needwake;
    4   struct rcu_data *rdp;
    5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
    6   struct rcu_node *rnp;
    7   struct rcu_state *rsp;
    8   int tne;
    9
   10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
   11       rcu_is_nocb_cpu(smp_processor_id()))
   12     return;
   13   tne = READ_ONCE(tick_nohz_active);
   14   if (tne != rdtp->tick_nohz_enabled_snap) {
   15     if (rcu_cpu_has_callbacks(NULL))
   16       invoke_rcu_core();
   17     rdtp->tick_nohz_enabled_snap = tne;
   18     return;
   19   }
   20   if (!tne)
   21     return;
   22   if (rdtp->all_lazy &&
   23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
   24     rdtp->all_lazy = false;
   25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
   26     invoke_rcu_core();
   27     return;
   28   }
   29   if (rdtp->last_accelerate == jiffies)
   30     return;
   31   rdtp->last_accelerate = jiffies;
   32   for_each_rcu_flavor(rsp) {
   33     rdp = this_cpu_ptr(rsp->rda);
   34     if (!rcu_segcblist_pend_cbs(&rdp->cblist))
   35       continue;
   36     rnp = rdp->mynode;
   37     raw_spin_lock_rcu_node(rnp);
   38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
   39     raw_spin_unlock_rcu_node(rnp);
   40     if (needwake)
   41       rcu_gp_kthread_wake(rsp);
   42   }
   43 }

But the only part of ``rcu_prepare_for_idle()`` that really matters for
this discussion is lines 37–39. We will therefore abbreviate this
function as follows:

.. kernel-figure:: rcu_node-lock.svg

The box represents the ``rcu_node`` structure's ``->lock`` critical
section, with the double line on top representing the additional
``smp_mb__after_unlock_lock()``.

Tree RCU Grace Period Memory Ordering Components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tree RCU's grace-period memory-ordering guarantee is provided by a
number of RCU components:

#. `Callback Registry`_
#. `Grace-Period Initialization`_
#. `Self-Reported Quiescent States`_
#. `Dynamic Tick Interface`_
#. `CPU-Hotplug Interface`_
#. `Forcing Quiescent States`_
#. `Grace-Period Cleanup`_
#. `Callback Invocation`_

Each of the following sections looks at the corresponding component in
detail.

Callback Registry
^^^^^^^^^^^^^^^^^

If RCU's grace-period guarantee is to mean anything at all, any access
that happens before a given invocation of ``call_rcu()`` must also
happen before the corresponding grace period. The implementation of this
portion of RCU's grace-period guarantee is shown in the following
figure:

.. kernel-figure:: TreeRCU-callback-registry.svg

Because ``call_rcu()`` normally acts only on CPU-local state, it
provides no ordering guarantees, either for itself or for phase one of
the update (which again will usually be removal of an element from an
RCU-protected data structure). It simply enqueues the ``rcu_head``
structure on a per-CPU list, which cannot become associated with a grace
period until a later call to ``rcu_accelerate_cbs()``, as shown in the
diagram above.

One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
current CPU is inundated with queued ``rcu_head`` structures) or more
likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
final code path on the right is taken only in kernels built with
``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
``rcu_advance_cbs()``, ``rcu_migrate_callbacks()``,
``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
is invoked on a surviving CPU after the outgoing CPU has been completely
offlined.

There are a few other code paths within grace-period processing that
opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
all of the CPU's recently queued ``rcu_head`` structures are associated
with a future grace-period number under the protection of the CPU's leaf
``rcu_node`` structure's ``->lock``. In all cases, there is full
ordering against any prior critical section for that same ``rcu_node``
structure's ``->lock``, and also full ordering against any of the
current task's or CPU's prior critical sections for any ``rcu_node``
structure's ``->lock``.

The next section will show how this ordering ensures that any accesses
prior to the ``call_rcu()`` (particularly including phase one of the
update) happen before the start of the corresponding grace period.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about ``synchronize_rcu()``?                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
| which invokes it. So either way, it eventually comes down to          |
| ``call_rcu()``.                                                       |
+-----------------------------------------------------------------------+

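To see why, here is a sketch of that machinery, closely following
``wait_rcu_gp()`` and its ``wakeme_after_rcu()`` callback (fast paths
such as expedited grace periods are omitted):

::

    /* Sketch of the machinery behind synchronize_rcu(). */
    struct rcu_synchronize {
            struct rcu_head head;
            struct completion completion;
    };

    /* RCU callback: runs once a full grace period has elapsed. */
    static void wakeme_after_rcu(struct rcu_head *head)
    {
            struct rcu_synchronize *rs =
                    container_of(head, struct rcu_synchronize, head);

            complete(&rs->completion);
    }

    static void example_synchronize_rcu(void)
    {
            struct rcu_synchronize rs;

            init_completion(&rs.completion);
            call_rcu(&rs.head, wakeme_after_rcu); /* Register callback. */
            wait_for_completion(&rs.completion);  /* Block until it runs. */
            smp_mb(); /* The barrier discussed earlier in this document. */
    }
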
Grace-Period Initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Grace-period initialization is carried out by the grace-period kernel
thread, which makes several passes over the ``rcu_node`` tree within the
``rcu_gp_init()`` function. This means that showing the full flow of
ordering through the grace-period computation will require duplicating
this tree. If you find this confusing, please note that the state of the
``rcu_node`` structures changes over time, just like Heraclitus's river.
However, to keep the ``rcu_node`` river tractable, the grace-period
kernel thread's traversals are presented in multiple parts, starting in
this section with the various phases of grace-period initialization.

The first ordering-related grace-period initialization action is to
advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
counter, as shown below:

.. kernel-figure:: TreeRCU-gp-init-1.svg

The actual increment is carried out using ``smp_store_release()``, which
helps reject false-positive RCU CPU stall detection. Note that only the
root ``rcu_node`` structure is touched.

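Based on this description, the update amounts to something like the
following sketch (the mainline code wraps this in the ``rcu_seq``
helpers, whose details differ slightly):

::

    /* Sketch: publish the new grace-period number with release semantics. */
    smp_store_release(&rcu_state.gp_seq, rcu_state.gp_seq + 1);
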
The first pass through the ``rcu_node`` tree updates bitmasks based on
CPUs having come online or gone offline since the start of the previous
grace period. In the common case where the number of online CPUs for
this ``rcu_node`` structure has not transitioned to or from zero, this
pass will scan only the leaf ``rcu_node`` structures. However, if the
number of online CPUs for a given leaf ``rcu_node`` structure has
transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
first incoming CPU. Similarly, if the number of online CPUs for a given
leaf ``rcu_node`` structure has transitioned to zero,
``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
``rcu_node`` structure onlines its first CPU and if the next
``rcu_node`` structure has no online CPUs (or, alternatively, if the
leftmost ``rcu_node`` structure offlines its last CPU and if the next
``rcu_node`` structure has no online CPUs).

.. kernel-figure:: TreeRCU-gp-init-2.svg

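A simplified sketch of the processing for a first incoming CPU, loosely
following ``rcu_init_new_rnp()``, propagates the leaf's bit up toward
the root:

::

    /* Sketch: mark a newly populated leaf in its ancestors' bitmasks. */
    static void example_init_new_rnp(struct rcu_node *rnp_leaf)
    {
            unsigned long mask;
            unsigned long oldmask;
            struct rcu_node *rnp = rnp_leaf;

            for (;;) {
                    mask = rnp->grpmask; /* Our bit in the parent's mask. */
                    rnp = rnp->parent;
                    if (!rnp)
                            return; /* Reached the root. */
                    raw_spin_lock_rcu_node(rnp); /* Full ordering. */
                    oldmask = rnp->qsmaskinit;
                    rnp->qsmaskinit |= mask;
                    raw_spin_unlock_rcu_node(rnp);
                    if (oldmask)
                            return; /* Ancestors already marked. */
            }
    }
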
The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
to the newly advanced value from the ``rcu_state`` structure, as shown
in the following diagram.

.. kernel-figure:: TreeRCU-gp-init-3.svg

This change will also cause each CPU's next call to
``__note_gp_changes()`` to notice that a new grace period has started,
as described in the next section. But because the grace-period kthread
started the grace period at the root (with the advancing of the
``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
the start of the grace period will happen after the actual start of the
grace period.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about the CPU that started the grace period? Why wouldn't it |
| see the start of the grace period right when it started that grace    |
| period?                                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In some deep philosophical and overly anthropomorphized sense, yes,   |
| the CPU starting the grace period is immediately aware of having done |
| so. However, if we instead assume that RCU is not self-aware, then    |
| even the CPU starting the grace period does not really become aware   |
| of the start of this grace period until its first call to             |
| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
| early notification because it invokes ``__note_gp_changes()`` during  |
| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         |
| structure.                                                            |
+-----------------------------------------------------------------------+

Self-Reported Quiescent States
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When all entities that might block the grace period have reported
quiescent states (or as described in a later section, had quiescent
states reported on their behalf), the grace period can end. Online
non-idle CPUs report their own quiescent states, as shown in the
following diagram:

.. kernel-figure:: TreeRCU-qs.svg

This is for the last CPU to report a quiescent state, which signals the
end of the grace period. Earlier quiescent states would push up the
``rcu_node`` tree only until they encountered an ``rcu_node`` structure
that is waiting for additional quiescent states. However, ordering is
nevertheless preserved because some later quiescent state will acquire
that ``rcu_node`` structure's ``->lock``.

Any number of events can lead up to a CPU invoking ``note_gp_changes()``
(or alternatively, directly invoking ``__note_gp_changes()``), at which
point that CPU will notice the start of a new grace period while holding
its leaf ``rcu_node`` lock. Therefore, all execution shown in this
diagram happens after the start of the grace period. In addition, this
CPU will consider any RCU read-side critical section that started before
the invocation of ``__note_gp_changes()`` to have started before the
grace period, and thus a critical section that the grace period must
wait on.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But an RCU read-side critical section might have started after the    |
| beginning of the grace period (the advancing of ``->gp_seq`` from     |
| earlier), so why should the grace period wait on such a critical      |
| section?                                                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| It is indeed not necessary for the grace period to wait on such a     |
| critical section. However, it is permissible to wait on it. And it is |
| furthermore important to wait on it, as this lazy approach is far     |
| more scalable than a “big bang” all-at-once grace-period start could  |
| possibly be.                                                          |
+-----------------------------------------------------------------------+

If the CPU does a context switch, a quiescent state will be noted by
``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
takes a scheduler-clock interrupt while executing in usermode, a
quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
Either way, the passage through a quiescent state will be noted in a
per-CPU variable.

The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
example, after the next scheduler-clock interrupt), ``rcu_core()`` will
invoke ``rcu_check_quiescent_state()``, which will notice the recorded
quiescent state, and invoke ``rcu_report_qs_rdp()``. If
``rcu_report_qs_rdp()`` verifies that the quiescent state really does
apply to the current grace period, it invokes ``rcu_report_qs_rnp()``,
which traverses up the ``rcu_node`` tree as shown at the bottom of the
diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
field, and propagating up the tree when the result is zero.

Note that traversal passes upwards out of a given ``rcu_node`` structure
only if the current CPU is reporting the last quiescent state for the
subtree headed by that ``rcu_node`` structure. A key point is that if a
CPU's traversal stops at a given ``rcu_node`` structure, then there will
be a later traversal by another CPU (or perhaps the same one) that
proceeds upwards from that point, and the ``rcu_node`` ``->lock``
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal. Applying this line of thought
repeatedly shows that all CPUs' quiescent states happen before the last
CPU traverses through the root ``rcu_node`` structure, the “last CPU”
being the one that clears the last bit in the root ``rcu_node``
structure's ``->qsmask`` field.

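A simplified sketch of this upward traversal, loosely following
``rcu_report_qs_rnp()`` and assuming entry with the leaf ``rcu_node``
structure's ``->lock`` held and interrupts disabled:

::

    /* Sketch: report a quiescent state upward until blocked or at root. */
    static void example_report_qs_rnp(unsigned long mask,
                                      struct rcu_node *rnp,
                                      unsigned long flags)
    {
            for (;;) {
                    rnp->qsmask &= ~mask; /* This QS is now accounted for. */
                    if (rnp->qsmask) {
                            /* Others still pending: stop here. */
                            raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                            return;
                    }
                    mask = rnp->grpmask; /* Our bit in the parent's mask. */
                    if (!rnp->parent)
                            break; /* Root is now clear: GP can end. */
                    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                    rnp = rnp->parent;
                    raw_spin_lock_irqsave_rcu_node(rnp, flags);
            }
            raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
    }
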
Dynamic Tick Interface
^^^^^^^^^^^^^^^^^^^^^^

Due to energy-efficiency considerations, RCU is forbidden from
disturbing idle CPUs. CPUs are therefore required to notify RCU when
entering or leaving idle state, which they do via fully ordered
value-returning atomic operations on a per-CPU variable. The ordering
effects are as shown below:

.. kernel-figure:: TreeRCU-dyntick.svg

The RCU grace-period kernel thread samples the per-CPU idleness variable
while holding the corresponding CPU's leaf ``rcu_node`` structure's
``->lock``. This means that any RCU read-side critical sections that
precede the idle period (the oval near the top of the diagram above)
will happen before the end of the current grace period. Similarly, the
beginning of the current grace period will happen before any RCU
read-side critical sections that follow the idle period (the oval near
the bottom of the diagram above).

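A sketch of the sampling side, loosely following ``rcu_dynticks_snap()``
and ``rcu_dynticks_in_eqs_since()``:

::

    /* Sketch: sample a CPU's idleness counter with full ordering. */
    static int example_dynticks_snap(struct rcu_data *rdp)
    {
            return atomic_add_return(0, &rdp->dynticks);
    }

    /* Sketch: has this CPU passed through idle since the snapshot? */
    static bool example_in_eqs_since(struct rcu_data *rdp, int snap)
    {
            return snap != example_dynticks_snap(rdp);
    }
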
Plumbing this into the full grace-period execution is described below in
`Forcing Quiescent States`_.

CPU-Hotplug Interface
^^^^^^^^^^^^^^^^^^^^^

RCU is also forbidden from disturbing offline CPUs, which might well be
powered off and removed from the system completely. CPUs are therefore
required to notify RCU of their comings and goings as part of the
corresponding CPU hotplug operations. The ordering effects are shown
below:

.. kernel-figure:: TreeRCU-hotplug.svg

Because CPU hotplug operations are much less frequent than idle
transitions, they are heavier weight, and thus acquire the CPU's leaf
``rcu_node`` structure's ``->lock`` and update this structure's
``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
mask to detect CPUs having gone offline since the beginning of this
grace period.

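A sketch of the online-notification side, loosely following
``rcu_cpu_starting()``:

::

    /* Sketch: an incoming CPU marks itself in its leaf rcu_node. */
    mask = rdp->grpmask;
    raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Full ordering. */
    WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
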
Plumbing this into the full grace-period execution is described below in
`Forcing Quiescent States`_.

Forcing Quiescent States
^^^^^^^^^^^^^^^^^^^^^^^^

As noted above, idle and offline CPUs cannot report their own quiescent
states, and therefore the grace-period kernel thread must do the
reporting on their behalf. This process is called “forcing quiescent
states”, it is repeated every few jiffies, and its ordering effects are
shown below:

.. kernel-figure:: TreeRCU-gp-fqs.svg

Each pass of quiescent state forcing is guaranteed to traverse the leaf
``rcu_node`` structures, and if there are no new quiescent states due to
recently idled and/or offlined CPUs, then only the leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left or
a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root. As with
self-reported quiescent states, the upwards driving stops once it
reaches an ``rcu_node`` structure that has quiescent states outstanding
from other CPUs.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| The leftmost drive to root stopped before it reached the root         |
| ``rcu_node`` structure, which means that there are still CPUs         |
| subordinate to that structure on which the current grace period is    |
| waiting. Given that, how is it possible that the rightmost drive to   |
| root ended the grace period?                                          |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Good analysis! It is in fact impossible in the absence of bugs in     |
| RCU. But this diagram is complex enough as it is, so simplicity       |
| overrode accuracy. You can think of it as poetic license, or you can  |
| think of it as misdirection that is resolved in the stitched-together |
| diagram in `Putting It All Together`_.                                |
+-----------------------------------------------------------------------+

Grace-Period Cleanup
^^^^^^^^^^^^^^^^^^^^

Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
advancing all the ``->gp_seq`` fields, then it advances the
``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
shown below:

.. kernel-figure:: TreeRCU-gp-cleanup.svg

As indicated by the oval at the bottom of the diagram, once grace-period
cleanup is complete, the next grace period can begin.

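A simplified sketch of this cleanup pass, loosely following
``rcu_gp_cleanup()`` (the real code computes the new value via the
``rcu_seq`` helpers, whose details differ):

::

    /* Sketch: propagate the grace period's end, then publish it. */
    new_gp_seq = rcu_state.gp_seq + 1; /* Simplified computation. */
    rcu_for_each_node_breadth_first(rnp) {
            raw_spin_lock_irq_rcu_node(rnp); /* Full ordering. */
            WRITE_ONCE(rnp->gp_seq, new_gp_seq);
            raw_spin_unlock_irq_rcu_node(rnp);
    }
    smp_store_release(&rcu_state.gp_seq, new_gp_seq); /* GP is over. */
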
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But when precisely does the grace period end?                         |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| There is no useful single point at which the grace period can be said |
| to end. The earliest reasonable candidate is as soon as the last CPU  |
| has reported its quiescent state, but it may be some milliseconds     |
| before RCU becomes aware of this. The latest reasonable candidate is  |
| once the ``rcu_state`` structure's ``->gp_seq`` field has been        |
| updated, but it is quite possible that some CPUs have already         |
| completed phase two of their updates by that time. In short, if you   |
| are going to work with RCU, you need to learn to embrace uncertainty. |
+-----------------------------------------------------------------------+

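One way for updaters to cope with this uncertainty is RCU's
polling-style API, which treats grace-period state as an opaque cookie;
for example:

::

    unsigned long cookie;

    cookie = get_state_synchronize_rcu(); /* Snapshot GP state. */
    /* ... other work, during which a grace period may elapse ... */
    cond_synchronize_rcu(cookie); /* Waits only if needed. */
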
Callback Invocation
^^^^^^^^^^^^^^^^^^^

Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
been updated, that CPU can begin invoking its RCU callbacks that were
waiting for this grace period to end. These callbacks are identified by
``rcu_advance_cbs()``, which is usually invoked by
``__note_gp_changes()``. As shown in the diagram below, this invocation
can be triggered by the scheduling-clock interrupt
(``rcu_sched_clock_irq()`` on the left) or by idle entry
(``rcu_cleanup_after_idle()`` on the right, but only for kernels built
with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
raised, which results in ``rcu_do_batch()`` invoking the callbacks,
which in turn allows those callbacks to carry out (either directly or
indirectly via wakeup) the needed phase-two processing for each update.

.. kernel-figure:: TreeRCU-callback-invocation.svg

Please note that callback invocation can also be prompted by any number
of corner-case code paths, for example, when a CPU notes that it has
excessive numbers of callbacks queued. In all cases, the CPU acquires
its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
which preserves the required ordering against the newly completed grace
period.

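For example, here is the asynchronous counterpart of the earlier
``synchronize_rcu()`` sketch, again assuming a hypothetical ``foo``
structure, with phase two carried out by a minimal callback:

::

    struct foo {
            struct list_head list;
            struct rcu_head rh;
            int data;
    };

    /* Phase two: runs out of rcu_do_batch() after the grace period. */
    static void foo_reclaim(struct rcu_head *rhp)
    {
            struct foo *fp = container_of(rhp, struct foo, rh);

            kfree(fp);
    }

    static void foo_remove(struct foo *fp)
    {
            list_del_rcu(&fp->list); /* Phase one: removal. */
            call_rcu(&fp->rh, foo_reclaim); /* Schedule phase two. */
    }
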
However, if the callback function communicates to other CPUs, for
example, doing a wakeup, then it is that function's responsibility to
maintain ordering. For example, if the callback function wakes up a task
that runs on some other CPU, proper ordering must be in place in both
the callback function and the task being awakened. To see why this is
important, consider the top half of the `Grace-Period Cleanup`_ diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf ``rcu_node`` structure, and awaken a task that is to run on a CPU
corresponding to the rightmost leaf ``rcu_node`` structure, and the
grace-period kernel thread might not yet have reached the rightmost
leaf. In this case, the grace period's memory ordering might not yet
have reached that CPU, so again the callback function and the awakened
task must supply proper ordering.

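A sketch of one way to supply that ordering, using a hypothetical
``done`` flag whose release store in the callback pairs with an acquire
load in the awakened task (``example_wq`` is likewise hypothetical):

::

    static int done; /* Hypothetical completion flag. */
    static DECLARE_WAIT_QUEUE_HEAD(example_wq);

    /* Callback side: publish completion with release semantics. */
    static void example_wakeup_cb(struct rcu_head *rhp)
    {
            smp_store_release(&done, 1); /* Pairs with acquire below. */
            wake_up(&example_wq);
    }

    /* Awakened-task side: the acquire load completes the pairing. */
    static void example_waiter(void)
    {
            wait_event(example_wq, smp_load_acquire(&done));
            /* Grace-period effects are now visible on this CPU. */
    }
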
Putting It All Together
~~~~~~~~~~~~~~~~~~~~~~~

A stitched-together diagram is here:

.. kernel-figure:: TreeRCU-gp.svg

Legal Statement
~~~~~~~~~~~~~~~

This work represents the view of the author and does not necessarily
represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service
marks of others.