=================================================
A Tour Through TREE_RCU's Expedited Grace Periods
=================================================

Introduction
============

This document describes RCU's expedited grace periods.
Unlike RCU's normal grace periods, which accept long latencies to attain
high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
Each of the two implementations is covered in its own section.

Expedited Grace Period Design
=============================

The expedited RCU grace periods cannot be accused of being subtle,
given that they for all intents and purposes hammer every CPU that
has not yet provided a quiescent state for the current expedited
grace period.
The one saving grace is that the hammer has grown a bit smaller
over time:  The old call to ``try_stop_cpus()`` has been
replaced with a set of calls to ``smp_call_function_single()``,
each of which results in an IPI to the target CPU.
The corresponding handler function checks the CPU's state, motivating
a faster quiescent state where possible, and triggering a report
of that quiescent state.
As always for RCU, once everything has spent some time in a quiescent
state, the expedited grace period has completed.
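
The overall shape of this hammering can be sketched as the following
loop, where ``cpu_has_reported_qs()`` is a hypothetical helper standing
in for the real ``rcu_node`` bitmask bookkeeping::

   /*
    * Minimal sketch of the expedited "hammer": IPI every CPU that has
    * not yet reported a quiescent state for this expedited grace period.
    */
   for_each_online_cpu(cpu) {
           if (cpu_has_reported_qs(cpu))   /* hypothetical helper */
                   continue;               /* Already quiescent. */
           smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
   }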

The details of the ``smp_call_function_single()`` handler's
operation depend on the RCU flavor, as described in the following
sections.

RCU-preempt Expedited Grace Periods
===================================

``CONFIG_PREEMPT=y`` kernels implement RCU-preempt.
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:

.. kernel-figure:: ExpRCUFlow.svg

The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI
or a state that is reached after some time.

If a given CPU is offline or idle, ``synchronize_rcu_expedited()``
will ignore it because idle and offline CPUs are already residing
in quiescent states.
Otherwise, the expedited grace period will use
``smp_call_function_single()`` to send the CPU an IPI, which
is handled by ``rcu_exp_handler()``.

However, because this is preemptible RCU, ``rcu_exp_handler()``
can check to see if the CPU is currently running in an RCU read-side
critical section.
If not, the handler can immediately report a quiescent state.
Otherwise, it sets flags so that the outermost ``rcu_read_unlock()``
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.
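
A greatly simplified sketch of this handler, eliding the real code's
handling of partially completed ``rcu_read_unlock()`` processing and
other corner cases, might look as follows (field and helper names are
approximations rather than the exact implementation)::

   static void rcu_exp_handler(void *unused)
   {
           struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
           struct task_struct *t = current;

           if (!rcu_preempt_depth()) {
                   /* Not in a read-side critical section: report now. */
                   rcu_report_exp_rdp(rdp);
           } else {
                   /* Defer: the outermost rcu_read_unlock() will report. */
                   t->rcu_read_unlock_special.b.exp_hint = true;
           }
   }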

Again because this is preemptible RCU, an RCU read-side critical section
can be preempted.
When that happens, RCU will enqueue the task, which will then continue to
block the current expedited grace period until it resumes and finds its
outermost ``rcu_read_unlock()``.
The CPU will report a quiescent state just after enqueuing the task because
the CPU is no longer blocking the grace period.
It is instead the preempted task doing the blocking.
The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
which is called from ``rcu_preempt_note_context_switch()``, which
in turn is called from ``rcu_note_context_switch()``, which in
turn is called from the scheduler.


+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not just have the expedited grace period check the state of all   |
| the CPUs? After all, that would avoid all those real-time-unfriendly  |
| IPIs.                                                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we want the RCU read-side critical sections to run fast,      |
| which means no memory barriers. Therefore, it is not possible to      |
| safely check the state from some other CPU. And even if it were       |
| possible to safely check the state, it would still be necessary to    |
| IPI the CPU to safely interact with the upcoming                      |
| ``rcu_read_unlock()`` invocation, which means that the remote state   |
| testing would not help the worst-case latency that real-time          |
| applications care about.                                              |
|                                                                       |
| One way to prevent your real-time application from getting hit with   |
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  |
| would then perceive the CPU running your application as being idle,   |
| and it would be able to safely detect that state without needing to   |
| IPI the CPU.                                                          |
+-----------------------------------------------------------------------+

Please note that this is just the overall flow: Additional complications
can arise due to races with CPUs going idle or offline, among other
things.

RCU-sched Expedited Grace Periods
---------------------------------

``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of
the handling of a given CPU by an RCU-sched expedited grace period is
shown in the following diagram:

.. kernel-figure:: ExpSchedFlow.svg

As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
offline and idle CPUs, again because they are in remotely detectable
quiescent states. However, because ``rcu_read_lock_sched()`` and
``rcu_read_unlock_sched()`` leave no trace of their invocation, in
general it is not possible to tell whether or not the current CPU is in
an RCU read-side critical section. The best that RCU-sched's
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
that the CPU went idle while the IPI was in flight. If the CPU is idle,
then ``rcu_exp_handler()`` reports the quiescent state.

Otherwise, the handler forces a future context switch by setting the
NEED_RESCHED flag in the current task's thread info and the need-resched
bit in the CPU's preempt counter. At the time of the context switch, the
CPU reports the quiescent state. Should the CPU go offline first, it
will report the quiescent state at that time.
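
A simplified sketch of this ``CONFIG_PREEMPT=n`` handler follows; the
idle check is shown as a hypothetical helper, and the flag recording
"report at the next context switch" is an approximation of the real
bookkeeping::

   static void rcu_exp_handler(void *unused)
   {
           struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

           if (interrupted_idle_loop()) {          /* hypothetical helper */
                   rcu_report_exp_rdp(rdp);        /* Idle is a quiescent state. */
                   return;
           }
           rdp->cpu_no_qs.b.exp = true;            /* Report at next context switch. */
           set_tsk_need_resched(current);          /* Ask for that context switch... */
           set_preempt_need_resched();             /* ...as soon as possible. */
   }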

Expedited Grace Period and CPU Hotplug
--------------------------------------

The expedited nature of expedited grace periods requires a much tighter
interaction with CPU hotplug operations than is required for normal
grace periods. In addition, attempting to IPI offline CPUs will result
in splats, but failing to IPI online CPUs can result in too-short grace
periods. Neither option is acceptable in production kernels.

The interaction between expedited grace periods and CPU hotplug
operations is carried out at several levels:

#. The number of CPUs that have ever been online is tracked by the
   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
   structure's ``->ncpus_snap`` field tracks the number of CPUs that
   have ever been online at the beginning of an RCU expedited grace
   period. Note that this number never decreases, at least in the
   absence of a time machine.
#. The identities of the CPUs that have ever been online are tracked by
   the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
   identities of the CPUs that were online at least once at the
   beginning of the most recent RCU expedited grace period. The
   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
   used to detect when new CPUs have come online for the first time,
   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
   field has changed since the beginning of the last RCU expedited grace
   period, which triggers an update of each ``rcu_node`` structure's
   ``->expmaskinit`` field from its ``->expmaskinitnext`` field (see the
   sketch following this list).
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
   initialize that structure's ``->expmask`` at the beginning of each
   RCU expedited grace period. This means that only those CPUs that have
   been online at least once will be considered for a given grace
   period.
#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
   clear can safely be ignored. However, it is possible for a CPU coming
   online or going offline to have this bit set for some time while
   ``cpu_online`` returns ``false``.
#. For each non-idle CPU that RCU believes is currently online, the
   grace period invokes ``smp_call_function_single()``. If this
   succeeds, the CPU was fully online. Failure indicates that the CPU is
   in the process of coming online or going offline, in which case it is
   necessary to wait for a short time period and try again. The purpose
   of this wait (or series of waits, as the case may be) is to permit a
   concurrent CPU-hotplug operation to complete.
#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
   to invoke ``rcu_report_dead()``, which reports a quiescent state for
   that CPU. However, this is likely paranoia-induced redundancy.
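
As a rough illustration of the first two items in the above list, the
first-time-online check at the beginning of an expedited grace period
can be sketched as follows, with locking omitted and with the upward
propagation of newly set bits through the combining tree left out::

   /* Sketch only: the real code also holds the relevant locks and
    * propagates any newly set ->expmaskinitnext bits up the tree. */
   if (READ_ONCE(rcu_state.ncpus) != rcu_state.ncpus_snap) {
           rcu_state.ncpus_snap = READ_ONCE(rcu_state.ncpus);
           rcu_for_each_leaf_node(rnp)
                   rnp->expmaskinit = rnp->expmaskinitnext;
   }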

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why all the dancing around with multiple counters and masks tracking  |
| CPUs that were once online? Why not just have a single set of masks   |
| tracking the currently online CPUs and be done with it?               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Maintaining a single set of masks tracking the online CPUs *sounds*   |
| easier, at least until you try working out all the race conditions    |
| between grace-period initialization and CPU-hotplug operations. For   |
| example, suppose initialization is progressing down the tree while a  |
| CPU-offline operation is progressing up the tree. This situation can  |
| result in bits set at the top of the tree that have no counterparts   |
| at the bottom of the tree. Those bits will never be cleared, which    |
| will result in grace-period hangs. In short, that way lies madness,   |
| to say nothing of a great many bugs, hangs, and deadlocks.            |
| In contrast, the current multi-mask multi-counter scheme ensures that |
| grace-period initialization will always see consistent masks up and   |
| down the tree, which brings significant simplifications over the      |
| single-mask method.                                                   |
|                                                                       |
| This is an instance of `deferring work in order to avoid              |
| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
| ports/reports-1992/cucs-039-92.ps.gz>`__.                             |
| Lazily recording CPU-hotplug events at the beginning of the next      |
| grace period greatly simplifies maintenance of the CPU-tracking       |
| bitmasks in the ``rcu_node`` tree.                                    |
+-----------------------------------------------------------------------+

Expedited Grace Period Refinements
----------------------------------

Idle-CPU Checks
~~~~~~~~~~~~~~~

Each expedited grace period checks for idle CPUs when initially forming
the mask of CPUs to be IPIed and again just before IPIing a CPU (both
checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
idle at any time between those two times, the CPU will not be IPIed.
Instead, the task pushing the grace period forward will include the idle
CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.

For RCU-sched, there is an additional check: If the IPI has interrupted
the idle loop, then ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state.

For RCU-preempt, there is no specific check for idle in the IPI handler
(``rcu_exp_handler()``), but because RCU read-side critical sections are
not permitted within the idle loop, if ``rcu_exp_handler()`` sees that
the CPU is within an RCU read-side critical section, the CPU cannot
possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
regardless of whether or not that quiescent state was due to the CPU
being idle.

In summary, RCU expedited grace periods check for idle when building the
bitmask of CPUs that must be IPIed, just before sending each IPI, and
(either explicitly or implicitly) within the IPI handler.

Batching via Sequence Counter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If each grace-period request were carried out separately, expedited grace
periods would have abysmal scalability and problematic high-load
characteristics. Because each grace-period operation can serve an
unlimited number of updates, it is important to *batch* requests, so
that a single expedited grace-period operation will cover all requests
in the corresponding batch.

This batching is controlled by a sequence counter named
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
has an odd value when there is an expedited grace period in progress and
an even value otherwise, so that dividing the counter value by two gives
the number of completed grace periods. During any given update request,
the counter must transition from even to odd and then back to even, thus
indicating that a grace period has elapsed. Therefore, if the initial
value of the counter is ``s``, the updater must wait until the counter
reaches at least the value ``(s+3)&~0x1``. This counter is managed by
the following access functions (a simplified sketch follows the list):

#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
   grace period.
#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
   period.
#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
   grace period has elapsed since the corresponding call to
   ``rcu_exp_gp_seq_snap()``.
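
The even/odd encoding and the ``(s+3)&~0x1`` computation can be
illustrated by the following simplified sketch; the helper names mirror
those above, but the bodies omit memory ordering and other details::

   /* Stand-in for the rcu_state structure's ->expedited_sequence field. */
   static unsigned long expedited_sequence;

   static unsigned long rcu_exp_gp_seq_snap(void)
   {
           /*
            * If a grace period is already in progress (odd value), waiting
            * for it is not sufficient because it might have started before
            * our updates.  (s + 3) & ~0x1 is the smallest even value that
            * guarantees one full grace period starts and ends after now.
            */
           return (READ_ONCE(expedited_sequence) + 3) & ~0x1UL;
   }

   static bool rcu_exp_gp_seq_done(unsigned long s)
   {
           /* ULONG_CMP_GE() correctly handles counter wrap. */
           return ULONG_CMP_GE(READ_ONCE(expedited_sequence), s);
   }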

Again, only one request in a given batch need actually carry out a
grace-period operation, which means there must be an efficient way to
identify which of many concurrent requests will initiate the grace
period, and that there be an efficient way for the remaining requests to
wait for that grace period to complete. However, that is the topic of
the next section.

Funnel Locking and Wait/Wakeup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The natural way to sort out which of a batch of updaters will initiate
the expedited grace period is to use the ``rcu_node`` combining tree, as
implemented by the ``exp_funnel_lock()`` function. The first updater
corresponding to a given grace period arriving at a given ``rcu_node``
structure records its desired grace-period sequence number in the
``->exp_seq_rq`` field and moves up to the next level in the tree.
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
number for the desired grace period or some later one, the updater
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
second-from-bottom and third-from-bottom bits as an index. An
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
to these fields.
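
A highly simplified sketch of this funnel-lock walk appears below.  The
real ``exp_funnel_lock()`` must also handle wakeups, tracing, and the
acquisition of the ``rcu_state`` structure's ``->exp_mutex`` at the
root, all of which are omitted here; ``s`` is the desired grace-period
sequence number obtained from ``rcu_exp_gp_seq_snap()``::

   /* Returns true if the caller must initiate the expedited grace period,
    * false if some other task is already taking care of it. */
   static bool exp_funnel_lock_sketch(struct rcu_data *rdp, unsigned long s)
   {
           struct rcu_node *rnp;

           for (rnp = rdp->mynode; rnp != NULL; rnp = rnp->parent) {
                   spin_lock(&rnp->exp_lock);
                   if (ULONG_CMP_GE(rnp->exp_seq_rq, s)) {
                           /* Already requested here: just wait for it. */
                           spin_unlock(&rnp->exp_lock);
                           wait_event(rnp->exp_wq[(s >> 1) & 0x3],
                                      rcu_exp_gp_seq_done(s));
                           return false;
                   }
                   rnp->exp_seq_rq = s;    /* Record request, move up the tree. */
                   spin_unlock(&rnp->exp_lock);
           }
           return true;    /* Reached the root: this task drives the grace period. */
   }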

An empty ``rcu_node`` tree is shown in the following diagram, with the
white cells representing the ``->exp_seq_rq`` field and the red cells
representing the elements of the ``->exp_wq[]`` array.

.. kernel-figure:: Funnel0.svg

The next diagram shows the situation after the arrival of Task A and
Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
respectively. The current value of the ``rcu_state`` structure's
``->expedited_sequence`` field is zero, so adding three and clearing the
bottom bit results in the value two, which both tasks record in the
``->exp_seq_rq`` field of their respective ``rcu_node`` structures:

.. kernel-figure:: Funnel1.svg

Each of Tasks A and B will move up to the root ``rcu_node`` structure.
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:

.. kernel-figure:: Funnel2.svg

Task A now advances to initiate a new grace period, while Task B moves
up to the root ``rcu_node`` structure, and, seeing that its desired
sequence number is already recorded, blocks on ``->exp_wq[1]``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
| sequence number is two, shouldn't they instead block on               |
| ``->exp_wq[2]``?                                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No.                                                                   |
| Recall that the bottom bit of the desired sequence number indicates   |
| whether or not a grace period is currently in progress. It is         |
| therefore necessary to shift the sequence number right one bit        |
| position to obtain the number of the grace period. This results in    |
| ``->exp_wq[1]``.                                                      |
+-----------------------------------------------------------------------+

If Tasks C and D also arrive at this point, they will compute the same
desired grace-period sequence number, and see that both leaf
``rcu_node`` structures already have that value recorded. They will
therefore block on their respective ``rcu_node`` structures'
``->exp_wq[1]`` fields, as shown below:

.. kernel-figure:: Funnel3.svg

Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
initiates the grace period, which increments ``->expedited_sequence``.
Therefore, if Tasks E and F arrive, they will compute a desired sequence
number of 4 and will record this value as shown below:

.. kernel-figure:: Funnel4.svg

Tasks E and F will propagate up the ``rcu_node`` combining tree, with
Task F blocking on the root ``rcu_node`` structure and Task E waiting
for Task A to finish so that it can start the next grace period. The
resulting state is as shown below:

.. kernel-figure:: Funnel5.svg

Once the grace period completes, Task A starts waking up the tasks
waiting for this grace period to complete, increments
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
releases the ``->exp_mutex``. This results in the following state:

.. kernel-figure:: Funnel6.svg

Task E can then acquire ``->exp_mutex`` and increment
``->expedited_sequence`` to the value three. If new tasks G and H arrive
and move up the combining tree at the same time, the state will be as
follows:

.. kernel-figure:: Funnel7.svg

Note that three of the root ``rcu_node`` structure's waitqueues are now
occupied. However, at some point, Task A will wake up the tasks blocked
on the ``->exp_wq`` waitqueues, resulting in the following state:

.. kernel-figure:: Funnel8.svg

Execution will continue with Tasks E and H completing their grace
periods and carrying out their wakeups.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What happens if Task A takes so long to do its wakeups that Task E's  |
| grace period completes?                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
| the next grace period from starting. This last is important in        |
| preventing overflow of the ``->exp_wq[]`` array.                      |
+-----------------------------------------------------------------------+

Use of Workqueues
~~~~~~~~~~~~~~~~~

In earlier implementations, the task requesting the expedited grace
period also drove it to completion. This straightforward approach had
the disadvantage of needing to account for POSIX signals sent to user
tasks, so more recent implementations use the Linux kernel's
`workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__.

The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock does a
``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a
workqueue kthread does the actual grace-period processing. Because
workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals. In addition, this approach
allows wakeups for the previous expedited grace period to be overlapped
with processing for the next expedited grace period. Because there are
only four sets of waitqueues, it is necessary to ensure that the
previous grace period's wakeups complete before the next grace period's
wakeups start. This is handled by having the ``->exp_mutex`` guard
expedited grace-period processing and the ``->exp_wake_mutex`` guard
wakeups. The key point is that the ``->exp_mutex`` is not released until
the first wakeup is complete, which means that the ``->exp_wake_mutex``
has already been acquired at that point. This approach ensures that the
previous grace period's wakeups can be carried out while the current
grace period is in process, but that these wakeups will complete before
the next grace period starts. This means that only three waitqueues are
required, guaranteeing that the four that are provided are sufficient.
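
The hand-off to a workqueue can be sketched roughly as follows; the
structure and function names here are illustrative rather than exact,
and error handling and mid-boot fallbacks are omitted::

   /* Illustrative names only; the real code lives in
    * synchronize_rcu_expedited() and its helper functions. */
   struct exp_gp_work {
           unsigned long s;                /* Snapshotted sequence number. */
           struct work_struct work;
   };

   static void exp_gp_workfn(struct work_struct *wp)
   {
           struct exp_gp_work *egw = container_of(wp, struct exp_gp_work, work);

           /* Acquire ->exp_mutex and drive the grace period for egw->s here. */
   }

   /* In the task that reached the top of the funnel lock: */
   struct exp_gp_work egw = { .s = s };

   INIT_WORK_ONSTACK(&egw.work, exp_gp_workfn);
   schedule_work(&egw.work);       /* A workqueue kthread drives the grace period. */
   wait_event(rnp_root->exp_wq[(s >> 1) & 0x3],    /* Wait for completion. */
              rcu_exp_gp_seq_done(s));
   destroy_work_on_stack(&egw.work);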

Stall Warnings
~~~~~~~~~~~~~~

Expediting grace periods does nothing to speed things up when RCU
readers take too long, and therefore expedited grace periods check for
stalls just as normal grace periods do.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But why not just let the normal grace-period machinery detect the     |
| stalls, given that a given reader must block both normal and          |
| expedited grace periods?                                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because it is quite possible that at a given time there is no normal  |
| grace period in progress, in which case the normal grace period       |
| cannot emit a stall warning.                                          |
+-----------------------------------------------------------------------+

The ``synchronize_sched_expedited_wait()`` function loops waiting for
the expedited grace period to end, but with a timeout set to the current
RCU CPU stall-warning time. If this time is exceeded, any CPUs or
``rcu_node`` structures blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.
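
In outline, and with many details omitted or replaced by illustrative
helper names, this loop has the following shape::

   /* Illustrative outline of synchronize_sched_expedited_wait(). */
   unsigned long jiffies_stall = rcu_jiffies_till_stall_check();

   for (;;) {
           ret = wait_event_timeout(wq, exp_gp_done(), jiffies_stall);
           if (ret > 0 || exp_gp_done())
                   return;                 /* Grace period completed. */
           print_exp_stall_info();         /* CPUs and rcu_nodes blocking it. */
           jiffies_stall *= 3;             /* Use a longer stall time next pass. */
   }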

Mid-boot operation
~~~~~~~~~~~~~~~~~~

The use of workqueues has the advantage that the expedited grace-period
code need not worry about POSIX signals. Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until they are
initialized, which does not happen until some time after the scheduler
spawns the first task. Given that there are parts of the kernel that
really do want to execute grace periods during this mid-boot “dead
zone”, expedited grace periods must do something else during this time.

What they do is to fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case before
the use of workqueues. However, the requesting task is only required to
drive the grace period during the mid-boot dead zone. Before mid-boot, a
synchronous grace period is a no-op. Some time after mid-boot,
workqueues are used.

Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot. This is handled by causing non-expedited grace
periods to take the expedited code path during mid-boot.

The current code assumes that there are no POSIX signals during the
mid-boot dead zone. However, if an overwhelming need for POSIX signals
somehow arises, appropriate adjustments can be made to the expedited
stall-warning code. One such adjustment would reinstate the
pre-workqueue stall-warning checks, but only during the mid-boot dead
zone.

With this refinement, synchronous grace periods can now be used from
task context pretty much any time during the life of the kernel, aside
from some points in the suspend, hibernate, or shutdown code path.

Summary
~~~~~~~

Expedited grace periods use a sequence-number approach to promote
batching, so that a single grace-period operation can serve numerous
requests. A funnel lock is used to efficiently identify the one task out
of a concurrent group that will request the grace period. All members of
the group will block on waitqueues provided in the ``rcu_node``
structure. The actual grace-period processing is carried out by a
workqueue.

CPU-hotplug operations are noted lazily in order to prevent the need for
tight synchronization between expedited grace periods and CPU-hotplug
operations. The dyntick-idle counters are used to avoid sending IPIs to
idle CPUs, at least in the common case. RCU-preempt and RCU-sched use
different IPI handlers and different code to respond to the state
changes carried out by those handlers, but otherwise use common code.

Quiescent states are tracked using the ``rcu_node`` tree, and once all
necessary quiescent states have been reported, all tasks waiting on this
expedited grace period are awakened. A pair of mutexes are used to allow
one grace period's wakeups to proceed concurrently with the next grace
period's processing.

This combination of mechanisms allows expedited grace periods to run
reasonably efficiently. However, for non-time-critical tasks, normal
grace periods should be used instead because their longer duration
permits much higher degrees of batching, and thus much lower per-request
overheads.