=================================================
A Tour Through TREE_RCU's Expedited Grace Periods
=================================================

Introduction
============

This document describes RCU's expedited grace periods.
Unlike RCU's normal grace periods, which accept long latencies to attain
high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
Each of the two implementations is covered in its own section.

Expedited Grace Period Design
=============================

The expedited RCU grace periods cannot be accused of being subtle,
given that they for all intents and purposes hammer every CPU that
has not yet provided a quiescent state for the current expedited
grace period.
The one saving grace is that the hammer has grown a bit smaller
over time: The old call to ``try_stop_cpus()`` has been
replaced with a set of calls to ``smp_call_function_single()``,
each of which results in an IPI to the target CPU.
The corresponding handler function checks the CPU's state, motivating
a faster quiescent state where possible, and triggering a report
of that quiescent state.
As always for RCU, once everything has spent some time in a quiescent
state, the expedited grace period has completed.

The details of the ``smp_call_function_single()`` handler's
operation depend on the RCU flavor, as described in the following
sections.
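As a rough illustration of this overall design, the following minimal
sketch (not the kernel's actual code) shows the shape of the per-CPU
sweep.  The ``cpu_already_quiescent()`` helper is hypothetical; the real
selection logic lives in ``sync_rcu_exp_select_cpus()``, and
``rcu_exp_handler()`` is the flavor-specific IPI handler described in
the following sections::

   /* Hypothetical predicate standing in for the real quiescent-state checks. */
   static bool cpu_already_quiescent(int cpu);

   static void expedited_sweep_sketch(void)
   {
           int cpu;

           /* Hammer each CPU that has not yet reported a quiescent state. */
           for_each_online_cpu(cpu) {
                   if (cpu_already_quiescent(cpu))
                           continue;
                   smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
           }
   }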
RCU-preempt Expedited Grace Periods
===================================

``CONFIG_PREEMPT=y`` kernels implement RCU-preempt.
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:

.. kernel-figure:: ExpRCUFlow.svg

The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI
or a state that is reached after some time.

If a given CPU is offline or idle, ``synchronize_rcu_expedited()``
will ignore it because idle and offline CPUs are already residing
in quiescent states.
Otherwise, the expedited grace period will use
``smp_call_function_single()`` to send the CPU an IPI, which
is handled by ``rcu_exp_handler()``.

However, because this is preemptible RCU, ``rcu_exp_handler()``
can check to see if the CPU is currently running in an RCU read-side
critical section.
If not, the handler can immediately report a quiescent state.
Otherwise, it sets flags so that the outermost ``rcu_read_unlock()``
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.
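This decision can be sketched as follows.  The ``rcu_preempt_depth()``
test is real, but the two helpers are hypothetical stand-ins for the
reporting and flag-setting actually done by ``rcu_exp_handler()`` and
the outermost ``rcu_read_unlock()``::

   /* Hypothetical stand-ins for the real reporting and flag-setting code. */
   static void report_exp_qs_this_cpu(void);
   static void defer_exp_qs_to_rcu_read_unlock(struct task_struct *t);

   static void rcu_preempt_exp_handler_sketch(void *unused)
   {
           if (!rcu_preempt_depth()) {
                   /* Not in an RCU read-side critical section: report now. */
                   report_exp_qs_this_cpu();
                   return;
           }
           /* Otherwise, defer the report to the outermost rcu_read_unlock(). */
           defer_exp_qs_to_rcu_read_unlock(current);
   }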
Again because this is preemptible RCU, an RCU read-side critical section
can be preempted.
When that happens, RCU will enqueue the task, which will then continue to
block the current expedited grace period until it resumes and finds its
outermost ``rcu_read_unlock()``.
The CPU will report a quiescent state just after enqueuing the task because
the CPU is no longer blocking the grace period.
It is instead the preempted task doing the blocking.
The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
which is called from ``rcu_preempt_note_context_switch()``, which
in turn is called from ``rcu_note_context_switch()``, which in
turn is called from the scheduler.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not just have the expedited grace period check the state of all   |
| the CPUs? After all, that would avoid all those real-time-unfriendly  |
| IPIs.                                                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we want the RCU read-side critical sections to run fast,      |
| which means no memory barriers. Therefore, it is not possible to      |
| safely check the state from some other CPU. And even if it was        |
| possible to safely check the state, it would still be necessary to    |
| IPI the CPU to safely interact with the upcoming                      |
| ``rcu_read_unlock()`` invocation, which means that the remote state   |
| testing would not help the worst-case latency that real-time          |
| applications care about.                                              |
|                                                                       |
| One way to prevent your real-time application from getting hit with   |
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  |
| would then perceive the CPU running your application as being idle,   |
| and it would be able to safely detect that state without needing to   |
| IPI the CPU.                                                          |
+-----------------------------------------------------------------------+

Please note that this is just the overall flow: Additional complications
can arise due to races with CPUs going idle or offline, among other
things.

RCU-sched Expedited Grace Periods
---------------------------------

``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of
the handling of a given CPU by an RCU-sched expedited grace period is
shown in the following diagram:

.. kernel-figure:: ExpSchedFlow.svg

As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
offline and idle CPUs, again because they are in remotely detectable
quiescent states. However, because ``rcu_read_lock_sched()`` and
``rcu_read_unlock_sched()`` leave no trace of their invocation, in
general it is not possible to tell whether or not the current CPU is in
an RCU read-side critical section. The best that RCU-sched's
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
that the CPU went idle while the IPI was in flight. If the CPU is idle,
then ``rcu_exp_handler()`` reports the quiescent state.

Otherwise, the handler forces a future context switch by setting the
NEED_RESCHED flag in the current task's thread flags and in the CPU's
preempt counter. At the time of the context switch, the CPU reports the
quiescent state. Should the CPU go offline first, it will report the
quiescent state at that time.
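The corresponding decision can be sketched as follows.  The
``set_tsk_need_resched()`` and ``set_preempt_need_resched()`` calls are
real kernel helpers, while the idle test and the quiescent-state report
are hypothetical stand-ins::

   /* Hypothetical stand-ins for the real idle test and quiescent-state report. */
   static bool interrupted_from_idle(void);
   static void report_exp_qs_this_cpu(void);

   static void rcu_sched_exp_handler_sketch(void *unused)
   {
           if (interrupted_from_idle()) {
                   /* The IPI caught the CPU in the idle loop: report immediately. */
                   report_exp_qs_this_cpu();
                   return;
           }
           /* Otherwise, force a context switch; the report happens then. */
           set_tsk_need_resched(current);
           set_preempt_need_resched();
   }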
Expedited Grace Period and CPU Hotplug
--------------------------------------

The expedited nature of expedited grace periods requires a much tighter
interaction with CPU hotplug operations than is required for normal
grace periods. In addition, attempting to IPI offline CPUs will result
in splats, but failing to IPI online CPUs can result in too-short grace
periods. Neither option is acceptable in production kernels.

The interaction between expedited grace periods and CPU hotplug
operations is carried out at several levels:

#. The number of CPUs that have ever been online is tracked by the
   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
   structure's ``->ncpus_snap`` field tracks the number of CPUs that
   have ever been online at the beginning of an RCU expedited grace
   period. Note that this number never decreases, at least in the
   absence of a time machine.
#. The identities of the CPUs that have ever been online are tracked by
   the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
   identities of the CPUs that were online at least once at the
   beginning of the most recent RCU expedited grace period. The
   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
   used to detect when new CPUs have come online for the first time,
   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
   field has changed since the beginning of the last RCU expedited grace
   period, which triggers an update of each ``rcu_node`` structure's
   ``->expmaskinit`` field from its ``->expmaskinitnext`` field (see the
   sketch following this list).
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
   initialize that structure's ``->expmask`` at the beginning of each
   RCU expedited grace period. This means that only those CPUs that have
   been online at least once will be considered for a given grace
   period.
#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
   clear can safely be ignored. However, it is possible for a CPU coming
   online or going offline to have this bit set for some time while
   ``cpu_online`` returns ``false``.
#. For each non-idle CPU that RCU believes is currently online, the
   grace period invokes ``smp_call_function_single()``. If this
   succeeds, the CPU was fully online. Failure indicates that the CPU is
   in the process of coming online or going offline, in which case it is
   necessary to wait for a short time period and try again. The purpose
   of this wait (or series of waits, as the case may be) is to permit a
   concurrent CPU-hotplug operation to complete.
#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
   to invoke ``rcu_report_dead()``, which reports a quiescent state for
   that CPU. However, this is likely paranoia-induced redundancy.
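The lazy propagation described in the first three items can be sketched
as follows.  The ``rcu_state`` and ``rcu_node`` fields are the ones
named above, but the traversal, the locking, and the upward propagation
through the combining tree are all heavily simplified::

   static void sync_exp_reset_tree_hotplug_sketch(struct rcu_state *rsp)
   {
           struct rcu_node *rnp;

           /* Nothing to do unless a CPU came online since the last snapshot. */
           if (READ_ONCE(rsp->ncpus) == rsp->ncpus_snap)
                   return;
           rsp->ncpus_snap = rsp->ncpus;

           /* Illustrative traversal; the kernel uses its own rcu_node iterators. */
           for (rnp = &rsp->node[0]; rnp < &rsp->node[rcu_num_nodes]; rnp++)
                   rnp->expmaskinit = rnp->expmaskinitnext;
   }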
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why all the dancing around with multiple counters and masks tracking  |
| CPUs that were once online? Why not just have a single set of masks   |
| tracking the currently online CPUs and be done with it?               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Maintaining a single set of masks tracking the online CPUs *sounds*   |
| easier, at least until you try working out all the race conditions    |
| between grace-period initialization and CPU-hotplug operations. For   |
| example, suppose initialization is progressing down the tree while a  |
| CPU-offline operation is progressing up the tree. This situation can  |
| result in bits set at the top of the tree that have no counterparts   |
| at the bottom of the tree. Those bits will never be cleared, which    |
| will result in grace-period hangs. In short, that way lies madness,   |
| to say nothing of a great many bugs, hangs, and deadlocks.            |
| In contrast, the current multi-mask multi-counter scheme ensures that |
| grace-period initialization will always see consistent masks up and   |
| down the tree, which brings significant simplifications over the      |
| single-mask method.                                                   |
|                                                                       |
| This is an instance of `deferring work in order to avoid              |
| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
| ports/reports-1992/cucs-039-92.ps.gz>`__.                             |
| Lazily recording CPU-hotplug events at the beginning of the next      |
| grace period greatly simplifies maintenance of the CPU-tracking       |
| bitmasks in the ``rcu_node`` tree.                                    |
+-----------------------------------------------------------------------+

Expedited Grace Period Refinements
----------------------------------

Idle-CPU Checks
~~~~~~~~~~~~~~~

Each expedited grace period checks for idle CPUs when initially forming
the mask of CPUs to be IPIed and again just before IPIing a CPU (both
checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
idle at any time between those two times, the CPU will not be IPIed.
Instead, the task pushing the grace period forward will include the idle
CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.

For RCU-sched, there is an additional check: If the IPI has interrupted
the idle loop, then ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state.

For RCU-preempt, there is no specific check for idle in the IPI handler
(``rcu_exp_handler()``), but because RCU read-side critical sections are
not permitted within the idle loop, if ``rcu_exp_handler()`` sees that
the CPU is within an RCU read-side critical section, the CPU cannot
possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
regardless of whether or not that quiescent state was due to the CPU
being idle.

In summary, RCU expedited grace periods check for idle when building the
bitmask of CPUs that must be IPIed, just before sending each IPI, and
(either explicitly or implicitly) within the IPI handler.
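The selection-time portion of these checks can be sketched as follows.
The idle test and the per-CPU bit helper are hypothetical, and the
argument list of ``rcu_report_exp_cpu_mult()`` is abbreviated relative
to the kernel's own code::

   /* Hypothetical helpers for the idle test and the CPU's bit in ->expmask. */
   static bool cpu_appears_idle(int cpu);
   static unsigned long cpu_exp_mask_bit(struct rcu_node *rnp, int cpu);

   static void select_exp_cpus_sketch(struct rcu_node *rnp)
   {
           unsigned long mask_idle = 0;
           int cpu;

           for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
                   if (cpu_appears_idle(cpu)) {
                           /* Idle CPUs are not IPIed... */
                           mask_idle |= cpu_exp_mask_bit(rnp, cpu);
                           continue;
                   }
                   /* ...and the real code checks for idle again just before the IPI. */
                   smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
           }
           /* Report quiescent states on behalf of the idle CPUs. */
           if (mask_idle)
                   rcu_report_exp_cpu_mult(rnp, mask_idle);
   }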
Batching via Sequence Counter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If each grace-period request was carried out separately, expedited grace
periods would have abysmal scalability and problematic high-load
characteristics. Because each grace-period operation can serve an
unlimited number of updates, it is important to *batch* requests, so
that a single expedited grace-period operation will cover all requests
in the corresponding batch.

This batching is controlled by a sequence counter named
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
has an odd value when there is an expedited grace period in progress and
an even value otherwise, so that dividing the counter value by two gives
the number of completed grace periods. During any given update request,
the counter must transition from even to odd and then back to even, thus
indicating that a grace period has elapsed. Therefore, if the initial
value of the counter is ``s``, the updater must wait until the counter
reaches at least the value ``(s+3)&~0x1``. This counter is managed by
the following access functions:

#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
   grace period.
#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
   period.
#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
   grace period has elapsed since the corresponding call to
   ``rcu_exp_gp_seq_snap()``.
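A minimal sketch of the arithmetic behind these accessors follows; the
kernel's real accessors also handle memory ordering and counter wrap,
both of which are omitted here::

   static unsigned long exp_seq;  /* stand-in for the ->expedited_sequence field */

   static void exp_gp_seq_start_sketch(void)
   {
           exp_seq++;      /* odd: expedited grace period in progress */
   }

   static void exp_gp_seq_end_sketch(void)
   {
           exp_seq++;      /* even: expedited grace period complete */
   }

   static unsigned long exp_gp_seq_snap_sketch(void)
   {
           /* First even value guaranteeing a full even->odd->even transition. */
           return (READ_ONCE(exp_seq) + 3) & ~0x1UL;
   }

   static bool exp_gp_seq_done_sketch(unsigned long s)
   {
           /* The kernel uses a wrap-safe comparison here. */
           return READ_ONCE(exp_seq) >= s;
   }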
Again, only one request in a given batch need actually carry out a
grace-period operation, which means there must be an efficient way to
identify which of many concurrent requests will initiate the grace
period, and that there be an efficient way for the remaining requests to
wait for that grace period to complete. However, that is the topic of
the next section.

Funnel Locking and Wait/Wakeup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The natural way to sort out which of a batch of updaters will initiate
the expedited grace period is to use the ``rcu_node`` combining tree, as
implemented by the ``exp_funnel_lock()`` function. The first updater
corresponding to a given grace period arriving at a given ``rcu_node``
structure records its desired grace-period sequence number in the
``->exp_seq_rq`` field and moves up to the next level in the tree.
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
number for the desired grace period or some later one, the updater
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
second-from-bottom and third-from-bottom bits as an index. An
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
to these fields.

An empty ``rcu_node`` tree is shown in the following diagram, with the
white cells representing the ``->exp_seq_rq`` field and the red cells
representing the elements of the ``->exp_wq[]`` array.

.. kernel-figure:: Funnel0.svg

The next diagram shows the situation after the arrival of Task A and
Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
respectively. The current value of the ``rcu_state`` structure's
``->expedited_sequence`` field is zero, so adding three and clearing the
bottom bit results in the value two, which both tasks record in the
``->exp_seq_rq`` field of their respective ``rcu_node`` structures:

.. kernel-figure:: Funnel1.svg

Each of Tasks A and B will move up to the root ``rcu_node`` structure.
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:

.. kernel-figure:: Funnel2.svg

Task A now advances to initiate a new grace period, while Task B moves
up to the root ``rcu_node`` structure, and, seeing that its desired
sequence number is already recorded, blocks on ``->exp_wq[1]``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
| sequence number is two, shouldn't they instead block on               |
| ``->exp_wq[2]``?                                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No.                                                                   |
| Recall that the bottom bit of the desired sequence number indicates   |
| whether or not a grace period is currently in progress. It is         |
| therefore necessary to shift the sequence number right one bit        |
| position to obtain the number of the grace period. This results in    |
| ``->exp_wq[1]``.                                                      |
+-----------------------------------------------------------------------+
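In code, the index computation described in this answer amounts to the
following sketch::

   /*
    * Drop the in-progress bit, then use the next two bits of the desired
    * sequence number to select one of the four ->exp_wq[] entries.
    */
   static int exp_wq_index_sketch(unsigned long s)
   {
           return (s >> 1) & 0x3;  /* s == 2, as in this example, yields 1 */
   }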
If Tasks C and D also arrive at this point, they will compute the same
desired grace-period sequence number, and see that both leaf
``rcu_node`` structures already have that value recorded. They will
therefore block on their respective ``rcu_node`` structures'
``->exp_wq[1]`` fields, as shown below:

.. kernel-figure:: Funnel3.svg

Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
initiates the grace period, which increments ``->expedited_sequence``.
Therefore, if Tasks E and F arrive, they will compute a desired sequence
number of 4 and will record this value as shown below:

.. kernel-figure:: Funnel4.svg

Tasks E and F will propagate up the ``rcu_node`` combining tree, with
Task F blocking on the root ``rcu_node`` structure and Task E waiting
for Task A to finish so that it can start the next grace period. The
resulting state is as shown below:

.. kernel-figure:: Funnel5.svg

Once the grace period completes, Task A starts waking up the tasks
waiting for this grace period to complete, increments the
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
releases the ``->exp_mutex``. This results in the following state:

.. kernel-figure:: Funnel6.svg

Task E can then acquire ``->exp_mutex`` and increment
``->expedited_sequence`` to the value three. If new tasks G and H arrive
and move up the combining tree at the same time, the state will be as
follows:

.. kernel-figure:: Funnel7.svg

Note that three of the root ``rcu_node`` structure's waitqueues are now
occupied. However, at some point, Task A will wake up the tasks blocked
on the ``->exp_wq`` waitqueues, resulting in the following state:

.. kernel-figure:: Funnel8.svg

Execution will continue with Tasks E and H completing their grace
periods and carrying out their wakeups.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What happens if Task A takes so long to do its wakeups that Task E's  |
| grace period completes?                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
| the next grace period from starting. This last is important in        |
| preventing overflow of the ``->exp_wq[]`` array.                      |
+-----------------------------------------------------------------------+
Use of Workqueues
~~~~~~~~~~~~~~~~~

In earlier implementations, the task requesting the expedited grace
period also drove it to completion. This straightforward approach had
the disadvantage of needing to account for POSIX signals sent to user
tasks, so more recent implementations use the Linux kernel's
`workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__.

The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock does a
``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a
workqueue kthread does the actual grace-period processing. Because
workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals. In addition, this approach
allows wakeups for the previous expedited grace period to be overlapped
with processing for the next expedited grace period. Because there are
only four sets of waitqueues, it is necessary to ensure that the
previous grace period's wakeups complete before the next grace period's
wakeups start. This is handled by having the ``->exp_mutex`` guard
expedited grace-period processing and the ``->exp_wake_mutex`` guard
wakeups. The key point is that the ``->exp_mutex`` is not released until
the first wakeup is complete, which means that the ``->exp_wake_mutex``
has already been acquired at that point. This approach ensures that the
previous grace period's wakeups can be carried out while the current
grace period is in process, but that these wakeups will complete before
the next grace period starts. This means that only three waitqueues are
required, guaranteeing that the four that are provided are sufficient.
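The hand-off from the requesting task to a workqueue kthread can be
sketched as follows.  The structure and the two forward-declared helpers
are illustrative stand-ins for the kernel's own handling in
``_synchronize_rcu_expedited()``::

   /* Hypothetical helpers: run the grace period, then wake ->exp_wq[] waiters. */
   static void drive_expedited_gp(unsigned long s);
   static void wake_exp_waiters(unsigned long s);

   struct exp_gp_request_sketch {
           struct work_struct work;
           unsigned long s;                /* snapshotted sequence number */
   };

   static void exp_wait_wake_sketch(struct work_struct *wp)
   {
           struct exp_gp_request_sketch *rp =
                   container_of(wp, struct exp_gp_request_sketch, work);

           drive_expedited_gp(rp->s);
           wake_exp_waiters(rp->s);
   }

   static void request_exp_gp_sketch(unsigned long s)
   {
           struct exp_gp_request_sketch req = { .s = s };

           INIT_WORK_ONSTACK(&req.work, exp_wait_wake_sketch);
           schedule_work(&req.work);       /* a workqueue kthread drives the GP */
           /* The requester then sleeps on the appropriate ->exp_wq[] entry. */
   }

An on-stack work item suffices in this sketch because the requesting
task does not return until it has been awakened, by which time the
workqueue handler is finished with the item.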
Stall Warnings
~~~~~~~~~~~~~~

Expediting grace periods does nothing to speed things up when RCU
readers take too long, and therefore expedited grace periods check for
stalls just as normal grace periods do.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But why not just let the normal grace-period machinery detect the     |
| stalls, given that a given reader must block both normal and          |
| expedited grace periods?                                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because it is quite possible that at a given time there is no normal  |
| grace period in progress, in which case the normal grace period       |
| cannot emit a stall warning.                                          |
+-----------------------------------------------------------------------+

The ``synchronize_sched_expedited_wait()`` function loops waiting for
the expedited grace period to end, but with a timeout set to the current
RCU CPU stall-warning time. If this time is exceeded, any CPUs or
``rcu_node`` structures blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.

Mid-boot operation
~~~~~~~~~~~~~~~~~~

The use of workqueues has the advantage that the expedited grace-period
code need not worry about POSIX signals. Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until they are
initialized, which does not happen until some time after the scheduler
spawns the first task. Given that there are parts of the kernel that
really do want to execute grace periods during this mid-boot “dead
zone”, expedited grace periods must do something else during this time.

What they do is to fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case before
the use of workqueues. However, the requesting task is only required to
drive the grace period during the mid-boot dead zone. Before mid-boot, a
synchronous grace period is a no-op. Some time after mid-boot,
workqueues are used.
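This fallback can be sketched as follows.  The ``rcu_scheduler_active``
variable and the ``RCU_SCHEDULER_INIT`` value are real kernel symbols,
and ``exp_wait_wake_sketch()`` is the illustrative handler from the
workqueue sketch above::

   static void request_exp_gp_midboot_sketch(struct work_struct *wp)
   {
           /*
            * Before mid-boot, synchronous grace periods are no-ops and
            * never reach this point.
            */
           if (rcu_scheduler_active == RCU_SCHEDULER_INIT) {
                   /* Mid-boot dead zone: the requesting task drives the GP itself. */
                   exp_wait_wake_sketch(wp);
           } else {
                   /* Workqueues are available: hand the grace period off as usual. */
                   schedule_work(wp);
           }
   }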
Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot. This is handled by causing non-expedited grace
periods to take the expedited code path during mid-boot.

The current code assumes that there are no POSIX signals during the
mid-boot dead zone. However, if an overwhelming need for POSIX signals
somehow arises, appropriate adjustments can be made to the expedited
stall-warning code. One such adjustment would reinstate the
pre-workqueue stall-warning checks, but only during the mid-boot dead
zone.

With this refinement, synchronous grace periods can now be used from
task context pretty much any time during the life of the kernel, aside
from some points in the suspend, hibernate, or shutdown code paths.

Summary
~~~~~~~

Expedited grace periods use a sequence-number approach to promote
batching, so that a single grace-period operation can serve numerous
requests. A funnel lock is used to efficiently identify the one task out
of a concurrent group that will request the grace period. All members of
the group will block on waitqueues provided in the ``rcu_node``
structure. The actual grace-period processing is carried out by a
workqueue.

CPU-hotplug operations are noted lazily in order to prevent the need for
tight synchronization between expedited grace periods and CPU-hotplug
operations. The dyntick-idle counters are used to avoid sending IPIs to
idle CPUs, at least in the common case. RCU-preempt and RCU-sched use
different IPI handlers and different code to respond to the state
changes carried out by those handlers, but otherwise use common code.

Quiescent states are tracked using the ``rcu_node`` tree, and once all
necessary quiescent states have been reported, all tasks waiting on this
expedited grace period are awakened. A pair of mutexes are used to allow
one grace period's wakeups to proceed concurrently with the next grace
period's processing.

This combination of mechanisms allows expedited grace periods to run
reasonably efficiently. However, for non-time-critical tasks, normal
grace periods should be used instead because their longer duration
permits much higher degrees of batching, and thus much lower per-request
overheads.