=================================================
A Tour Through TREE_RCU's Expedited Grace Periods
=================================================

Introduction
============

This document describes RCU's expedited grace periods.
Unlike RCU's normal grace periods, which accept long latencies to attain
high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
Each of the two implementations is covered in its own section.

Expedited Grace Period Design
=============================

The expedited RCU grace periods cannot be accused of being subtle,
given that they for all intents and purposes hammer every CPU that
has not yet provided a quiescent state for the current expedited
grace period.
The one saving grace is that the hammer has grown a bit smaller
over time: The old call to ``try_stop_cpus()`` has been
replaced with a set of calls to ``smp_call_function_single()``,
each of which results in an IPI to the target CPU.
The corresponding handler function checks the CPU's state, motivating
a faster quiescent state where possible, and triggering a report
of that quiescent state.
As always for RCU, once everything has spent some time in a quiescent
state, the expedited grace period has completed.

The details of the ``smp_call_function_single()`` handler's
operation depend on the RCU flavor, as described in the following
sections.
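As a rough illustration of this overall design, the following minimal
sketch (not the kernel's actual code) shows the shape of the per-CPU
sweep.  The ``cpu_already_quiescent()`` helper is hypothetical; the real
selection logic lives in ``sync_rcu_exp_select_cpus()``, and
``rcu_exp_handler()`` is the flavor-specific IPI handler described in
the following sections::

   /* Hypothetical predicate standing in for the real quiescent-state checks. */
   static bool cpu_already_quiescent(int cpu);

   static void expedited_sweep_sketch(void)
   {
           int cpu;

           /* Hammer each CPU that has not yet reported a quiescent state. */
           for_each_online_cpu(cpu) {
                   if (cpu_already_quiescent(cpu))
                           continue;
                   smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
           }
   }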
RCU-preempt Expedited Grace Periods
===================================

``CONFIG_PREEMPT=y`` kernels implement RCU-preempt.
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:

.. kernel-figure:: ExpRCUFlow.svg

The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI
or a state that is reached after some time.

If a given CPU is offline or idle, ``synchronize_rcu_expedited()``
will ignore it because idle and offline CPUs are already residing
in quiescent states.
Otherwise, the expedited grace period will use
``smp_call_function_single()`` to send the CPU an IPI, which
is handled by ``rcu_exp_handler()``.

However, because this is preemptible RCU, ``rcu_exp_handler()``
can check to see if the CPU is currently running in an RCU read-side
critical section.
If not, the handler can immediately report a quiescent state.
Otherwise, it sets flags so that the outermost ``rcu_read_unlock()``
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.
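This decision can be sketched as follows.  The ``rcu_preempt_depth()``
test is real, but the two helpers are hypothetical stand-ins for the
reporting and flag-setting actually done by ``rcu_exp_handler()`` and
the outermost ``rcu_read_unlock()``::

   /* Hypothetical stand-ins for the real reporting and flag-setting code. */
   static void report_exp_qs_this_cpu(void);
   static void defer_exp_qs_to_rcu_read_unlock(struct task_struct *t);

   static void rcu_preempt_exp_handler_sketch(void *unused)
   {
           if (!rcu_preempt_depth()) {
                   /* Not in an RCU read-side critical section: report now. */
                   report_exp_qs_this_cpu();
                   return;
           }
           /* Otherwise, defer the report to the outermost rcu_read_unlock(). */
           defer_exp_qs_to_rcu_read_unlock(current);
   }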
Again because this is preemptible RCU, an RCU read-side critical section
can be preempted.
When that happens, RCU will enqueue the task, which will then continue to
block the current expedited grace period until it resumes and finds its
outermost ``rcu_read_unlock()``.
The CPU will report a quiescent state just after enqueuing the task because
the CPU is no longer blocking the grace period.
It is instead the preempted task doing the blocking.
The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
which is called from ``rcu_preempt_note_context_switch()``, which
in turn is called from ``rcu_note_context_switch()``, which in
turn is called from the scheduler.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not just have the expedited grace period check the state of all   |
| the CPUs? After all, that would avoid all those real-time-unfriendly  |
| IPIs.                                                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we want the RCU read-side critical sections to run fast,      |
| which means no memory barriers. Therefore, it is not possible to      |
| safely check the state from some other CPU. And even if it was        |
| possible to safely check the state, it would still be necessary to    |
| IPI the CPU to safely interact with the upcoming                      |
| ``rcu_read_unlock()`` invocation, which means that the remote state   |
| testing would not help the worst-case latency that real-time          |
| applications care about.                                              |
|                                                                       |
| One way to prevent your real-time application from getting hit with   |
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  |
| would then perceive the CPU running your application as being idle,   |
| and it would be able to safely detect that state without needing to   |
| IPI the CPU.                                                          |
+-----------------------------------------------------------------------+

Please note that this is just the overall flow: Additional complications
can arise due to races with CPUs going idle or offline, among other
things.

RCU-sched Expedited Grace Periods
---------------------------------

``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of
the handling of a given CPU by an RCU-sched expedited grace period is
shown in the following diagram:

.. kernel-figure:: ExpSchedFlow.svg

As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
offline and idle CPUs, again because they are in remotely detectable
quiescent states. However, because ``rcu_read_lock_sched()`` and
``rcu_read_unlock_sched()`` leave no trace of their invocation, in
general it is not possible to tell whether or not the current CPU is in
an RCU read-side critical section. The best that RCU-sched's
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
that the CPU went idle while the IPI was in flight. If the CPU is idle,
then ``rcu_exp_handler()`` reports the quiescent state.

Otherwise, the handler forces a future context switch by setting the
NEED_RESCHED flag in the current task's thread flags and in the CPU's
preempt counter. At the time of the context switch, the CPU reports the
quiescent state. Should the CPU go offline first, it will report the
quiescent state at that time.
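The corresponding decision can be sketched as follows.  The
``set_tsk_need_resched()`` and ``set_preempt_need_resched()`` calls are
real kernel helpers, while the idle test and the quiescent-state report
are hypothetical stand-ins::

   /* Hypothetical stand-ins for the real idle test and quiescent-state report. */
   static bool interrupted_from_idle(void);
   static void report_exp_qs_this_cpu(void);

   static void rcu_sched_exp_handler_sketch(void *unused)
   {
           if (interrupted_from_idle()) {
                   /* The IPI caught the CPU in the idle loop: report immediately. */
                   report_exp_qs_this_cpu();
                   return;
           }
           /* Otherwise, force a context switch; the report happens then. */
           set_tsk_need_resched(current);
           set_preempt_need_resched();
   }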
Expedited Grace Period and CPU Hotplug
--------------------------------------

The expedited nature of expedited grace periods requires a much tighter
interaction with CPU hotplug operations than is required for normal
grace periods. In addition, attempting to IPI offline CPUs will result
in splats, but failing to IPI online CPUs can result in too-short grace
periods. Neither option is acceptable in production kernels.

The interaction between expedited grace periods and CPU hotplug
operations is carried out at several levels:

#. The number of CPUs that have ever been online is tracked by the
   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
   structure's ``->ncpus_snap`` field tracks the number of CPUs that
   have ever been online at the beginning of an RCU expedited grace
   period. Note that this number never decreases, at least in the
   absence of a time machine.
#. The identities of the CPUs that have ever been online are tracked by
   the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
   identities of the CPUs that were online at least once at the
   beginning of the most recent RCU expedited grace period. The
   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
   used to detect when new CPUs have come online for the first time,
   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
   field has changed since the beginning of the last RCU expedited grace
   period, which triggers an update of each ``rcu_node`` structure's
   ``->expmaskinit`` field from its ``->expmaskinitnext`` field (see the
   sketch following this list).
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
   initialize that structure's ``->expmask`` at the beginning of each
   RCU expedited grace period. This means that only those CPUs that have
   been online at least once will be considered for a given grace
   period.
#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
   clear can safely be ignored. However, it is possible for a CPU coming
   online or going offline to have this bit set for some time while
   ``cpu_online`` returns ``false``.
#. For each non-idle CPU that RCU believes is currently online, the
   grace period invokes ``smp_call_function_single()``. If this
   succeeds, the CPU was fully online. Failure indicates that the CPU is
   in the process of coming online or going offline, in which case it is
   necessary to wait for a short time period and try again. The purpose
   of this wait (or series of waits, as the case may be) is to permit a
   concurrent CPU-hotplug operation to complete.
#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
   to invoke ``rcu_report_dead()``, which reports a quiescent state for
   that CPU. However, this is likely paranoia-induced redundancy.
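The lazy propagation described in the first three items can be sketched
as follows.  The ``rcu_state`` and ``rcu_node`` fields are the ones
named above, but the traversal, the locking, and the upward propagation
through the combining tree are all heavily simplified::

   static void sync_exp_reset_tree_hotplug_sketch(struct rcu_state *rsp)
   {
           struct rcu_node *rnp;

           /* Nothing to do unless a CPU came online since the last snapshot. */
           if (READ_ONCE(rsp->ncpus) == rsp->ncpus_snap)
                   return;
           rsp->ncpus_snap = rsp->ncpus;

           /* Illustrative traversal; the kernel uses its own rcu_node iterators. */
           for (rnp = &rsp->node[0]; rnp < &rsp->node[rcu_num_nodes]; rnp++)
                   rnp->expmaskinit = rnp->expmaskinitnext;
   }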
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why all the dancing around with multiple counters and masks tracking  |
| CPUs that were once online? Why not just have a single set of masks   |
| tracking the currently online CPUs and be done with it?               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Maintaining a single set of masks tracking the online CPUs *sounds*   |
| easier, at least until you try working out all the race conditions    |
| between grace-period initialization and CPU-hotplug operations. For   |
| example, suppose initialization is progressing down the tree while a  |
| CPU-offline operation is progressing up the tree. This situation can  |
| result in bits set at the top of the tree that have no counterparts   |
| at the bottom of the tree. Those bits will never be cleared, which    |
| will result in grace-period hangs. In short, that way lies madness,   |
| to say nothing of a great many bugs, hangs, and deadlocks.            |
| In contrast, the current multi-mask multi-counter scheme ensures that |
| grace-period initialization will always see consistent masks up and   |
| down the tree, which brings significant simplifications over the      |
| single-mask method.                                                   |
|                                                                       |
| This is an instance of `deferring work in order to avoid              |
| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
| ports/reports-1992/cucs-039-92.ps.gz>`__.                             |
| Lazily recording CPU-hotplug events at the beginning of the next      |
| grace period greatly simplifies maintenance of the CPU-tracking       |
| bitmasks in the ``rcu_node`` tree.                                    |
+-----------------------------------------------------------------------+

Expedited Grace Period Refinements
----------------------------------

Idle-CPU Checks
~~~~~~~~~~~~~~~

Each expedited grace period checks for idle CPUs when initially forming
the mask of CPUs to be IPIed and again just before IPIing a CPU (both
checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
idle at any time between those two times, the CPU will not be IPIed.
Instead, the task pushing the grace period forward will include the idle
CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.

For RCU-sched, there is an additional check: If the IPI has interrupted
the idle loop, then ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state.

For RCU-preempt, there is no specific check for idle in the IPI handler
(``rcu_exp_handler()``), but because RCU read-side critical sections are
not permitted within the idle loop, if ``rcu_exp_handler()`` sees that
the CPU is within an RCU read-side critical section, the CPU cannot
possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
regardless of whether or not that quiescent state was due to the CPU
being idle.

In summary, RCU expedited grace periods check for idle when building the
bitmask of CPUs that must be IPIed, just before sending each IPI, and
(either explicitly or implicitly) within the IPI handler.
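The selection-time portion of these checks can be sketched as follows.
The idle test and the per-CPU bit helper are hypothetical, and the
argument list of ``rcu_report_exp_cpu_mult()`` is abbreviated relative
to the kernel's own code::

   /* Hypothetical helpers for the idle test and the CPU's bit in ->expmask. */
   static bool cpu_appears_idle(int cpu);
   static unsigned long cpu_exp_mask_bit(struct rcu_node *rnp, int cpu);

   static void select_exp_cpus_sketch(struct rcu_node *rnp)
   {
           unsigned long mask_idle = 0;
           int cpu;

           for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
                   if (cpu_appears_idle(cpu)) {
                           /* Idle CPUs are not IPIed... */
                           mask_idle |= cpu_exp_mask_bit(rnp, cpu);
                           continue;
                   }
                   /* ...and the real code checks for idle again just before the IPI. */
                   smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
           }
           /* Report quiescent states on behalf of the idle CPUs. */
           if (mask_idle)
                   rcu_report_exp_cpu_mult(rnp, mask_idle);
   }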
Batching via Sequence Counter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If each grace-period request was carried out separately, expedited grace
periods would have abysmal scalability and problematic high-load
characteristics. Because each grace-period operation can serve an
unlimited number of updates, it is important to *batch* requests, so
that a single expedited grace-period operation will cover all requests
in the corresponding batch.

This batching is controlled by a sequence counter named
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
has an odd value when there is an expedited grace period in progress and
an even value otherwise, so that dividing the counter value by two gives
the number of completed grace periods. During any given update request,
the counter must transition from even to odd and then back to even, thus
indicating that a grace period has elapsed. Therefore, if the initial
value of the counter is ``s``, the updater must wait until the counter
reaches at least the value ``(s+3)&~0x1``. This counter is managed by
the following access functions:

#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
   grace period.
#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
   period.
#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
   grace period has elapsed since the corresponding call to
   ``rcu_exp_gp_seq_snap()``.
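A minimal sketch of the arithmetic behind these accessors follows; the
kernel's real accessors also handle memory ordering and counter wrap,
both of which are omitted here::

   static unsigned long exp_seq;  /* stand-in for the ->expedited_sequence field */

   static void exp_gp_seq_start_sketch(void)
   {
           exp_seq++;      /* odd: expedited grace period in progress */
   }

   static void exp_gp_seq_end_sketch(void)
   {
           exp_seq++;      /* even: expedited grace period complete */
   }

   static unsigned long exp_gp_seq_snap_sketch(void)
   {
           /* First even value guaranteeing a full even->odd->even transition. */
           return (READ_ONCE(exp_seq) + 3) & ~0x1UL;
   }

   static bool exp_gp_seq_done_sketch(unsigned long s)
   {
           /* The kernel uses a wrap-safe comparison here. */
           return READ_ONCE(exp_seq) >= s;
   }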
Again, only one request in a given batch need actually carry out a
grace-period operation, which means there must be an efficient way to
identify which of many concurrent requests will initiate the grace
period, and that there be an efficient way for the remaining requests to
wait for that grace period to complete. However, that is the topic of
the next section.

Funnel Locking and Wait/Wakeup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The natural way to sort out which of a batch of updaters will initiate
the expedited grace period is to use the ``rcu_node`` combining tree, as
implemented by the ``exp_funnel_lock()`` function. The first updater
corresponding to a given grace period arriving at a given ``rcu_node``
structure records its desired grace-period sequence number in the
``->exp_seq_rq`` field and moves up to the next level in the tree.
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
number for the desired grace period or some later one, the updater
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
second-from-bottom and third-from-bottom bits as an index. An
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
to these fields.

An empty ``rcu_node`` tree is shown in the following diagram, with the
white cells representing the ``->exp_seq_rq`` field and the red cells
representing the elements of the ``->exp_wq[]`` array.

.. kernel-figure:: Funnel0.svg

The next diagram shows the situation after the arrival of Task A and
Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
respectively. The current value of the ``rcu_state`` structure's
``->expedited_sequence`` field is zero, so adding three and clearing the
bottom bit results in the value two, which both tasks record in the
``->exp_seq_rq`` field of their respective ``rcu_node`` structures:

.. kernel-figure:: Funnel1.svg

Each of Tasks A and B will move up to the root ``rcu_node`` structure.
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:

.. kernel-figure:: Funnel2.svg

Task A now advances to initiate a new grace period, while Task B moves
up to the root ``rcu_node`` structure, and, seeing that its desired
sequence number is already recorded, blocks on ``->exp_wq[1]``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
| sequence number is two, shouldn't they instead block on               |
| ``->exp_wq[2]``?                                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No.                                                                   |
| Recall that the bottom bit of the desired sequence number indicates   |
| whether or not a grace period is currently in progress. It is         |
| therefore necessary to shift the sequence number right one bit        |
| position to obtain the number of the grace period. This results in    |
| ``->exp_wq[1]``.                                                      |
+-----------------------------------------------------------------------+
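In code, the index computation described in this answer amounts to the
following sketch::

   /*
    * Drop the in-progress bit, then use the next two bits of the desired
    * sequence number to select one of the four ->exp_wq[] entries.
    */
   static int exp_wq_index_sketch(unsigned long s)
   {
           return (s >> 1) & 0x3;  /* s == 2, as in this example, yields 1 */
   }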
If Tasks C and D also arrive at this point, they will compute the same
desired grace-period sequence number, and see that both leaf
``rcu_node`` structures already have that value recorded. They will
therefore block on their respective ``rcu_node`` structures'
``->exp_wq[1]`` fields, as shown below:

.. kernel-figure:: Funnel3.svg

Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
initiates the grace period, which increments ``->expedited_sequence``.
Therefore, if Tasks E and F arrive, they will compute a desired sequence
number of 4 and will record this value as shown below:

.. kernel-figure:: Funnel4.svg

Tasks E and F will propagate up the ``rcu_node`` combining tree, with
Task F blocking on the root ``rcu_node`` structure and Task E waiting
for Task A to finish so that it can start the next grace period. The
resulting state is as shown below:

.. kernel-figure:: Funnel5.svg

Once the grace period completes, Task A starts waking up the tasks
waiting for this grace period to complete, increments the
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
releases the ``->exp_mutex``. This results in the following state:

.. kernel-figure:: Funnel6.svg

Task E can then acquire ``->exp_mutex`` and increment
``->expedited_sequence`` to the value three. If new tasks G and H arrive
and move up the combining tree at the same time, the state will be as
follows:

.. kernel-figure:: Funnel7.svg

Note that three of the root ``rcu_node`` structure's waitqueues are now
occupied. However, at some point, Task A will wake up the tasks blocked
on the ``->exp_wq`` waitqueues, resulting in the following state:

.. kernel-figure:: Funnel8.svg

Execution will continue with Tasks E and H completing their grace
periods and carrying out their wakeups.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What happens if Task A takes so long to do its wakeups that Task E's  |
| grace period completes?                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
| the next grace period from starting. This last is important in        |
| preventing overflow of the ``->exp_wq[]`` array.                      |
+-----------------------------------------------------------------------+
Use of Workqueues
~~~~~~~~~~~~~~~~~

In earlier implementations, the task requesting the expedited grace
period also drove it to completion. This straightforward approach had
the disadvantage of needing to account for POSIX signals sent to user
tasks, so more recent implementations use the Linux kernel's
`workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__.

The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock does a
``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a
workqueue kthread does the actual grace-period processing. Because
workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals. In addition, this approach
allows wakeups for the previous expedited grace period to be overlapped
with processing for the next expedited grace period. Because there are
only four sets of waitqueues, it is necessary to ensure that the
previous grace period's wakeups complete before the next grace period's
wakeups start. This is handled by having the ``->exp_mutex`` guard
expedited grace-period processing and the ``->exp_wake_mutex`` guard
wakeups. The key point is that the ``->exp_mutex`` is not released until
the first wakeup is complete, which means that the ``->exp_wake_mutex``
has already been acquired at that point. This approach ensures that the
previous grace period's wakeups can be carried out while the current
grace period is in process, but that these wakeups will complete before
the next grace period starts. This means that only three waitqueues are
required, guaranteeing that the four that are provided are sufficient.
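The hand-off from the requesting task to a workqueue kthread can be
sketched as follows.  The structure and the two forward-declared helpers
are illustrative stand-ins for the kernel's own handling in
``_synchronize_rcu_expedited()``::

   /* Hypothetical helpers: run the grace period, then wake ->exp_wq[] waiters. */
   static void drive_expedited_gp(unsigned long s);
   static void wake_exp_waiters(unsigned long s);

   struct exp_gp_request_sketch {
           struct work_struct work;
           unsigned long s;                /* snapshotted sequence number */
   };

   static void exp_wait_wake_sketch(struct work_struct *wp)
   {
           struct exp_gp_request_sketch *rp =
                   container_of(wp, struct exp_gp_request_sketch, work);

           drive_expedited_gp(rp->s);
           wake_exp_waiters(rp->s);
   }

   static void request_exp_gp_sketch(unsigned long s)
   {
           struct exp_gp_request_sketch req = { .s = s };

           INIT_WORK_ONSTACK(&req.work, exp_wait_wake_sketch);
           schedule_work(&req.work);       /* a workqueue kthread drives the GP */
           /* The requester then sleeps on the appropriate ->exp_wq[] entry. */
   }

An on-stack work item suffices in this sketch because the requesting
task does not return until it has been awakened, by which time the
workqueue handler is finished with the item.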
Stall Warnings
~~~~~~~~~~~~~~

Expediting grace periods does nothing to speed things up when RCU
readers take too long, and therefore expedited grace periods check for
stalls just as normal grace periods do.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But why not just let the normal grace-period machinery detect the     |
| stalls, given that a given reader must block both normal and          |
| expedited grace periods?                                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because it is quite possible that at a given time there is no normal  |
| grace period in progress, in which case the normal grace period       |
| cannot emit a stall warning.                                          |
+-----------------------------------------------------------------------+

The ``synchronize_sched_expedited_wait()`` function loops waiting for
the expedited grace period to end, but with a timeout set to the current
RCU CPU stall-warning time. If this time is exceeded, any CPUs or
``rcu_node`` structures blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.

Mid-boot operation
~~~~~~~~~~~~~~~~~~

The use of workqueues has the advantage that the expedited grace-period
code need not worry about POSIX signals. Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until they are
initialized, which does not happen until some time after the scheduler
spawns the first task. Given that there are parts of the kernel that
really do want to execute grace periods during this mid-boot “dead
zone”, expedited grace periods must do something else during this time.

What they do is to fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case before
the use of workqueues. However, the requesting task is only required to
drive the grace period during the mid-boot dead zone. Before mid-boot, a
synchronous grace period is a no-op. Some time after mid-boot,
workqueues are used.
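This fallback can be sketched as follows.  The ``rcu_scheduler_active``
variable and the ``RCU_SCHEDULER_INIT`` value are real kernel symbols,
and ``exp_wait_wake_sketch()`` is the illustrative handler from the
workqueue sketch above::

   static void request_exp_gp_midboot_sketch(struct work_struct *wp)
   {
           /*
            * Before mid-boot, synchronous grace periods are no-ops and
            * never reach this point.
            */
           if (rcu_scheduler_active == RCU_SCHEDULER_INIT) {
                   /* Mid-boot dead zone: the requesting task drives the GP itself. */
                   exp_wait_wake_sketch(wp);
           } else {
                   /* Workqueues are available: hand the grace period off as usual. */
                   schedule_work(wp);
           }
   }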
Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot. This is handled by causing non-expedited grace
periods to take the expedited code path during mid-boot.

The current code assumes that there are no POSIX signals during the
mid-boot dead zone. However, if an overwhelming need for POSIX signals
somehow arises, appropriate adjustments can be made to the expedited
stall-warning code. One such adjustment would reinstate the
pre-workqueue stall-warning checks, but only during the mid-boot dead
zone.

With this refinement, synchronous grace periods can now be used from
task context pretty much any time during the life of the kernel, aside
from some points in the suspend, hibernate, or shutdown code paths.

Summary
~~~~~~~

Expedited grace periods use a sequence-number approach to promote
batching, so that a single grace-period operation can serve numerous
requests. A funnel lock is used to efficiently identify the one task out
of a concurrent group that will request the grace period. All members of
the group will block on waitqueues provided in the ``rcu_node``
structure. The actual grace-period processing is carried out by a
workqueue.

CPU-hotplug operations are noted lazily in order to prevent the need for
tight synchronization between expedited grace periods and CPU-hotplug
operations. The dyntick-idle counters are used to avoid sending IPIs to
idle CPUs, at least in the common case. RCU-preempt and RCU-sched use
different IPI handlers and different code to respond to the state
changes carried out by those handlers, but otherwise use common code.

Quiescent states are tracked using the ``rcu_node`` tree, and once all
necessary quiescent states have been reported, all tasks waiting on this
expedited grace period are awakened. A pair of mutexes are used to allow
one grace period's wakeups to proceed concurrently with the next grace
period's processing.

This combination of mechanisms allows expedited grace periods to run
reasonably efficiently. However, for non-time-critical tasks, normal
grace periods should be used instead because their longer duration
permits much higher degrees of batching, and thus much lower per-request
overheads.