Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side
thought of as an informal, high-level specification for RCU. It is

#. `Fundamental Non-Requirements`_
#. `Quality-of-Implementation Requirements`_
#. `Software-Engineering Requirements`_

Fundamental Requirements
------------------------

#. `Grace-Period Guarantee`_
#. `Memory-Barrier Guarantees`_
#. `Guaranteed Read-to-Write Upgrade`_
Grace-Period Guarantee
~~~~~~~~~~~~~~~~~~~~~~

RCU's grace-period guarantee is unusual in being premeditated: Jack
RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections. An RCU read-side
RCU treats a nested set as one big RCU read-side critical section.
Production-quality implementations of ``rcu_read_lock()`` and
Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| progress concurrently with readers, but pre-existing readers will |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| Second, even when using ``synchronize_rcu()``, the other update-side |
| code does run concurrently with readers, whether pre-existing or not. |
+-----------------------------------------------------------------------+
The RCU read-side critical section in ``do_something_dlm()`` works with
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
In order to avoid fatal problems such as deadlocks, an RCU read-side
Similarly, an RCU read-side critical section must not contain anything
Although RCU's grace-period guarantee is useful in and of itself, with
be good to be able to use RCU to coordinate read-side access to linked
data structures. For this, the grace-period guarantee is not sufficient,
non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
5 return -ENOMEM;
11 p->a = a;
12 p->b = b;
5 return -ENOMEM;
12 p->a = a;
13 p->b = b;
executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
which brings us to the publish-subscribe guarantee discussed in the next
RCU's publish-subscribe guarantee allows data to be inserted into a
5 return -ENOMEM;
11 p->a = a;
12 p->b = b;
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
+-----------------------------------------------------------------------+
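A corrected insertion can be sketched as follows. This is a userspace sketch, not the kernel listing: the ``rcu_assign_pointer()`` stand-in is a plain assignment here, whereas the real macro also emits the ordering directives (a release store) that keep the compiler and CPU from reordering the initialization past the publication.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in only; the kernel macro additionally orders prior stores. */
#define rcu_assign_pointer(p, v) ((p) = (v))

struct foo {
	int a;
	int b;
};

static struct foo *gp;

/*
 * Both fields are initialized *before* the structure is published, so a
 * concurrent reader that sees a non-NULL gp is guaranteed to see fully
 * initialized ->a and ->b fields.
 */
static int add_gp(int a, int b)
{
	struct foo *p = malloc(sizeof(*p));

	if (!p)
		return 0;	/* The kernel version returns -ENOMEM. */
	p->a = a;
	p->b = b;
	rcu_assign_pointer(gp, p);
	return 1;
}
```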
control its accesses to the RCU-protected data, as shown in
6 do_something(p->a, p->b);
5 do_something(gp->a, gp->b);
the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could
6 do_something(p->a, p->b);
barriers in the Linux kernel. Should a `high-quality implementation of
outermost RCU read-side critical section containing that
Of course, it is also necessary to remove elements from RCU-protected
#. Wait for all pre-existing RCU read-side critical sections to complete
   (because only pre-existing readers can possibly have a reference to
read-side critical section or in a code segment where the pointer
update-side lock.
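The removal sequence can be sketched in the style of the document's ``remove_gp_synchronous()`` example. This is a single-threaded model with stubbed primitives (with no concurrent readers, a grace period completes immediately and the update-side lock has nothing to exclude); the real kernel primitives are far more elaborate.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative single-threaded stubs. */
static void spin_lock(int *lock)   { *lock = 1; }
static void spin_unlock(int *lock) { *lock = 0; }
static void synchronize_rcu(void)  { }
#define rcu_dereference_protected(p, c) (p)
#define rcu_assign_pointer(p, v)        ((p) = (v))

struct foo { int a; };

static struct foo *gp;
static int gp_lock;

/* Unlink, wait for pre-existing readers, and only then reclaim. */
static int remove_gp_synchronous(void)
{
	struct foo *p;

	spin_lock(&gp_lock);
	p = rcu_dereference_protected(gp, gp_lock);
	if (!p) {
		spin_unlock(&gp_lock);
		return 0;
	}
	rcu_assign_pointer(gp, NULL);	/* 1. Make the element unreachable. */
	spin_unlock(&gp_lock);
	synchronize_rcu();		/* 2. Wait for pre-existing readers. */
	free(p);			/* 3. No reader can now hold a reference. */
	return 1;
}
```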
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| in a byte-at-a-time manner, resulting in *load tearing*, in turn |
| resulting in a bytewise mash-up of two distinct pointer values. It |
| might even use value-speculation optimizations, where it makes a wrong |
| about any dereferences that returned pre-initialization garbage in |
+-----------------------------------------------------------------------+
In short, RCU's publish-subscribe guarantee is provided by the
guarantee allows data elements to be safely added to RCU-protected
can be used in combination with the grace-period guarantee to also allow
data elements to be removed from RCU-protected linked data structures,
resembling the dependency-ordering barrier that was later subsumed
late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company. It took the Alpha architects a
Memory-Barrier Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~

The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
#. Each CPU that has an RCU read-side critical section that begins
   memory barrier between the time that the RCU read-side critical
   this guarantee, a pre-existing RCU read-side critical section might
#. Each CPU that has an RCU read-side critical section that ends after
   time that the RCU read-side critical section begins. Without this
   guarantee, a later RCU read-side critical section running after the
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| Given that multiple CPUs can start RCU read-side critical sections at |
| whether or not a given RCU read-side critical section starts before a |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| If RCU cannot tell whether or not a given RCU read-side critical |
| it must assume that the RCU read-side critical section started first. |
| waiting on a given RCU read-side critical section only if it can |
| within the enclosed RCU read-side critical section to the code |
| then a given RCU read-side critical section begins before a given |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| #. CPU 1: ``do_something_with(q->a);`` |
| end of the RCU read-side critical section and the end of the grace |
| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
| grace period and the beginning of the RCU read-side critical section, |
| believing that you have adhered to the as-if rule than it is to |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| compiler might arbitrarily rearrange consecutive RCU read-side |
| critical sections. Given such rearrangement, if a given RCU read-side |
| read-side critical sections are done? Won't the compiler |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| ``schedule()`` had better prevent calling-code accesses to shared |
| RCU detects the end of a given RCU read-side critical section, it |
| will necessarily detect the end of all prior RCU read-side critical |
| loop, into user-mode code, and so on. But if your kernel build allows |
+-----------------------------------------------------------------------+
Note that these memory-barrier requirements do not replace the
pre-existing readers. On the contrary, the memory barriers called out in
The common-case RCU primitives are unconditional. They are invoked, they
guarantee was reverse-engineered, not premeditated. The unconditional
Guaranteed Read-to-Write Upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

within an RCU read-side critical section. For example, that RCU
read-side critical section might search for a given data element, and
then might acquire the update-side spinlock in order to update that
element, all while remaining in that RCU read-side critical section. Of
course, it is necessary to exit the RCU read-side critical section
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| But how does the upgrade-to-write operation exclude other readers? |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
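The upgrade pattern described above can be sketched as follows. This is a single-threaded model with stubbed primitives, and the helper name ``lookup_and_update()`` is hypothetical: the read-side markers keep the element from being freed, while the update-side lock excludes other updaters (not other readers).

```c
#include <assert.h>

/* Illustrative single-threaded stubs; kernel primitives differ greatly. */
static void rcu_read_lock(void)   { }
static void rcu_read_unlock(void) { }
static void spin_lock(int *l)     { *l = 1; }
static void spin_unlock(int *l)   { *l = 0; }
#define rcu_dereference(p) (p)

struct foo { int a; };

static struct foo the_foo;
static struct foo *gp = &the_foo;
static int gp_lock;

/*
 * Look up the element under RCU protection, then upgrade to a write by
 * acquiring the update-side lock, all without leaving the read-side
 * critical section.
 */
static int lookup_and_update(int new_a)
{
	struct foo *p;
	int updated = 0;

	rcu_read_lock();
	p = rcu_dereference(gp);
	if (p) {
		spin_lock(&gp_lock);	/* Upgrade to write. */
		p->a = new_a;
		updated = 1;
		spin_unlock(&gp_lock);
	}
	rcu_read_unlock();
	return updated;
}
```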
This guarantee allows lookup code to be shared between read-side and
update-side code, and was premeditated, appearing in the earliest

Fundamental Non-Requirements
----------------------------

RCU provides extremely lightweight readers, and its read-side
non-guarantees that have caused confusion. Except where otherwise noted,
these non-guarantees were premeditated.

#. `Grace Periods Don't Partition Read-Side Critical Sections`_
#. `Read-Side Critical Sections Don't Partition Grace Periods`_
Reader-side markers such as ``rcu_read_lock()`` and
through their interaction with the grace-period APIs such as
significant ordering constraints would slow down these fast-path APIs.
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| completed instead of waiting only on pre-existing readers. For how |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
Grace Periods Don't Partition Read-Side Critical Sections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is tempting to assume that if any part of one RCU read-side critical
read-side critical section follows that same grace period, then all of
the first RCU read-side critical section must precede all of the second.
partition the set of RCU read-side critical sections. An example of this
that the thread cannot be in the midst of an RCU read-side critical

.. kernel-figure:: GPpartitionReaders1.svg

If it is necessary to partition RCU read-side critical sections in this
the two RCU read-side critical sections cannot overlap, guaranteeing
This non-requirement was also non-premeditated, but became apparent when
Read-Side Critical Sections Don't Partition Grace Periods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also tempting to assume that if an RCU read-side critical section

.. kernel-figure:: ReadersPartitionGP1.svg

Again, an RCU read-side critical section can overlap almost all of a
period. As a result, an RCU read-side critical section cannot partition
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| read-side critical section, would be required to partition the RCU |
| read-side critical sections at the beginning and end of the chain? |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
Parallelism Facts of Life
-------------------------

completely futile. This is most obvious in preemptible user-level
hypervisor), but can also happen in bare-metal environments due to
but where “extremely long” is not long enough to allow wrap-around
when incrementing a 64-bit counter.
matters, RCU must use compiler directives and memory-barrier
writes and more-frequent concurrent writes will result in more
#. Counters are finite, especially on 32-bit systems. RCU's use of
   dyntick-idle nesting counter allows 54 bits for interrupt nesting
   level (this counter is 64 bits even on a 32-bit system). Overflowing
   this counter requires 2\ :sup:`54` half-interrupts on a given CPU
   without that CPU ever going idle. If a half-interrupt happened every
kernel in a single shared-memory environment. RCU must therefore pay
close attention to high-end scalability.
Quality-of-Implementation Requirements
--------------------------------------

These sections list quality-of-implementation requirements. Although an
inappropriate for industrial-strength production use. Classes of
quality-of-implementation requirements are as follows:
RCU is and always has been intended primarily for read-mostly
situations, which means that RCU's read-side primitives are optimized,
often at the expense of its update-side primitives. Experience thus far
#. Read-mostly data, where stale and inconsistent data is not a problem:
#. Read-mostly data, where data must be consistent: RCU works well.
#. Read-write data, where data must be consistent: RCU *might* work OK.
#. Write-mostly data, where data must be consistent: RCU is very
   a. Existence guarantees for update-friendly mechanisms.
   b. Wait-free read-side primitives for real-time use.
This focus on read-mostly situations means that RCU must interoperate
primitives be legal within RCU read-side critical sections, including
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| These are forbidden within Linux-kernel RCU read-side critical |
| case, voluntary context switch) within an RCU read-side critical |
| read-side critical sections, and also within Linux-kernel sleepable |
| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In |
| addition, the -rt patchset turns spinlocks into sleeping locks so |
| locks!) may be acquired within -rt-Linux-kernel RCU read-side critical |
| Note that it *is* legal for a normal RCU read-side critical section |
+-----------------------------------------------------------------------+
inconsistency must be tolerated due to speed-of-light delays if nothing
For example, the translation from a user-level SystemV semaphore ID
to the corresponding in-kernel data structure is protected by RCU, but
by acquiring spinlocks located in the in-kernel data structure from
within the RCU read-side critical section, and this is indicated by the
Linux-kernel RCU implementations must therefore avoid unnecessarily
energy efficiency in battery-powered systems and on specific
energy-efficiency shortcomings of the Linux-kernel RCU implementation.
In my experience, the battery-powered embedded community will consider
mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
footprint is critically important on single-CPU systems with
non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
was born. Josh Triplett has since taken over the small-memory banner
unsurprising. For example, in keeping with RCU's read-side
Similarly, in non-preemptible environments, ``rcu_read_lock()`` and
In preemptible environments, in the case where the RCU read-side
highest-priority real-time process), ``rcu_read_lock()`` and
should not contain atomic read-modify-write operations, memory-barrier
branches. However, in the case where the RCU read-side critical section
interrupts. This is why it is better to nest an RCU read-side critical
section within a preempt-disable region than vice versa, at least in
degrading real-time latencies.
The ``synchronize_rcu()`` grace-period-wait primitive is optimized for
addition to the duration of the longest RCU read-side critical section.
they can be satisfied by a single underlying grace-period-wait
single grace-period-wait operation to serve more than `1,000 separate
… <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-s…
of ``synchronize_rcu()``, thus amortizing the per-invocation overhead
down to nearly zero. However, the grace-period optimization is also
required to avoid measurable degradation of real-time scheduling and
In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are
used instead, reducing the grace-period latency down to a few tens of
microseconds on small systems, at least in cases where the RCU read-side
to impose modest degradation of real-time latency on non-idle online
scheduling-clock interrupt.
``synchronize_rcu_expedited()``'s reduced grace-period latency is
25 call_rcu(&p->rh, remove_gp_cb);
lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()``
be legal, including within preempt-disable code, ``local_bh_disable()``
code, interrupt-disable code, and interrupt handlers. However, even
takes too long. Long-running operations should be relegated to separate
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| Presumably the ``->gp_lock`` acquired on line 18 excludes any |
| after ``->gp_lock`` is released on line 25, which in turn means that |
+-----------------------------------------------------------------------+
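The ``call_rcu()`` callback pattern can be modeled in a few lines. This is a userspace sketch: ``call_rcu()`` queues the callback, and the hypothetical ``rcu_process_callbacks()`` helper stands in for the kernel invoking callbacks once a grace period has elapsed. The callback recovers the enclosing structure from its embedded ``rcu_head``, as ``remove_gp_cb()`` does in the document's listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal rcu_head and a queue-based call_rcu() stand-in. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *rhp);
};

static struct rcu_head *cb_list;

static void call_rcu(struct rcu_head *rhp, void (*func)(struct rcu_head *))
{
	rhp->func = func;
	rhp->next = cb_list;
	cb_list = rhp;
}

/* Invoke all queued callbacks after the (here trivial) grace period. */
static void rcu_process_callbacks(void)
{
	while (cb_list) {
		struct rcu_head *rhp = cb_list;

		cb_list = rhp->next;
		rhp->func(rhp);
	}
}

struct foo {
	int a;
	struct rcu_head rh;
};

static int frees;	/* Counts reclamations for the test below. */

/* Map the rcu_head back to the enclosing structure, then free it. */
static void remove_gp_cb(struct rcu_head *rhp)
{
	struct foo *p = (struct foo *)((char *)rhp - offsetof(struct foo, rh));

	free(p);
	frees++;
}
```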
open-coded it.
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| definition would say that updates in garbage-collected languages |
+-----------------------------------------------------------------------+
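When the callback would only have invoked ``kfree()``, the ``kfree_rcu()`` shorthand removes the need for a hand-written callback. A minimal single-threaded model, assuming the real macro's behavior of deferring the free until after a grace period (here, with no readers in flight, freeing immediately is safe); the helper name ``remove_foo()`` is hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

struct rcu_head { int dummy; };

static int kfrees;	/* Counts reclamations for the test below. */

/*
 * Stand-in for kfree_rcu(ptr, field): the kernel version posts an RCU
 * callback that kfree()s the enclosing structure after a grace period,
 * using the offset of the named rcu_head field.
 */
#define kfree_rcu(ptr, field)		\
	do {				\
		(void)&(ptr)->field;	\
		free(ptr);		\
		kfrees++;		\
	} while (0)

struct foo {
	int a;
	struct rcu_head rh;
};

/* Usage: unlink the structure, then hand it to kfree_rcu(). */
static void remove_foo(struct foo **gpp)
{
	struct foo *p = *gpp;

	*gpp = NULL;
	if (p)
		kfree_rcu(p, rh);
}
```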
be carried out in the meantime? The polling-style
In theory, delaying grace-period completion and callback invocation is
example, an infinite loop in an RCU read-side critical section must by
involved example, consider a 64-CPU system built with
``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
corresponding CPU's next scheduling-clock interrupt.
indefinitely in the kernel without scheduling-clock interrupts, which
has been preempted within an RCU read-side critical section is
caution when changing them. Note that these forward-progress measures
invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
#. Lifts callback-execution batch limits, which speeds up callback
overridden. Again, these forward-progress measures are provided only for
RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then
additional forward-progress work will be required.
part due to the collision of multicore hardware with object-oriented
techniques designed in single-threaded environments for single-threaded
use. And in theory, RCU read-side critical sections may be composed, and
real-world implementations of composable constructs, there are
``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when
the nesting-depth counter. For example, the Linux kernel's preemptible
practical purposes. That said, a consecutive pair of RCU read-side
grace period cannot be enclosed in another RCU read-side critical
within an RCU read-side critical section: To do so would result either
in deadlock or in RCU implicitly splitting the enclosing RCU read-side
critical section, neither of which is conducive to a long-lived and
example, many transactional-memory implementations prohibit composing a
a network receive operation). For another example, lock-based critical
In short, although RCU read-side critical sections are highly
read-side critical sections, perhaps even so intense that there was
read-side critical section in flight. RCU cannot allow this situation to
block grace periods: As long as all the RCU read-side critical sections
RCU read-side critical sections being preempted for long durations,
which has the effect of creating a long-duration RCU read-side critical
systems using real-time priorities are of course more vulnerable.
Other workloads might have very high update rates. Although one can
RCU callbacks in the ``call_rcu()`` code path. Finally, high update
rates should not delay RCU read-side critical sections, although some
small read-side delays can occur when using
1990s, a simple user-level test consisting of ``close(open(path))`` in a
appreciation of the high-update-rate corner case. This test also
motivated addition of some RCU code to react to high update rates, for
grace-period processing. This evasive action causes the grace period to
Software-Engineering Requirements
---------------------------------

splat if ``rcu_dereference()`` is used outside of an RCU read-side
critical section. Update-side code can use
that it has been invoked within an RCU read-side critical section. I
#. A given function might wish to check for RCU-related preconditions
assignment. To catch this sort of error, a given RCU-protected
complain about simple-assignment accesses to that pointer. Arnd
non-stack ``rcu_head`` structures must be initialized with
#. An infinite loop in an RCU read-side critical section will eventually
waiting on that particular RCU read-side critical section.
stall warnings are counter-productive during sysrq dumps and during
read-side critical sections, there is currently no good way of doing
#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
error-prone. Therefore, RCU-protected `linked
more recently, RCU-protected `hash
other special-purpose RCU-protected data structures are available in
This is not a hard-and-fast list: RCU's diagnostic capabilities will
real-world RCU usage.
Linux Kernel Complications
--------------------------

#. `Scheduling-Clock Interrupts and RCU`_
notable Linux-kernel complications. Each of the following sections
`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.c…
used to do, it would create too many per-CPU kthreads. Although the
``task_struct`` is available and the boot CPU's per-CPU variables are
set up. The read-side primitives (``rcu_read_lock()``,
itself is a quiescent state and thus a grace period, so the early-boot
implementation can be a no-op.
reason is that an RCU read-side critical section might be preempted,
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| grace-period mechanism. At runtime, this expedited mechanism relies |
| drives the desired expedited grace period. Because dead-zone |
+-----------------------------------------------------------------------+
I learned of these boot-time requirements as a result of a series of
The Linux kernel has interrupts, and RCU read-side critical sections are
legal within interrupt handlers and within interrupt-disabled regions of
Some Linux-kernel architectures can enter an interrupt handler from
non-idle process context, and then just never leave it, instead
“half-interrupts” mean that RCU has to be very careful about how it
way during a rewrite of RCU's dyntick-idle code.
The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
update-side primitives, including ``call_rcu()``, are prohibited within
The name notwithstanding, some Linux-kernel architectures can have
me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
one of its functions results in a segmentation fault. The module-unload
functions must therefore cancel any delayed calls to loadable-module
module unload request, we need some other way to deal with in-flight RCU
in-flight RCU callbacks have been invoked. If a module uses
the underlying module-unload code could invoke ``rcu_barrier()``
filesystem-unmount situation, and Dipankar Sarma incorporated
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| complete, and ``rcu_barrier()`` must wait for each pre-existing |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| pre-existing callbacks, you will need to invoke both |
+-----------------------------------------------------------------------+
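The module-unload use of ``rcu_barrier()`` can be modeled as follows. This is a single-threaded sketch with a queue-based stand-in: ``call_rcu()`` queues a callback, and ``rcu_barrier()`` waits until every previously queued callback has been invoked. Note that ``rcu_barrier()`` waits for callback *invocation*, not merely for a grace period; the helper name ``module_exit_handler()`` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal rcu_head and queue-based stand-ins for the kernel primitives. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *rhp);
};

static struct rcu_head *cb_list;

static void call_rcu(struct rcu_head *rhp, void (*func)(struct rcu_head *))
{
	rhp->func = func;
	rhp->next = cb_list;
	cb_list = rhp;
}

/* Wait until all previously posted callbacks have been invoked. */
static void rcu_barrier(void)
{
	while (cb_list) {
		struct rcu_head *rhp = cb_list;

		cb_list = rhp->next;
		rhp->func(rhp);
	}
}

static int callbacks_outstanding;

static void module_cb(struct rcu_head *rhp)
{
	(void)rhp;
	callbacks_outstanding--;
}

/*
 * Module-unload pattern: once no further call_rcu() invocations are
 * possible, rcu_barrier() guarantees that no callback still references
 * the module's code, so the module text may then be unmapped.
 */
static void module_exit_handler(void)
{
	/* 1. Prevent new callbacks from being posted (not shown). */
	rcu_barrier();	/* 2. Wait for all in-flight callbacks. */
	/* 3. Now safe to unload the module. */
}
```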
offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
The Linux-kernel CPU-hotplug implementation has notifiers that are used
appropriately to a given CPU-hotplug operation. Most RCU operations may
be invoked from CPU-hotplug notifiers, including even synchronous
grace-period operations such as ``synchronize_rcu()`` and
However, all-callback-wait operations such as ``rcu_barrier()`` are also
not supported, due to the fact that there are phases of CPU-hotplug
after the CPU-hotplug operation ends, which could also result in
deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
invoked from a CPU-hotplug notifier.
RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
RCU's violation of it when running context-switch-heavy workloads when
context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
scheduler's runqueue or priority-inheritance spinlocks across an
somewhere within the corresponding RCU read-side critical section.
nesting. The fact that interrupt-disabled regions of code act as RCU
read-side critical sections implicitly avoids earlier issues that used
The kernel needs to access user-space memory, for example, to access data
referenced by system-call parameters. The ``get_user()`` macro does this job.
However, user-space memory might well be paged out, which means that
``get_user()`` might well page-fault and thus block while waiting for the
reorder a ``get_user()`` invocation into an RCU read-side critical section.
3 v = p->value;
4 v = p->value;
state in the middle of an RCU read-side critical section. This misplaced
quiescent state could result in line 4 being a use-after-free access,
``p->value`` is not volatile, so the compiler would not have any reason to keep
Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
of RCU read-side critical sections.
by people with battery-powered embedded systems. RCU therefore conserves
energy-efficiency requirement, so I learned of this via an irate phone
RCU read-side critical section on an idle CPU. (Kernels built with
whether or not it is currently legal to run RCU read-side critical
hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.
order of a million deep, even on 32-bit systems, this should not be a
These energy-efficiency requirements have proven quite difficult to
clean-sheet rewrites of RCU's energy-efficiency code, the last of which
phone calls: Flaming me on the Linux-kernel mailing list was apparently
not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
Scheduling-Clock Interrupts and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel transitions between in-kernel non-idle execution, userspace
+-----------------+------------------+------------------+-----------------+
| ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
| | scheduling-clock | scheduling-clock | RCU's |
| | interrupt. | interrupt and | dyntick-idle |
+-----------------+------------------+------------------+-----------------+
| | scheduling-clock | scheduling-clock | RCU's |
| | interrupt. | interrupt and | dyntick-idle |
+-----------------+------------------+------------------+-----------------+
| | on | dyntick-idle | dyntick-idle |
| | scheduling-clock | detection. | detection. |
+-----------------+------------------+------------------+-----------------+
2133 +-----------------------------------------------------------------------+
2135 +-----------------------------------------------------------------------+
2136 | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
2137 | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
2139 +-----------------------------------------------------------------------+
2141 +-----------------------------------------------------------------------+
2143 | necessarily re-enable the scheduling-clock interrupt on entry to each |
2145 +-----------------------------------------------------------------------+
However, RCU must be reliably informed as to whether any given CPU is
currently idle, and it also requires that the
scheduling-clock interrupt be enabled when RCU needs it to be:

#. If a CPU is either idle or executing in usermode, and RCU believes it
   is non-idle, the scheduling-clock tick had better be running.
   Otherwise, you will get RCU CPU stall warnings, or at best very long
   (11-second) grace periods, with a pointless IPI waking the CPU from
   time to time.
#. If a CPU is in a portion of the kernel that executes RCU read-side
   critical sections, and RCU believes this CPU to be idle, you will get
   random memory corruption. **DON'T DO THIS!!!**
#. If a CPU is in a portion of the kernel that is absolutely positively
   no-joking guaranteed to never execute any RCU read-side critical
   sections, and RCU believes this CPU to be idle, no problem. This
   sort of thing is used by some architectures for light-weight
   exception handlers, which can then avoid the overhead of informing
   RCU at exception entry and exit. Just make very sure to run some of
   your tests with ``CONFIG_PROVE_RCU=y``, just in case one of your
   code paths was in fact joking about not doing RCU read-side critical
   sections.
#. If a CPU is executing in the kernel with the scheduling-clock
   interrupt disabled and RCU believes this CPU to be non-idle, and if
   the CPU goes idle (from an RCU perspective) every few jiffies, no
   problem. It is usually OK for there to be the occasional gap between
   idle periods, as long as the gap is not too long (a few jiffies
   is usually OK) and the scheduling-clock interrupt is enabled, of
   course.
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what if a long-running interrupt handler delays the               |
| scheduling-clock interrupt that RCU is relying on?                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Interrupt handlers are expected to complete quickly, so this should   |
| not happen often. But given that long-running interrupt handlers can  |
| cause other problems, not least for response time, such handlers      |
| should be kept short in any case.                                     |
+-----------------------------------------------------------------------+
But as long as RCU is properly informed of kernel state transitions
between in-kernel execution, usermode execution, and idle, and as long
as the scheduling-clock interrupt is enabled when RCU needs it to be,
you can rest assured that the bugs you encounter will be in some other
part of RCU or some other part of the kernel!
Memory Efficiency
~~~~~~~~~~~~~~~~~

Although small-memory non-realtime systems can simply use Tiny RCU, code
size is only one aspect of memory efficiency. Another aspect is the size
of the ``rcu_head`` structure used by ``call_rcu()`` and
``kfree_rcu()``. Although this structure contains nothing more than a
pair of pointers, it does appear in many RCU-protected data structures,
including some that are size critical.

This need for memory efficiency is one reason that RCU uses hand-crafted
singly linked lists to track the ``rcu_head`` structures that it
manages. It is also why ``rcu_head`` structures do not contain debug
information, for example, fields recording the file and line of the
``call_rcu()`` or ``kfree_rcu()`` that
posted them. Although this information might appear in debug-only kernel
builds at some point, in the meantime, the ``->func`` field will often
provide the needed debug information.
In some cases, the need for memory efficiency leads to even more extreme
measures. For example, in order to correctly resolve `races between TLB
flushes and memory-reclaim
conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.inte…>`__,
the Linux kernel's memory-management subsystem needs a particular bit to
remain zero during all phases of grace-period processing, and that bit
happens to map to the bottom bit of the ``rcu_head`` structure's
``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
used to post the callback, as opposed to ``kfree_rcu()`` or some future
“lazy” variant of ``call_rcu()`` that might one day be created for
energy-efficiency purposes.
That said, there are limits. RCU requires that the ``rcu_head``
structure be aligned to a two-byte boundary, and passing a misaligned
``rcu_head`` structure to one of the ``call_rcu()`` family of functions
will result in a splat. It is therefore necessary to exercise caution
when packing structures containing fields of type ``rcu_head``. Why not
a four-byte or even eight-byte alignment requirement? Because the m68k
architecture provides only two-byte alignment, and thus acts as
alignment's least common denominator.
A “lazy” variant of ``call_rcu()`` could
potentially have energy-efficiency benefits, but only if the rate of
non-lazy callbacks decreases significantly for some important workload.
Performance, Scalability, Response Time, and Reliability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expanding on the earlier discussion, RCU is used heavily by
hot code paths in performance-critical portions of the Linux kernel's
networking, security, virtualization, and scheduling code. RCU must
therefore use efficient implementations, especially for its
read-side primitives. To that end, it would be good if preemptible RCU's
implementation of ``rcu_read_lock()`` could be inlined.

The Linux kernel is used for a wide variety of workloads, so RCU must
provide scalable implementations with
minimal per-operation overhead. In fact, in many cases, increasing load
must *decrease* the per-operation overhead, witness the batching
optimizations for ``synchronize_rcu()``, ``call_rcu()``,
``synchronize_rcu_expedited()``, and ``rcu_barrier()``.
The Linux kernel is used for real-time workloads, especially in
conjunction with the -rt patchset. The
real-time-latency response requirements are such that the traditional
approach of disabling preemption across RCU read-side critical sections
is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
an RCU implementation that allows RCU read-side critical sections to be
preempted. This requirement made its presence known after users made it
clear that an earlier real-time patch did not meet their needs, in
conjunction with some RCU issues
encountered by a very early version of the -rt patchset.
In addition, RCU must make do with a sub-100-microsecond real-time
latency budget. In fact, on smaller systems with the -rt patchset, the
Linux kernel provides sub-20-microsecond real-time latencies for the
whole kernel, including RCU. And, perhaps to your
surprise, the sub-100-microsecond real-time latency budget applies to
the whole system,
up to and including systems with 4096 CPUs. This real-time requirement
motivated the grace-period kthread, which also simplified handling of a
number of race conditions.
RCU must avoid degrading real-time response for CPU-bound threads,
whether executing in usermode (which is one use case for
``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
the kernel must invoke ``cond_resched()`` periodically in order to avoid
degrading latencies and to provide RCU with quiescent states.
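For example, a CPU-bound kernel loop might be sketched as follows, where
``nchunks`` and ``do_chunk_of_work()`` are hypothetical names::

	int i;

	for (i = 0; i < nchunks; i++) {
		do_chunk_of_work(i);  /* CPU-bound work item */
		cond_resched();       /* supply a voluntary quiescent state */
	}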
Finally, RCU's status as a synchronization primitive means that any RCU
failure can result in arbitrary memory corruption that can be extremely
difficult to debug. This means that RCU must be extremely reliable,
which in practice also means that RCU must have an aggressive
stress-test suite. This stress-test suite is called ``rcutorture``.
Although the need for ``rcutorture`` was no surprise, the immense
popularity of the Linux kernel poses interesting validation challenges:
there are well over one billion instances of the Linux kernel running
today, given Android smartphones, Linux-powered televisions, and
servers. Furthermore, in most
jurisdictions, a successful multi-year test of a given mechanism, which
might include a Linux kernel, suffices for a number of types of
safety-critical certifications. In fact, rumor has it that the Linux
kernel is already being used in production for safety-critical
applications.
Other RCU Flavors
-----------------

One of the more surprising things about RCU is that there are now no
fewer than five flavors, or API families. In addition, the primary
flavor that has been the sole focus up to this point has two different
implementations, non-preemptible and preemptible. The other four flavors
are listed below, with requirements for each described in a separate
section.
#. `Bottom-Half Flavor (Historical)`_
#. `Sched Flavor (Historical)`_
#. `Sleepable RCU`_
#. `Tasks RCU`_
Bottom-Half Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The RCU-bh flavor of RCU has since been expressed in terms of the other
RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
softirq and to be accounted for by lockdep. Much of the material in this
section is therefore strictly historical in nature.
The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
flavor of RCU that could withstand the network-based denial-of-service
attacks researched by Robert Olsson. These attacks placed so much
networking load on the system that some of the CPUs never exited softirq
execution, which in turn prevented those CPUs from ever executing a
context switch, which, in the RCU implementation of that time, prevented
grace periods from ever ending. The result was an out-of-memory
condition and a system hang.
The solution was the creation of RCU-bh, which does
``local_bh_disable()`` across its read-side critical sections, and which
uses the transition from one type of softirq processing to another as a
quiescent state in addition to context switch, idle, user mode, and
offline. This means that RCU-bh grace periods can complete even when
some of the CPUs execute in softirq indefinitely, thus allowing
algorithms based on RCU-bh to withstand network-based denial-of-service
attacks.
Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
re-enable softirq handlers, any attempt to start a softirq handler
during the RCU-bh read-side critical section will be deferred. In this
case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
take considerable time. One can of course argue that this softirq
overhead should be associated with the code following the RCU-bh
read-side critical section rather than ``rcu_read_unlock_bh()``, but the
fact is that most profiling tools cannot be expected to make this sort
of fine distinction. For example, suppose that a three-millisecond-long
RCU-bh read-side critical section executes during a time of heavy
networking load. There will very likely be an attempt to invoke at least
one softirq handler during that three milliseconds, but any such
invocation will be delayed until the time of the
``rcu_read_unlock_bh()``, which can make it appear that
``rcu_read_unlock_bh()`` itself is slow.
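As a sketch, an RCU-bh reader looks like an ordinary RCU reader with the
``_bh`` suffix, where ``gp`` and ``do_something_with()`` are
hypothetical names::

	rcu_read_lock_bh();           /* also disables softirq handlers */
	p = rcu_dereference_bh(gp);
	do_something_with(p);
	rcu_read_unlock_bh();         /* may run deferred softirq handlers */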
The `RCU-bh
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``, and
``rcu_read_lock_bh_held()``. However, the update-side APIs are now
simple wrappers for other RCU flavors, namely RCU-sched in
CONFIG_PREEMPT=n kernels and RCU-preempt otherwise.
Sched Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~

The RCU-sched flavor of RCU has since been expressed in terms of the
other RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
preemption and to be accounted for by lockdep. Much of the material in
this section is therefore strictly historical in nature.
Before preemptible RCU, waiting for an RCU grace period had the side
effect of also waiting for all pre-existing interrupt and NMI handlers.
However, there are legitimate preemptible-RCU implementations that do
not have this property, given that any point in the code outside of an
RCU read-side critical section can be a quiescent state. Therefore,
*RCU-sched* was created, which follows “classic” RCU in that an
RCU-sched grace period waits for pre-existing interrupt and NMI
handlers. In kernels built with CONFIG_PREEMPT=n, the RCU and
RCU-sched APIs have identical implementations, while kernels built with
CONFIG_PREEMPT=y provide a separate implementation for each.
Note well that in CONFIG_PREEMPT=y kernels, ``rcu_read_lock_sched()``
and ``rcu_read_unlock_sched()`` disable and
re-enable preemption, respectively. This means that if there was a
preemption attempt during the RCU-sched read-side critical section,
``rcu_read_unlock_sched()`` will enter the scheduler, with all the
latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
this can make it look as if ``rcu_read_unlock_sched()`` was executing
very slowly. However, the highest-priority task won't be preempted, so
that task will enjoy low-overhead ``rcu_read_unlock_sched()``
invocations.
The `RCU-sched
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, and
``rcu_read_lock_sched_held()``. However, anything that disables
preemption also marks an RCU-sched read-side critical section, including
``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
``local_irq_restore()``, and so on.
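For example, the following two readers are equivalent from RCU-sched's
point of view, where ``gp`` and ``do_something_with()`` are hypothetical
names::

	/* Explicit RCU-sched reader. */
	rcu_read_lock_sched();
	p = rcu_dereference_sched(gp);
	do_something_with(p);
	rcu_read_unlock_sched();

	/* Disabling preemption marks the same sort of critical section. */
	preempt_disable();
	p = rcu_dereference_sched(gp);
	do_something_with(p);
	preempt_enable();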
Sleepable RCU
~~~~~~~~~~~~~

For well over a decade, someone saying “I need to block within an RCU
read-side critical section” was a reliable indication that this someone
did not understand RCU. After all, if you are always blocking in an RCU
read-side critical section, you can probably afford to use a
higher-overhead synchronization mechanism. However, that changed with
the advent of the Linux kernel's notifiers, whose RCU read-side critical
sections almost never sleep, but sometimes need to. This resulted in the
introduction of *sleepable RCU*, or *SRCU*.
SRCU allows different domains to be defined, with each such domain
defined by an instance of an ``srcu_struct`` structure. That said, one
consequence of these domains is that read-side code must pass a “cookie”
from ``srcu_read_lock()`` to ``srcu_read_unlock()``.
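For example, assuming an ``srcu_struct`` named ``ss`` and a hypothetical
``do_something()``, a reader passes the cookie as follows::

	int idx;

	idx = srcu_read_lock(&ss);
	do_something();
	srcu_read_unlock(&ss, idx);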
As noted above, it is legal to block within SRCU read-side critical
sections, however, with great power comes great responsibility. If you
block forever in one of a given domain's SRCU read-side critical
sections, then that domain's grace periods will also be blocked forever.
One good way to block forever is to deadlock, which can
happen if any operation in a given domain's SRCU read-side critical
section can wait, either directly or indirectly, for that domain's grace
period to elapse. Invoking ``synchronize_srcu()`` from within a
same-domain read-side critical section results in a self-deadlock.
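For example, assuming a domain ``ss`` and a hypothetical
``do_something()``, the following sketch self-deadlocks because
``synchronize_srcu()`` waits for the enclosing reader, which cannot
complete until ``synchronize_srcu()`` returns::

	int idx;

	idx = srcu_read_lock(&ss);
	do_something();
	synchronize_srcu(&ss);  /* deadlock: waits on the enclosing reader */
	srcu_read_unlock(&ss, idx);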
Deadlock can also occur at a distance. If an SRCU read-side critical
section in domain ``ss`` acquired a mutex that was held across an
``ss``-domain ``synchronize_srcu()``, deadlock would still be possible.
Furthermore, if that critical section instead acquired a mutex held
across a ``synchronize_srcu()`` for some other domain ``ss1``, and if an
``ss1``-domain SRCU read-side critical section acquired another mutex
that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock
would again be possible.
Unlike the other RCU flavors, SRCU read-side critical sections can run
on idle and even offline CPUs. This ability requires that
``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
which means that SRCU readers run somewhat slower than RCU readers.
Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
periods make use of timers, and timers may be temporarily “stranded” on
an outgoing CPU until late in
the CPU-hotplug process. The problem is that if a notifier is waiting on
an SRCU grace period, that grace period may be waiting on a stranded
timer, so that the notifier never wakes up, in other words, deadlock.
This same reasoning also prohibits ``srcu_barrier()`` from being invoked
from CPU-hotplug notifiers.
SRCU also differs from other RCU flavors in that SRCU's expedited and
non-expedited grace periods are implemented by the same mechanism. This
means that expediting a future grace
period has the side effect of expediting all prior grace periods that
have not yet completed.
As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
locking bottleneck present in prior kernel versions.
The `SRCU
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
``srcu_dereference()``, ``synchronize_srcu()``,
``synchronize_srcu_expedited()``, ``call_srcu()``, and
``srcu_barrier()``.
Tasks RCU
~~~~~~~~~

Some forms of tracing use “trampolines” to handle the binary rewriting
required to install different types of probes. It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU. However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as ``rcu_read_lock()`` and ``rcu_read_unlock()``: otherwise, the
placement of each
trampoline would be pre-ordained a surprisingly long time before execution
reached it. The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
critical sections that are delimited by voluntary context switches, such
as calls to ``schedule()``. In addition, transitions to and from
userspace execution also delimit tasks-RCU read-side critical sections.
The tasks-RCU API is quite compact, consisting only of
``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
``rcu_barrier_tasks()``.
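A tracing subsystem might therefore retire an old trampoline along these
lines, where ``old_trampoline`` and ``free_trampoline()`` are
hypothetical names::

	/* Remove all means of reaching the old trampoline, then: */
	synchronize_rcu_tasks();  /* wait for every task to pass through a
	                           * voluntary context switch or a usermode
	                           * transition */
	free_trampoline(old_trampoline);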
Possible Future Changes
-----------------------
One of the tricks that RCU uses to attain update-side scalability is to
increase grace-period latency with increasing numbers of CPUs. If this
becomes a serious problem, it will be necessary to rework the
grace-period state machine so as to avoid the need for the additional
latency.

RCU disables CPU hotplug in a few places, perhaps most notably in the
``rcu_barrier()`` operations. If there is a strong reason to use
``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
avoid disabling CPU hotplug. This would introduce some complexity, so
there had better be a very good reason.
The tradeoff between grace-period latency on the one hand and
interruptions of other CPUs on the other hand may need to be
re-examined. The desire is of course for zero grace-period latency as
well as zero interprocessor interrupts undertaken during an expedited
grace-period operation. While this ideal is unlikely to be achievable,
it is quite possible that further improvements can be made.

The combining tree used by multiprocessor RCU implementations does not
spread its memory across NUMA nodes, nor does it align its CPU groups
with hardware boundaries such as sockets or cores. Such spreading and
alignment is currently believed to
be unnecessary because the hotpath read-side primitives do not access
the combining tree, nor does ``call_rcu()`` in the common case. Any
argument to the contrary should be backed by a
carefully run and realistic system-level workload.
Additional work may be required to provide reasonable forward-progress
guarantees under heavy load for grace periods and for callback
invocation.
Summary
-------

This document has presented more than two decades' worth of RCU
requirements. Given that the requirements keep changing, this will not
be the last word on this subject, but at least it serves to catalog the
current set.
Acknowledgments
---------------