.. SPDX-License-Identifier: GPL-2.0

================================
Review Checklist for RCU Patches
================================


This document contains a checklist for producing and reviewing patches
that make use of RCU. Violating any of the rules listed below will
result in the same sorts of problems that leaving out a locking primitive
would cause. This list is based on experiences reviewing such patches
over a rather long period of time, but improvements are always welcome!

0.	Is RCU being applied to a read-mostly situation? If the data
	structure is updated more than about 10% of the time, then you
	should strongly consider some other approach, unless detailed
	performance measurements show that RCU is nonetheless the right
	tool for the job. Yes, RCU does reduce read-side overhead by
	increasing write-side overhead, which is exactly why normal uses
	of RCU will do much more reading than updating.

	Another exception is where performance is not an issue, and RCU
	provides a simpler implementation. An example of this situation
	is the dynamic NMI code in the Linux 2.6 kernel, at least on
	architectures where NMIs are rare.

	Yet another exception is where the low real-time latency of RCU's
	read-side primitives is critically important.

	One final exception is where RCU readers are used to prevent
	the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
	for lockless updates. This does result in the mildly
	counter-intuitive situation where rcu_read_lock() and
	rcu_read_unlock() are used to protect updates. However, this
	approach provides the same potential simplifications that garbage
	collectors do.

1.	Does the update code have proper mutual exclusion?

	RCU does allow -readers- to run (almost) naked, but -writers- must
	still use some sort of mutual exclusion, such as:

	a.	locking,
	b.	atomic operations, or
	c.	restricting updates to a single task.

	If you choose #b, be prepared to describe how you have handled
	memory barriers on weakly ordered machines (pretty much all of
	them -- even x86 allows later loads to be reordered to precede
	earlier stores), and be prepared to explain why this added
	complexity is worthwhile. If you choose #c, be prepared to
	explain how this single task does not become a major bottleneck on
	large multiprocessor machines (for example, if the task is updating
	information relating to itself that other tasks can read, there
	by definition can be no bottleneck). Note that the definition
	of "large" has changed significantly: Eight CPUs was "large"
	in the year 2000, but a hundred CPUs was unremarkable in 2017.

2.	Do the RCU read-side critical sections make proper use of
	rcu_read_lock() and friends? These primitives are needed
	to prevent grace periods from ending prematurely, which
	could result in data being unceremoniously freed out from
	under your read-side code, which can greatly increase the
	actuarial risk of your kernel.

	As a rough rule of thumb, any dereference of an RCU-protected
	pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
	rcu_read_lock_sched(), or by the appropriate update-side lock.
	Disabling of preemption can serve as rcu_read_lock_sched(), but
	is less readable and prevents lockdep from detecting locking issues.

	Letting RCU-protected pointers "leak" out of an RCU read-side
	critical section is every bit as bad as letting them leak out
	from under a lock. Unless, of course, you have arranged some
	other means of protection, such as a lock or a reference count
	-before- letting them out of the RCU read-side critical section.

3.	Does the update code tolerate concurrent accesses?

	The whole point of RCU is to permit readers to run without
	any locks or atomic operations. This means that readers will
	be running while updates are in progress. There are a number
	of ways to handle this concurrency, depending on the situation:

	a.	Use the RCU variants of the list and hlist update
		primitives to add, remove, and replace elements on
		an RCU-protected list.
		Alternatively, use the other
		RCU-protected data structures that have been added to
		the Linux kernel.

		This is almost always the best approach.

	b.	Proceed as in (a) above, but also maintain per-element
		locks (that are acquired by both readers and writers)
		that guard per-element state. Of course, fields that
		the readers refrain from accessing can be guarded by
		some other lock acquired only by updaters, if desired.

		This works quite well, also.

	c.	Make updates appear atomic to readers. For example,
		pointer updates to properly aligned fields will
		appear atomic, as will individual atomic primitives.
		Sequences of operations performed under a lock will -not-
		appear to be atomic to RCU readers, nor will sequences
		of multiple atomic primitives.

		This can work, but is starting to get a bit tricky.

	d.	Carefully order the updates and the reads so that
		readers see valid data at all phases of the update.
		This is often more difficult than it sounds, especially
		given modern CPUs' tendency to reorder memory references.
		One must usually liberally sprinkle memory barriers
		(smp_wmb(), smp_rmb(), smp_mb()) through the code,
		making it difficult to understand and to test.

		It is usually better to group the changing data into
		a separate structure, so that the change may be made
		to appear atomic by updating a pointer to reference
		a new structure containing updated values.

4.	Weakly ordered CPUs pose special challenges. Almost all CPUs
	are weakly ordered -- even x86 CPUs allow later loads to be
	reordered to precede earlier stores. RCU code must take all of
	the following measures to prevent memory-corruption problems:

	a.	Readers must maintain proper ordering of their memory
		accesses. The rcu_dereference() primitive ensures that
		the CPU picks up the pointer before it picks up the data
		that the pointer points to. This really is necessary
		on Alpha CPUs. If you don't believe me, see:

		http://www.openvms.compaq.com/wizard/wiz_2637.html

		The rcu_dereference() primitive is also an excellent
		documentation aid, letting the person reading the
		code know exactly which pointers are protected by RCU.
		Please note that compilers can also reorder code, and
		they are becoming increasingly aggressive about doing
		just that. The rcu_dereference() primitive therefore also
		prevents destructive compiler optimizations. However,
		with a bit of devious creativity, it is possible to
		mishandle the return value from rcu_dereference().
		Please see rcu_dereference.txt in this directory for
		more information.
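
		As a minimal userspace sketch of the pointer-then-data
		ordering that rcu_dereference() provides (this is not
		kernel code), the following uses C11 atomics as stand-ins
		for the kernel primitives. The names
		toy_rcu_assign_pointer(), toy_rcu_dereference(),
		struct foo, publish_foo(), and read_foo_sum() are
		hypothetical, and an acquire load is assumed here where
		the kernel's rcu_dereference() relies on dependency
		ordering instead:

		```c
		#include <assert.h>
		#include <stdatomic.h>
		#include <stddef.h>
		#include <stdlib.h>

		/* Hypothetical RCU-protected structure, for illustration only. */
		struct foo {
			int a;
			int b;
		};

		/* Userspace stand-ins (assumptions), NOT the kernel implementations. */
		#define toy_rcu_assign_pointer(p, v) \
			atomic_store_explicit(&(p), (v), memory_order_release)
		#define toy_rcu_dereference(p) \
			atomic_load_explicit(&(p), memory_order_acquire)

		static _Atomic(struct foo *) gp;	/* the published pointer */

		/* Updater: fully initialize the structure, and only then publish it. */
		static int publish_foo(int a, int b)
		{
			struct foo *p = malloc(sizeof(*p));

			if (!p)
				return -1;
			p->a = a;
			p->b = b;
			/* Any previously published structure is leaked in this sketch. */
			toy_rcu_assign_pointer(gp, p);
			return 0;
		}

		/*
		 * Reader: fetch the pointer exactly once, then use only that
		 * snapshot.  In the kernel, this would run between
		 * rcu_read_lock() and rcu_read_unlock().
		 */
		static int read_foo_sum(void)
		{
			struct foo *p = toy_rcu_dereference(gp);

			if (!p)
				return 0;
			return p->a + p->b;
		}
		```

		The reader loads the pointer once and works only from
		that snapshot, so it sees either no structure at all or
		a fully initialized one.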

		The rcu_dereference() primitive is used by the
		various "_rcu()" list-traversal primitives, such
		as list_for_each_entry_rcu(). Note that it is
		perfectly legal (if redundant) for update-side code to
		use rcu_dereference() and the "_rcu()" list-traversal
		primitives. This is particularly useful in code that
		is common to readers and updaters. However, lockdep
		will complain if you use rcu_dereference() outside
		of an RCU read-side critical section. See lockdep.txt
		to learn what to do about this.

		Of course, neither rcu_dereference() nor the "_rcu()"
		list-traversal primitives can substitute for a good
		concurrency design coordinating among multiple updaters.

	b.	If the list macros are being used, the list_add_tail_rcu()
		and list_add_rcu() primitives must be used in order
		to prevent weakly ordered machines from misordering
		structure initialization and pointer planting.
		Similarly, if the hlist macros are being used, the
		hlist_add_head_rcu() primitive is required.

	c.	If the list macros are being used, the list_del_rcu()
		primitive must be used to keep list_del()'s pointer
		poisoning from inflicting toxic effects on concurrent
		readers. Similarly, if the hlist macros are being used,
		the hlist_del_rcu() primitive is required.
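
		To make the hazard concrete, here is a single-threaded
		userspace sketch of a removal that leaves the victim's
		forward pointer intact, so that a reader already
		referencing the victim can still reach the rest of
		the list. This is not the kernel's list implementation:
		struct node, toy_del_rcu(), toy_next(), toy_count(), and
		toy_demo() are hypothetical names, and C11 atomics stand
		in for the kernel's ordering primitives:

		```c
		#include <assert.h>
		#include <stdatomic.h>
		#include <stddef.h>

		/* Hypothetical list element; not the kernel's struct list_head. */
		struct node {
			int val;
			_Atomic(struct node *) next;
		};

		/*
		 * RCU-style removal (sketch): unlink victim from its predecessor
		 * but leave victim->next alone.  The caller must still wait for
		 * a grace period before freeing or reusing victim.
		 */
		static void toy_del_rcu(struct node *prev, struct node *victim)
		{
			struct node *next = atomic_load_explicit(&victim->next,
								 memory_order_relaxed);

			atomic_store_explicit(&prev->next, next, memory_order_release);
		}

		/* Reader step: advance to the next element, as a traversal would. */
		static struct node *toy_next(struct node *pos)
		{
			return atomic_load_explicit(&pos->next, memory_order_acquire);
		}

		/* Count the elements reachable from head, head included. */
		static int toy_count(struct node *head)
		{
			int n = 0;

			for (struct node *p = head; p; p = toy_next(p))
				n++;
			return n;
		}

		/* Demo: a reader parked on the victim survives the removal. */
		static int toy_demo(void)
		{
			struct node a, b, c;
			struct node *reader;

			a.val = 1;
			b.val = 2;
			c.val = 3;
			atomic_init(&c.next, NULL);
			atomic_init(&b.next, &c);
			atomic_init(&a.next, &b);

			reader = toy_next(&a);	/* the reader now references b */
			toy_del_rcu(&a, &b);	/* the updater unlinks b */

			/* New readers see a -> c; the old reader still steps b -> c. */
			return toy_count(&a) == 2 && toy_next(reader) == &c;
		}
		```

		Had the removal instead overwritten the victim's next
		pointer with a poison value, the parked reader's next
		step would have followed that poison value.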

		The list_replace_rcu() and hlist_replace_rcu() primitives
		may be used to replace an old structure with a new one
		in their respective types of RCU-protected lists.

	d.	Rules similar to (4b) and (4c) apply to the "hlist_nulls"
		type of RCU-protected linked lists.

	e.	Updates must ensure that initialization of a given
		structure happens before pointers to that structure are
		publicized. Use the rcu_assign_pointer() primitive
		when publicizing a pointer to a structure that can
		be traversed by an RCU read-side critical section.

5.	If call_rcu() or call_srcu() is used, the callback function will
	be called from softirq context. In particular, it cannot block.

6.	Since synchronize_rcu() can block, it cannot be called
	from any sort of irq context. The same rule applies
	for synchronize_srcu(), synchronize_rcu_expedited(), and
	synchronize_srcu_expedited().

	The expedited forms of these primitives have the same semantics
	as the non-expedited forms, but expediting is both expensive and
	(with the exception of synchronize_srcu_expedited()) unfriendly
	to real-time workloads. Use of the expedited primitives should
	be restricted to rare configuration-change operations that would
	not normally be undertaken while a real-time workload is running.
	However, real-time workloads can use the rcupdate.rcu_normal kernel
	boot parameter to completely disable expedited grace periods,
	though this might have performance implications.

	In particular, if you find yourself invoking one of the expedited
	primitives repeatedly in a loop, please do everyone a favor:
	Restructure your code so that it batches the updates, allowing
	a single non-expedited primitive to cover the entire batch.
	This will very likely be faster than the loop containing the
	expedited primitive, and will be much easier on the rest
	of the system, especially on any real-time workloads running
	there.

7.	As of v4.20, a given kernel implements only one RCU flavor,
	which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y.
	If the updater uses call_rcu() or synchronize_rcu(),
	then the corresponding readers may use rcu_read_lock() and
	rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(),
	or any pair of primitives that disables and re-enables preemption,
	for example, rcu_read_lock_sched() and rcu_read_unlock_sched().
	If the updater uses synchronize_srcu() or call_srcu(),
	then the corresponding readers must use srcu_read_lock() and
	srcu_read_unlock(), and with the same srcu_struct. The rules for
	the expedited primitives are the same as for their non-expedited
	counterparts.
	Mixing things up will result in confusion and
	broken kernels, and has even resulted in an exploitable security
	issue.

	One exception to this rule: rcu_read_lock() and rcu_read_unlock()
	may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
	in cases where local bottom halves are already known to be
	disabled, for example, in irq or softirq context. Commenting
	such cases is a must, of course! And the jury is still out on
	whether the increased speed is worth it.

8.	Although synchronize_rcu() is slower than is call_rcu(), it
	usually results in simpler code. So, unless update performance is
	critically important, the updaters cannot block, or the latency of
	synchronize_rcu() is visible from userspace, synchronize_rcu()
	should be used in preference to call_rcu(). Furthermore,
	kfree_rcu() usually results in even simpler code than does
	synchronize_rcu() without synchronize_rcu()'s multi-millisecond
	latency. So please take advantage of kfree_rcu()'s "fire and
	forget" memory-freeing capabilities where they apply.

	An especially important property of the synchronize_rcu()
	primitive is that it automatically self-limits: if grace periods
	are delayed for whatever reason, then the synchronize_rcu()
	primitive will correspondingly delay updates.
	In contrast,
	code using call_rcu() should explicitly limit update rate in
	cases where grace periods are delayed, as failing to do so can
	result in excessive realtime latencies or even OOM conditions.

	Ways of gaining this self-limiting property when using call_rcu()
	include:

	a.	Keeping a count of the number of data-structure elements
		used by the RCU-protected data structure, including
		those waiting for a grace period to elapse. Enforce a
		limit on this number, stalling updates as needed to allow
		previously deferred frees to complete. Alternatively,
		limit only the number awaiting deferred free rather than
		the total number of elements.

		One way to stall the updates is to acquire the update-side
		mutex. (Don't try this with a spinlock -- other CPUs
		spinning on the lock could prevent the grace period
		from ever ending.) Another way to stall the updates
		is for the updates to use a wrapper function around
		the memory allocator, so that this wrapper function
		simulates OOM when there is too much memory awaiting an
		RCU grace period. There are of course many other
		variations on this theme.

	b.	Limiting update rate. For example, if updates occur only
		once per hour, then no explicit rate limiting is
		required, unless your system is already badly broken.
		Older versions of the dcache subsystem take this approach,
		guarding updates with a global lock, limiting their rate.

	c.	Trusted update -- if updates can only be done manually by
		superuser or some other trusted user, then it might not
		be necessary to automatically limit them. The theory
		here is that superuser already has lots of ways to crash
		the machine.

	d.	Periodically invoke synchronize_rcu(), permitting a limited
		number of updates per grace period.

	The same cautions apply to call_srcu() and kfree_rcu().

	Note that although these primitives do take action to avoid memory
	exhaustion when any given CPU has too many callbacks, a determined
	user could still exhaust memory. This is especially the case
	if a system with a large number of CPUs has been configured to
	offload all of its RCU callbacks onto a single CPU, or if the
	system has relatively little free memory.

9.	All RCU list-traversal primitives, which include
	rcu_dereference(), list_for_each_entry_rcu(), and
	list_for_each_safe_rcu(), must be either within an RCU read-side
	critical section or must be protected by appropriate update-side
	locks.
	RCU read-side critical sections are delimited by
	rcu_read_lock() and rcu_read_unlock(), or by similar primitives
	such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
	case the matching rcu_dereference() primitive must be used in
	order to keep lockdep happy, in this case, rcu_dereference_bh().

	The reason that it is permissible to use RCU list-traversal
	primitives when the update-side lock is held is that doing so
	can be quite helpful in reducing code bloat when common code is
	shared between readers and updaters. Additional primitives
	are provided for this case, as discussed in lockdep.txt.

10.	Conversely, if you are in an RCU read-side critical section,
	and you don't hold the appropriate update-side lock, you -must-
	use the "_rcu()" variants of the list macros. Failing to do so
	will break Alpha, cause aggressive compilers to generate bad code,
	and confuse people trying to read your code.

11.	Any lock acquired by an RCU callback must be acquired elsewhere
	with softirq disabled, e.g., via spin_lock_irqsave(),
	spin_lock_bh(), etc. Failing to disable softirq on a given
	acquisition of that lock will result in deadlock as soon as
	the RCU softirq handler happens to run your RCU callback while
	interrupting that acquisition's critical section.

12.	RCU callbacks can be and are executed in parallel.
	In many cases,
	the callback code is simply a wrapper around kfree(), so that this
	is not an issue (or, more accurately, to the extent that it is
	an issue, the memory-allocator locking handles it). However,
	if the callbacks do manipulate a shared data structure, they
	must use whatever locking or other synchronization is required
	to safely access and/or modify that data structure.

	Do not assume that RCU callbacks will be executed on the same
	CPU that executed the corresponding call_rcu() or call_srcu().
	For example, if a given CPU goes offline while having an RCU
	callback pending, then that RCU callback will execute on some
	surviving CPU. (If this were not the case, a self-spawning RCU
	callback would prevent the victim CPU from ever going offline.)
	Furthermore, CPUs designated by rcu_nocbs= might well -always-
	have their RCU callbacks executed on some other CPUs. In fact,
	for some real-time workloads, this is the whole point of using
	the rcu_nocbs= kernel boot parameter.

13.	Unlike other forms of RCU, it -is- permissible to block in an
	SRCU read-side critical section (delimited by srcu_read_lock()
	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
	Please note that if you don't need to sleep in read-side critical
	sections, you should be using RCU rather than SRCU, because RCU
	is almost always faster and easier to use than is SRCU.

	Also unlike other forms of RCU, explicit initialization and
	cleanup are required either at build time via DEFINE_SRCU()
	or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
	and cleanup_srcu_struct(). These last two are passed a
	"struct srcu_struct" that defines the scope of a given
	SRCU domain. Once initialized, the srcu_struct is passed
	to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
	synchronize_srcu_expedited(), and call_srcu(). A given
	synchronize_srcu() waits only for SRCU read-side critical
	sections governed by srcu_read_lock() and srcu_read_unlock()
	calls that have been passed the same srcu_struct. This property
	is what makes sleeping read-side critical sections tolerable --
	a given subsystem delays only its own updates, not those of other
	subsystems using SRCU. Therefore, SRCU is less prone to OOM the
	system than RCU would be if RCU's read-side critical sections
	were permitted to sleep.

	The ability to sleep in read-side critical sections does not
	come for free. First, corresponding srcu_read_lock() and
	srcu_read_unlock() calls must be passed the same srcu_struct.
	Second, grace-period-detection overhead is amortized only
	over those updates sharing a given srcu_struct, rather than
	being globally amortized as it is for other forms of RCU.
	Therefore, SRCU should be used in preference to rw_semaphore
	only in extremely read-intensive situations, or in situations
	requiring SRCU's read-side deadlock immunity or low read-side
	realtime latency. You should also consider percpu_rw_semaphore
	when you need lightweight readers.

	SRCU's expedited primitive (synchronize_srcu_expedited())
	never sends IPIs to other CPUs, so it is easier on
	real-time workloads than is synchronize_rcu_expedited().

	Note that rcu_assign_pointer() relates to SRCU just as it does to
	other forms of RCU, but instead of rcu_dereference() you should
	use srcu_dereference() in order to avoid lockdep splats.

14.	The whole point of call_rcu(), synchronize_rcu(), and friends
	is to wait until all pre-existing readers have finished before
	carrying out some otherwise-destructive operation. It is
	therefore critically important to -first- remove any path
	that readers can follow that could be affected by the
	destructive operation, and -only- -then- invoke call_rcu(),
	synchronize_rcu(), or friends.

	Because these primitives only wait for pre-existing readers, it
	is the caller's responsibility to guarantee that any subsequent
	readers will execute safely.

15.	The various RCU read-side primitives do -not- necessarily contain
	memory barriers.
	You should therefore plan for the CPU
	and the compiler to freely reorder code into and out of RCU
	read-side critical sections. It is the responsibility of the
	RCU update-side primitives to deal with this.

	For SRCU readers, you can use smp_mb__after_srcu_read_unlock()
	immediately after an srcu_read_unlock() to get a full barrier.

16.	Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
	__rcu sparse checks to validate your RCU code. These can help
	find problems as follows:

	CONFIG_PROVE_LOCKING:
		check that accesses to RCU-protected data
		structures are carried out under the proper RCU
		read-side critical section, while holding the right
		combination of locks, or whatever other conditions
		are appropriate.

	CONFIG_DEBUG_OBJECTS_RCU_HEAD:
		check that you don't pass the
		same object to call_rcu() (or friends) before an RCU
		grace period has elapsed since the last time that you
		passed that same object to call_rcu() (or friends).

	__rcu sparse checks:
		tag the pointer to the RCU-protected data
		structure with __rcu, and sparse will warn you if you
		access that pointer without the services of one of the
		variants of rcu_dereference().

	These debugging aids can help you find problems that are
	otherwise extremely difficult to spot.

17.	If you register a callback using call_rcu() or call_srcu(), and
	pass in a function defined within a loadable module, then it is
	necessary to wait for all pending callbacks to be invoked after
	the last invocation of call_rcu() or call_srcu() and before
	unloading that module. Note that
	it is absolutely -not- sufficient to wait for a grace period!
	The current (say) synchronize_rcu() implementation is -not-
	guaranteed to wait for callbacks registered on other CPUs.
	Or even on the current CPU if that CPU recently went offline
	and came back online.

	You instead need to use one of the barrier functions:

	-	call_rcu() -> rcu_barrier()
	-	call_srcu() -> srcu_barrier()

	However, these barrier functions are absolutely -not- guaranteed
	to wait for a grace period. In fact, if there are no call_rcu()
	callbacks waiting anywhere in the system, rcu_barrier() is within
	its rights to return immediately.

	So if you need to wait for both an RCU grace period and for
	all pre-existing call_rcu() callbacks, you will need to execute
	both rcu_barrier() and synchronize_rcu(), if necessary, using
	something like workqueues to execute them concurrently.

	See rcubarrier.txt for more information.
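
	The unload ordering in item 17 can be sketched with a toy,
	single-threaded userspace model. Nothing below is the kernel's
	RCU API: struct toy_head, toy_call_rcu(), toy_rcu_barrier(),
	toy_module_cb(), and toy_module_exit() are hypothetical names,
	and the toy barrier runs the queued callbacks itself, whereas
	the real rcu_barrier() waits for them to be invoked:

	```c
	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>

	/* Toy model only: hypothetical names, not the kernel's RCU API. */
	struct toy_head {
		struct toy_head *next;
		void (*func)(struct toy_head *head);
	};

	static struct toy_head *toy_pending;	/* callbacks not yet invoked */
	static bool toy_module_loaded = true;	/* is the callback code present? */
	static int toy_invoked;			/* callbacks run so far */

	/* Stand-in for call_rcu(): queue a callback for later invocation. */
	static void toy_call_rcu(struct toy_head *head,
				 void (*func)(struct toy_head *head))
	{
		head->func = func;
		head->next = toy_pending;
		toy_pending = head;
	}

	/*
	 * Stand-in for rcu_barrier(): return only once every queued
	 * callback has been invoked.  This toy simply runs them itself.
	 */
	static void toy_rcu_barrier(void)
	{
		while (toy_pending) {
			struct toy_head *head = toy_pending;

			toy_pending = head->next;
			head->func(head);
		}
	}

	/* A callback whose code lives in the "module". */
	static void toy_module_cb(struct toy_head *head)
	{
		(void)head;
		/* Running after unload would be a jump into freed text. */
		assert(toy_module_loaded);
		toy_invoked++;
	}

	/*
	 * Module exit: the caller has already stopped posting callbacks,
	 * so drain the pending ones, and only then "unload" the code.
	 */
	static int toy_module_exit(void)
	{
		toy_rcu_barrier();
		toy_module_loaded = false;
		return toy_invoked;
	}
	```

	Swapping the two statements in toy_module_exit() would let a
	pending callback run after its code was gone, which is the
	failure that the barrier functions exist to prevent.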