.. _whatisrcu_doc:

What is RCU? -- "Read, Copy, Update"
======================================

Please note that the "What is RCU?" LWN series is an excellent place
to start learning about RCU:

| 1. What is RCU, Fundamentally?   http://lwn.net/Articles/262464/
| 2. What is RCU? Part 2: Usage    http://lwn.net/Articles/263130/
| 3. RCU part 3: the RCU API       http://lwn.net/Articles/264090/
| 4. The RCU API, 2010 Edition     http://lwn.net/Articles/418853/
|    2010 Big API Table            http://lwn.net/Articles/419086/
| 5. The RCU API, 2014 Edition     http://lwn.net/Articles/609904/
|    2014 Big API Table            http://lwn.net/Articles/609973/


What is RCU?

RCU is a synchronization mechanism, added to the Linux kernel during
the 2.5 development effort, that is optimized for read-mostly
situations.  Although RCU is actually quite simple once you understand it,
getting there can sometimes be a challenge.  Part of the problem is that
most of the past descriptions of RCU have been written with the mistaken
assumption that there is "one true way" to describe RCU.  Instead,
the experience has been that different people must take different paths
to arrive at an understanding of RCU.  This document provides several
different paths, as follows:

:ref:`1. RCU OVERVIEW <1_whatisRCU>`

:ref:`2. WHAT IS RCU'S CORE API? <2_whatisRCU>`

:ref:`3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? <3_whatisRCU>`

:ref:`4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? <4_whatisRCU>`

:ref:`5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? <5_whatisRCU>`

:ref:`6. ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>`

:ref:`7. FULL LIST OF RCU APIs <7_whatisRCU>`

:ref:`8. ANSWERS TO QUICK QUIZZES <8_whatisRCU>`

People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
some point.  People who prefer to start with an API that they can then
experiment with should focus on Section 2.  People who prefer to start
with example uses should focus on Sections 3 and 4.  People who need to
understand the RCU implementation should focus on Section 5, then dive
into the kernel source code.  People who reason best by analogy should
focus on Section 6.  Section 7 serves as an index to the docbook API
documentation, and Section 8 is the traditional answer key.

So, start with the section that makes the most sense to you and your
preferred method of learning.
If you need to know everything about
everything, feel free to read the whole thing -- but if you are really
that type of person, you have perused the source code and will therefore
never need this document anyway.  ;-)

.. _1_whatisRCU:

1.  RCU OVERVIEW
------------------

The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases.  The removal phase removes references to data items
within a data structure (possibly by replacing them with references to
new versions of these data items), and can run concurrently with readers.
The reason that it is safe to run the removal phase concurrently with
readers is that the semantics of modern CPUs guarantee that readers will
see either the old or the new version of the data structure rather than a
partially updated reference.  The reclamation phase does the work of
reclaiming (e.g., freeing) the data items removed from the data structure
during the removal phase.  Because reclaiming data items can disrupt any
readers concurrently referencing those data items, the reclamation phase
must not start until readers no longer hold references to those data items.

Splitting the update into removal and reclamation phases permits the
updater to perform the removal phase immediately, and to defer the
reclamation phase until all readers active during the removal phase have
completed, either by blocking until they finish or by registering a
callback that is invoked after they finish.  Only readers that are active
during the removal phase need be considered, because any reader starting
after the removal phase will be unable to gain a reference to the removed
data items, and therefore cannot be disrupted by the reclamation phase.

So the typical RCU update sequence goes something like the following:

a.  Remove pointers to a data structure, so that subsequent
    readers cannot gain a reference to it.

b.  Wait for all previous readers to complete their RCU read-side
    critical sections.

c.  At this point, there cannot be any readers who hold references
    to the data structure, so it now may safely be reclaimed
    (e.g., kfree()d).

Step (b) above is the key idea underlying RCU's deferred destruction.
The ability to wait until all readers are done allows RCU readers to
use much lighter-weight synchronization, in some cases, absolutely no
synchronization at all.  In contrast, in more conventional lock-based
schemes, readers must use heavy-weight synchronization in order to
prevent an updater from deleting the data structure out from under them.
This is because lock-based updaters typically update data items in place,
and must therefore exclude readers.  In contrast, RCU-based updaters
typically take advantage of the fact that writes to single aligned
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
and replacement of data items in a linked structure without disrupting
readers.  Concurrent RCU readers can then continue accessing the old
versions, and can dispense with the atomic operations, memory barriers,
and communications cache misses that are so expensive on present-day
SMP computer systems, even in the absence of lock contention.
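
As a minimal sketch of these three steps (the lock mylock, the element
pointer p, and its ->list member are hypothetical names invented for this
illustration; Sections 3 and 6 work through complete examples), an updater
removing one element from an RCU-protected linked list might do the
following::

    spin_lock(&mylock);         /* Exclude other updaters. */
    list_del_rcu(&p->list);     /* (a) Remove readers' paths to the item. */
    spin_unlock(&mylock);
    synchronize_rcu();          /* (b) Wait for pre-existing readers. */
    kfree(p);                   /* (c) Reclaim the item. */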

In the three-step procedure shown above, the updater is performing both
the removal and the reclamation step, but it is often helpful for an
entirely different thread to do the reclamation, as is in fact the case
in the Linux kernel's directory-entry cache (dcache).  Even if the same
thread performs both the update step (step (a) above) and the reclamation
step (step (c) above), it is often helpful to think of them separately.
For example, RCU readers and updaters need not communicate at all,
but RCU provides implicit low-overhead communication between readers
and reclaimers, namely, in step (b) above.

So how the heck can a reclaimer tell when a reader is done, given
that readers are not doing any sort of synchronization operations???
Read on to learn about how RCU's API makes this easy.

.. _2_whatisRCU:

2.  WHAT IS RCU'S CORE API?
----------------------------

The core RCU API is quite small:

a.  rcu_read_lock()
b.  rcu_read_unlock()
c.  synchronize_rcu() / call_rcu()
d.  rcu_assign_pointer()
e.  rcu_dereference()

There are many other members of the RCU API, but the rest can be
expressed in terms of these five, though most implementations instead
express synchronize_rcu() in terms of the call_rcu() callback API.

The five core RCU APIs are described below; the other 18 will be
enumerated later.  See the kernel docbook documentation for more info,
or look directly at the function header comments.

rcu_read_lock()
^^^^^^^^^^^^^^^
    void rcu_read_lock(void);

    Used by a reader to inform the reclaimer that the reader is
    entering an RCU read-side critical section.  It is illegal
    to block while in an RCU read-side critical section, though
    kernels built with CONFIG_PREEMPT_RCU can preempt RCU
    read-side critical sections.  Any RCU-protected data structure
    accessed during an RCU read-side critical section is guaranteed to
    remain unreclaimed for the full duration of that critical section.
    Reference counts may be used in conjunction with RCU to maintain
    longer-term references to data structures.
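
    For example, a reader might combine RCU with a reference count in
    order to keep using a data item after exiting its critical section.
    This fragment is only a sketch: the pointer gbl_foo and the refcnt
    field are hypothetical names invented here rather than taken from
    the examples later in this document::

        rcu_read_lock();
        p = rcu_dereference(gbl_foo);
        if (p && !refcount_inc_not_zero(&p->refcnt))
            p = NULL;  /* Object already on its way out; don't use it. */
        rcu_read_unlock();

        /*
         * If p is non-NULL, the reference count now keeps it around
         * even though the RCU read-side critical section has ended.
         */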

rcu_read_unlock()
^^^^^^^^^^^^^^^^^
    void rcu_read_unlock(void);

    Used by a reader to inform the reclaimer that the reader is
    exiting an RCU read-side critical section.  Note that RCU
    read-side critical sections may be nested and/or overlapping.

synchronize_rcu()
^^^^^^^^^^^^^^^^^
    void synchronize_rcu(void);

    Marks the end of updater code and the beginning of reclaimer
    code.  It does this by blocking until all pre-existing RCU
    read-side critical sections on all CPUs have completed.
    Note that synchronize_rcu() will **not** necessarily wait for
    any subsequent RCU read-side critical sections to complete.
    For example, consider the following sequence of events::

        CPU 0                  CPU 1                 CPU 2
        ----------------- ------------------------- ---------------
        1.  rcu_read_lock()
        2.                    enters synchronize_rcu()
        3.                                               rcu_read_lock()
        4.  rcu_read_unlock()
        5.                     exits synchronize_rcu()
        6.                                              rcu_read_unlock()

    To reiterate, synchronize_rcu() waits only for ongoing RCU
    read-side critical sections to complete, not necessarily for
    any that begin after synchronize_rcu() is invoked.

    Of course, synchronize_rcu() does not necessarily return
    **immediately** after the last pre-existing RCU read-side critical
    section completes.  For one thing, there might well be scheduling
    delays.  For another thing, many RCU implementations process
    requests in batches in order to improve efficiencies, which can
    further delay synchronize_rcu().

    Since synchronize_rcu() is the API that must figure out when
    readers are done, its implementation is key to RCU.  For RCU
    to be useful in all but the most read-intensive situations,
    synchronize_rcu()'s overhead must also be quite small.

    The call_rcu() API is a callback form of synchronize_rcu(),
    and is described in more detail in a later section.  Instead of
    blocking, it registers a function and argument which are invoked
    after all ongoing RCU read-side critical sections have completed.
    This callback variant is particularly useful in situations where
    it is illegal to block or where update-side performance is
    critically important.

    However, the call_rcu() API should not be used lightly, as use
    of the synchronize_rcu() API generally results in simpler code.
    In addition, the synchronize_rcu() API has the nice property
    of automatically limiting update rate should grace periods
    be delayed.  This property results in system resilience in the
    face of denial-of-service attacks.  Code using call_rcu() should
    limit its update rate in order to gain this same sort of
    resilience.  See checklist.txt for some approaches to limiting
    the update rate.

rcu_assign_pointer()
^^^^^^^^^^^^^^^^^^^^
    void rcu_assign_pointer(p, typeof(p) v);

    Yes, rcu_assign_pointer() **is** implemented as a macro, though it
    would be cool to be able to declare a function in this manner.
    (Compiler experts will no doubt disagree.)

    The updater uses this function to assign a new value to an
    RCU-protected pointer, in order to safely communicate the change
    in value from the updater to the reader.  This macro does not
    evaluate to an rvalue, but it does execute any memory-barrier
    instructions required for a given CPU architecture.

    Perhaps just as important, it serves to document (1) which
    pointers are protected by RCU and (2) the point at which a
    given structure becomes accessible to other CPUs.  That said,
    rcu_assign_pointer() is most frequently used indirectly, via
    the _rcu list-manipulation primitives such as list_add_rcu().
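
    As a hypothetical illustration of direct use (the pointer gptr and
    the structure type struct foo are names invented for this sketch),
    publication of a newly allocated and initialized structure might
    look as follows::

        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        if (p) {
            p->a = 42;  /* Each field initialized before publication. */
            rcu_assign_pointer(gptr, p);
        }

    The ordering provided by rcu_assign_pointer() is what guarantees
    that a concurrent reader fetching gptr sees the initialized value
    of ->a.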

rcu_dereference()
^^^^^^^^^^^^^^^^^
    typeof(p) rcu_dereference(p);

    Like rcu_assign_pointer(), rcu_dereference() must be implemented
    as a macro.

    The reader uses rcu_dereference() to fetch an RCU-protected
    pointer, which returns a value that may then be safely
    dereferenced.  Note that rcu_dereference() does not actually
    dereference the pointer; instead, it protects the pointer for
    later dereferencing.  It also executes any needed memory-barrier
    instructions for a given CPU architecture.  Currently, only Alpha
    needs memory barriers within rcu_dereference() -- on other CPUs,
    it compiles to nothing, not even a compiler directive.

    Common coding practice uses rcu_dereference() to copy an
    RCU-protected pointer to a local variable, then dereferences
    this local variable, for example as follows::

        p = rcu_dereference(head.next);
        return p->data;

    However, in this case, one could just as easily combine these
    into one statement::

        return rcu_dereference(head.next)->data;

    If you are going to be fetching multiple fields from the
    RCU-protected structure, using the local variable is of
    course preferred.  Repeated rcu_dereference() calls look
    ugly, do not guarantee that the same pointer will be returned
    if an update happened while in the critical section, and incur
    unnecessary overhead on Alpha CPUs.

    Note that the value returned by rcu_dereference() is valid
    only within the enclosing RCU read-side critical section [1]_.
    For example, the following is **not** legal::

        rcu_read_lock();
        p = rcu_dereference(head.next);
        rcu_read_unlock();
        x = p->address; /* BUG!!! */
        rcu_read_lock();
        y = p->data;    /* BUG!!! */
        rcu_read_unlock();

    Holding a reference from one RCU read-side critical section
    to another is just as illegal as holding a reference from
    one lock-based critical section to another!  Similarly,
    using a reference outside of the critical section in which
    it was acquired is just as illegal as doing so with normal
    locking.

    As with rcu_assign_pointer(), an important function of
    rcu_dereference() is to document which pointers are protected by
    RCU, in particular, flagging a pointer that is subject to changing
    at any time, including immediately after the rcu_dereference().
    And, again like rcu_assign_pointer(), rcu_dereference() is
    typically used indirectly, via the _rcu list-manipulation
    primitives, such as list_for_each_entry_rcu() [2]_.

.. [1] The variant rcu_dereference_protected() can be used outside
       of an RCU read-side critical section as long as the usage is
       protected by locks acquired by the update-side code.  This variant
       avoids the lockdep warning that would happen when using (for
       example) rcu_dereference() without rcu_read_lock() protection.
       Using rcu_dereference_protected() also has the advantage
       of permitting compiler optimizations that rcu_dereference()
       must prohibit.  The rcu_dereference_protected() variant takes
       a lockdep expression to indicate which locks must be acquired
       by the caller.  If the indicated protection is not provided,
       a lockdep splat is emitted.  See Documentation/RCU/Design/Requirements/Requirements.rst
       and the API's code comments for more details and example usage.

.. [2] If the list_for_each_entry_rcu() instance might be used by
       update-side code as well as by RCU readers, then an additional
       lockdep expression can be added to its list of arguments.
       For example, given an additional "lock_is_held(&mylock)" argument,
       the RCU lockdep code would complain only if this instance was
       invoked outside of an RCU read-side critical section and without
       the protection of mylock.
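
As a hypothetical sketch of the dual-use case described in the footnote
above (the list mylist, its ->list member, and the lock mylock are names
invented only for this example), such a traversal might be written using
the lockdep_is_held() form of the condition::

    /* Legal either within rcu_read_lock() or with mylock held. */
    list_for_each_entry_rcu(p, &mylist, list, lockdep_is_held(&mylock))
        do_something_with(p);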

The following diagram shows how each API communicates among the
reader, updater, and reclaimer.
::


             rcu_assign_pointer()
                                +--------+
        +---------------------->| reader |---------+
        |                       +--------+         |
        |                           |              |
        |                           |              | Protect:
        |                           |              | rcu_read_lock()
        |                           |              | rcu_read_unlock()
        |        rcu_dereference()  |              |
        +---------+                 |              |
        | updater |<----------------+              |
        +---------+                                V
        |                                    +-----------+
        +----------------------------------->| reclaimer |
                                             +-----------+
          Defer:
          synchronize_rcu() & call_rcu()


The RCU infrastructure observes the time sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked.  Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.

There are at least three flavors of RCU usage in the Linux kernel.  The
diagram above shows the most common one.  On the updater side, the
rcu_assign_pointer(), synchronize_rcu() and call_rcu() primitives used
are the same for all three flavors.  However, for protection (on the
reader side), the primitives used vary depending on the flavor:

a.  rcu_read_lock() / rcu_read_unlock()
    rcu_dereference()

b.  rcu_read_lock_bh() / rcu_read_unlock_bh()
    local_bh_disable() / local_bh_enable()
    rcu_dereference_bh()

c.  rcu_read_lock_sched() / rcu_read_unlock_sched()
    preempt_disable() / preempt_enable()
    local_irq_save() / local_irq_restore()
    hardirq enter / hardirq exit
    NMI enter / NMI exit
    rcu_dereference_sched()

These three flavors are used as follows:

a.  RCU applied to normal data structures.

b.  RCU applied to networking data structures that may be subjected
    to remote denial-of-service attacks.

c.  RCU applied to scheduler and interrupt/NMI-handler tasks.

Again, most uses will be of (a).  The (b) and (c) cases are important
for specialized uses, but are relatively uncommon.
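
As a brief sketch of case (b), a reader in packet-processing code might
do something like the following.  (The pointer gbl_stats and its
rx_packets field are hypothetical names invented for this illustration.)
::

    rcu_read_lock_bh();
    p = rcu_dereference_bh(gbl_stats);
    if (p)
        count = p->rx_packets;
    rcu_read_unlock_bh();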

.. _3_whatisRCU:

3.  WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
------------------------------------------------

This section shows a simple use of the core RCU API to protect a
global pointer to a dynamically allocated structure.  More-typical
uses of RCU may be found in :ref:`listRCU.rst <list_rcu_doc>`,
:ref:`arrayRCU.rst <array_rcu_doc>`, and :ref:`NMI-RCU.rst <NMI_rcu_doc>`.
::

    struct foo {
        int a;
        char b;
        long c;
    };
    DEFINE_SPINLOCK(foo_mutex);

    struct foo __rcu *gbl_foo;

    /*
     * Create a new struct foo that is the same as the one currently
     * pointed to by gbl_foo, except that field "a" is replaced
     * with "new_a".  Points gbl_foo to the new structure, and
     * frees up the old structure after a grace period.
     *
     * Uses rcu_assign_pointer() to ensure that concurrent readers
     * see the initialized version of the new structure.
     *
     * Uses synchronize_rcu() to ensure that any readers that might
     * have references to the old structure complete before freeing
     * the old structure.
     */
    void foo_update_a(int new_a)
    {
        struct foo *new_fp;
        struct foo *old_fp;

        new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
        spin_lock(&foo_mutex);
        old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
        *new_fp = *old_fp;
        new_fp->a = new_a;
        rcu_assign_pointer(gbl_foo, new_fp);
        spin_unlock(&foo_mutex);
        synchronize_rcu();
        kfree(old_fp);
    }

    /*
     * Return the value of field "a" of the current gbl_foo
     * structure.
     * Use rcu_read_lock() and rcu_read_unlock()
     * to ensure that the structure does not get deleted out
     * from under us, and use rcu_dereference() to ensure that
     * we see the initialized version of the structure (important
     * for DEC Alpha and for people reading the code).
     */
    int foo_get_a(void)
    {
        int retval;

        rcu_read_lock();
        retval = rcu_dereference(gbl_foo)->a;
        rcu_read_unlock();
        return retval;
    }

So, to sum up:

-   Use rcu_read_lock() and rcu_read_unlock() to guard RCU
    read-side critical sections.

-   Within an RCU read-side critical section, use rcu_dereference()
    to dereference RCU-protected pointers.

-   Use some solid scheme (such as locks or semaphores) to
    keep concurrent updates from interfering with each other.

-   Use rcu_assign_pointer() to update an RCU-protected pointer.
    This primitive protects concurrent readers from the updater,
    **not** concurrent updates from each other!  You therefore still
    need to use locking (or something similar) to keep concurrent
    rcu_assign_pointer() primitives from interfering with each other.

-   Use synchronize_rcu() **after** removing a data element from an
    RCU-protected data structure, but **before** reclaiming/freeing
    the data element, in order to wait for the completion of all
    RCU read-side critical sections that might be referencing that
    data item.

See checklist.txt for additional rules to follow when using RCU.
And again, more-typical uses of RCU may be found in :ref:`listRCU.rst
<list_rcu_doc>`, :ref:`arrayRCU.rst <array_rcu_doc>`, and :ref:`NMI-RCU.rst
<NMI_rcu_doc>`.

.. _4_whatisRCU:

4.  WHAT IF MY UPDATING THREAD CANNOT BLOCK?
---------------------------------------------

In the example above, foo_update_a() blocks until a grace period elapses.
This is quite simple, but in some cases one cannot afford to wait so
long -- there might be other high-priority work to be done.

In such cases, one uses call_rcu() rather than synchronize_rcu().
The call_rcu() API is as follows::

    void call_rcu(struct rcu_head *head,
                  void (*func)(struct rcu_head *head));

This function invokes func(head) after a grace period has elapsed.
This invocation might happen from either softirq or process context,
so the function is not permitted to block.
The foo struct needs to
have an rcu_head structure added, perhaps as follows::

    struct foo {
        int a;
        char b;
        long c;
        struct rcu_head rcu;
    };

The foo_update_a() function might then be written as follows::

    /*
     * Create a new struct foo that is the same as the one currently
     * pointed to by gbl_foo, except that field "a" is replaced
     * with "new_a".  Points gbl_foo to the new structure, and
     * frees up the old structure after a grace period.
     *
     * Uses rcu_assign_pointer() to ensure that concurrent readers
     * see the initialized version of the new structure.
     *
     * Uses call_rcu() to ensure that any readers that might have
     * references to the old structure complete before freeing the
     * old structure.
     */
    void foo_update_a(int new_a)
    {
        struct foo *new_fp;
        struct foo *old_fp;

        new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
        spin_lock(&foo_mutex);
        old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
        *new_fp = *old_fp;
        new_fp->a = new_a;
        rcu_assign_pointer(gbl_foo, new_fp);
        spin_unlock(&foo_mutex);
        call_rcu(&old_fp->rcu, foo_reclaim);
    }

The foo_reclaim() function might appear as follows::

    void foo_reclaim(struct rcu_head *rp)
    {
        struct foo *fp = container_of(rp, struct foo, rcu);

        foo_cleanup(fp->a);

        kfree(fp);
    }

The container_of() primitive is a macro that, given a pointer into a
struct, the type of the struct, and the pointed-to field within the
struct, returns a pointer to the beginning of the struct.

The use of call_rcu() permits the caller of foo_update_a() to
immediately regain control, without needing to worry further about the
old version of the newly updated element.  It also clearly shows the
RCU distinction between updater, namely foo_update_a(), and reclaimer,
namely foo_reclaim().

The summary of advice is the same as for the previous section, except
that we are now using call_rcu() rather than synchronize_rcu():

-   Use call_rcu() **after** removing a data element from an
    RCU-protected data structure in order to register a callback
    function that will be invoked after the completion of all RCU
    read-side critical sections that might be referencing that
    data item.

If the callback for call_rcu() is not doing anything more than calling
kfree() on the structure, you can use kfree_rcu() instead of call_rcu()
to avoid having to write your own callback::

    kfree_rcu(old_fp, rcu);

Again, see checklist.txt for additional rules governing the use of RCU.

.. _5_whatisRCU:

5.  WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
-------------------------------------------------

One of the nice things about RCU is that it has extremely simple "toy"
implementations that are a good first step towards understanding the
production-quality implementations in the Linux kernel.  This section
presents two such "toy" implementations of RCU, one that is implemented
in terms of familiar locking primitives, and another that more closely
resembles "classic" RCU.  Both are way too simple for real-world use,
lacking both functionality and performance.  However, they are useful
in getting a feel for how RCU works.  See kernel/rcu/update.c for a
production-quality implementation, and see:

    http://www.rdrop.com/users/paulmck/RCU

for papers describing the Linux kernel RCU implementation.  The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
more details on the current implementation as of early 2004.


5A.  "TOY" IMPLEMENTATION #1: LOCKING
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section presents a "toy" RCU implementation that is based on
familiar locking primitives.  Its overhead makes it a non-starter for
real-life use, as does its lack of scalability.  It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
one read-side critical section to another.  It also assumes recursive
reader-writer locks:  If you try this with non-recursive locks, and
you allow nested rcu_read_lock() calls, you can deadlock.

However, it is probably the easiest implementation to relate to, so is
a good starting point.

It is extremely simple::

    static DEFINE_RWLOCK(rcu_gp_mutex);

    void rcu_read_lock(void)
    {
        read_lock(&rcu_gp_mutex);
    }

    void rcu_read_unlock(void)
    {
        read_unlock(&rcu_gp_mutex);
    }

    void synchronize_rcu(void)
    {
        write_lock(&rcu_gp_mutex);
        smp_mb__after_spinlock();
        write_unlock(&rcu_gp_mutex);
    }

[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
much.  But here are simplified versions anyway.  And whatever you do,
don't forget about them when submitting patches making use of RCU!]::

    #define rcu_assign_pointer(p, v) \
    ({ \
        smp_store_release(&(p), (v)); \
    })

    #define rcu_dereference(p) \
    ({ \
        typeof(p) _________p1 = READ_ONCE(p); \
        (_________p1); \
    })


The rcu_read_lock() and rcu_read_unlock() primitives read-acquire
and release a global reader-writer lock.  The synchronize_rcu()
primitive write-acquires this same lock, then releases it.  This means
that once synchronize_rcu() exits, all RCU read-side critical sections
that were in progress before synchronize_rcu() was called are guaranteed
to have completed -- there is no way that synchronize_rcu() would have
been able to write-acquire the lock otherwise.  The smp_mb__after_spinlock()
promotes synchronize_rcu() to a full memory barrier in compliance with
the "Memory-Barrier Guarantees" listed in:

    Documentation/RCU/Design/Requirements/Requirements.rst

It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired.  Note also that rcu_read_lock() is immune
from deadlock (an important property of RCU).  The reason for this is
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
so there can be no deadlock cycle.

.. _quiz_1:

Quick Quiz #1:
    Why is this argument naive?  How could a deadlock
    occur when using this algorithm in a real-world Linux
    kernel?  How could this deadlock be avoided?

:ref:`Answers to Quick Quiz <8_whatisRCU>`
"TOY" EXAMPLE #2: CLASSIC RCU 684*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 685*4882a593SmuzhiyunThis section presents a "toy" RCU implementation that is based on 686*4882a593Smuzhiyun"classic RCU". It is also short on performance (but only for updates) and 687*4882a593Smuzhiyunon features such as hotplug CPU and the ability to run in CONFIG_PREEMPT 688*4882a593Smuzhiyunkernels. The definitions of rcu_dereference() and rcu_assign_pointer() 689*4882a593Smuzhiyunare the same as those shown in the preceding section, so they are omitted. 690*4882a593Smuzhiyun:: 691*4882a593Smuzhiyun 692*4882a593Smuzhiyun void rcu_read_lock(void) { } 693*4882a593Smuzhiyun 694*4882a593Smuzhiyun void rcu_read_unlock(void) { } 695*4882a593Smuzhiyun 696*4882a593Smuzhiyun void synchronize_rcu(void) 697*4882a593Smuzhiyun { 698*4882a593Smuzhiyun int cpu; 699*4882a593Smuzhiyun 700*4882a593Smuzhiyun for_each_possible_cpu(cpu) 701*4882a593Smuzhiyun run_on(cpu); 702*4882a593Smuzhiyun } 703*4882a593Smuzhiyun 704*4882a593SmuzhiyunNote that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. 705*4882a593SmuzhiyunThis is the great strength of classic RCU in a non-preemptive kernel: 706*4882a593Smuzhiyunread-side overhead is precisely zero, at least on non-Alpha CPUs. 707*4882a593SmuzhiyunAnd there is absolutely no way that rcu_read_lock() can possibly 708*4882a593Smuzhiyunparticipate in a deadlock cycle! 709*4882a593Smuzhiyun 710*4882a593SmuzhiyunThe implementation of synchronize_rcu() simply schedules itself on each 711*4882a593SmuzhiyunCPU in turn. The run_on() primitive can be implemented straightforwardly 712*4882a593Smuzhiyunin terms of the sched_setaffinity() primitive. Of course, a somewhat less 713*4882a593Smuzhiyun"toy" implementation would restore the affinity upon completion rather 714*4882a593Smuzhiyunthan just leaving all tasks running on the last CPU, but when I said 715*4882a593Smuzhiyun"toy", I meant **toy**! 716*4882a593Smuzhiyun 717*4882a593SmuzhiyunSo how the heck is this supposed to work??? 718*4882a593Smuzhiyun 719*4882a593SmuzhiyunRemember that it is illegal to block while in an RCU read-side critical 720*4882a593Smuzhiyunsection. Therefore, if a given CPU executes a context switch, we know 721*4882a593Smuzhiyunthat it must have completed all preceding RCU read-side critical sections. 722*4882a593SmuzhiyunOnce **all** CPUs have executed a context switch, then **all** preceding 723*4882a593SmuzhiyunRCU read-side critical sections will have completed. 724*4882a593Smuzhiyun 725*4882a593SmuzhiyunSo, suppose that we remove a data item from its structure and then invoke 726*4882a593Smuzhiyunsynchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed 727*4882a593Smuzhiyunthat there are no RCU read-side critical sections holding a reference 728*4882a593Smuzhiyunto that data item, so we can safely reclaim it. 729*4882a593Smuzhiyun 730*4882a593Smuzhiyun.. _quiz_2: 731*4882a593Smuzhiyun 732*4882a593SmuzhiyunQuick Quiz #2: 733*4882a593Smuzhiyun Give an example where Classic RCU's read-side 734*4882a593Smuzhiyun overhead is **negative**. 735*4882a593Smuzhiyun 736*4882a593Smuzhiyun:ref:`Answers to Quick Quiz <8_whatisRCU>` 737*4882a593Smuzhiyun 738*4882a593Smuzhiyun.. _quiz_3: 739*4882a593Smuzhiyun 740*4882a593SmuzhiyunQuick Quiz #3: 741*4882a593Smuzhiyun If it is illegal to block in an RCU read-side 742*4882a593Smuzhiyun critical section, what the heck do you do in 743*4882a593Smuzhiyun PREEMPT_RT, where normal spinlocks can block??? 

So how the heck is this supposed to work???

Remember that it is illegal to block while in an RCU read-side critical
section.  Therefore, if a given CPU executes a context switch, we know
that it must have completed all preceding RCU read-side critical sections.
Once **all** CPUs have executed a context switch, then **all** preceding
RCU read-side critical sections will have completed.

So, suppose that we remove a data item from its structure and then invoke
synchronize_rcu().  Once synchronize_rcu() returns, we are guaranteed
that there are no RCU read-side critical sections holding a reference
to that data item, so we can safely reclaim it.

.. _quiz_2:

Quick Quiz #2:
    Give an example where Classic RCU's read-side
    overhead is **negative**.

:ref:`Answers to Quick Quiz <8_whatisRCU>`

.. _quiz_3:

Quick Quiz #3:
    If it is illegal to block in an RCU read-side
    critical section, what the heck do you do in
    PREEMPT_RT, where normal spinlocks can block???

:ref:`Answers to Quick Quiz <8_whatisRCU>`

.. _6_whatisRCU:

6.  ANALOGY WITH READER-WRITER LOCKING
---------------------------------------

Although RCU can be used in many different ways, a very common use of
RCU is analogous to reader-writer locking.  The following unified
diff shows how closely related RCU and reader-writer locking can be.
::

    @@ -5,5 +5,5 @@ struct el {
             int data;
             /* Other data fields */
     };
    -rwlock_t listmutex;
    +spinlock_t listmutex;
     struct el head;

    @@ -13,15 +14,15 @@
             struct list_head *lp;
             struct el *p;

    -        read_lock(&listmutex);
    -        list_for_each_entry(p, head, lp) {
    +        rcu_read_lock();
    +        list_for_each_entry_rcu(p, head, lp) {
                     if (p->key == key) {
                             *result = p->data;
    -                        read_unlock(&listmutex);
    +                        rcu_read_unlock();
                             return 1;
                     }
             }
    -        read_unlock(&listmutex);
    +        rcu_read_unlock();
             return 0;
     }

    @@ -29,15 +30,16 @@
     {
             struct el *p;

    -        write_lock(&listmutex);
    +        spin_lock(&listmutex);
             list_for_each_entry(p, head, lp) {
                     if (p->key == key) {
    -                        list_del(&p->list);
    -                        write_unlock(&listmutex);
    +                        list_del_rcu(&p->list);
    +                        spin_unlock(&listmutex);
    +                        synchronize_rcu();
                             kfree(p);
                             return 1;
                     }
             }
    -        write_unlock(&listmutex);
    +        spin_unlock(&listmutex);
             return 0;
     }

Or, for those who prefer a side-by-side listing::

     1 struct el {                           1 struct el {
     2   struct list_head list;              2   struct list_head list;
     3   long key;                           3   long key;
     4   spinlock_t mutex;                   4   spinlock_t mutex;
     5   int data;                           5   int data;
     6   /* Other data fields */             6   /* Other data fields */
     7 };                                    7 };
     8 rwlock_t listmutex;                   8 spinlock_t listmutex;
     9 struct el head;                       9 struct el head;

::

     1 int search(long key, int *result)     1 int search(long key, int *result)
     2 {                                     2 {
     3   struct list_head *lp;               3   struct list_head *lp;
     4   struct el *p;                       4   struct el *p;
     5                                       5
     6   read_lock(&listmutex);              6   rcu_read_lock();
     7   list_for_each_entry(p, head, lp) {  7   list_for_each_entry_rcu(p, head, lp) {
     8     if (p->key == key) {              8     if (p->key == key) {
     9       *result = p->data;              9       *result = p->data;
    10       read_unlock(&listmutex);       10       rcu_read_unlock();
    11       return 1;                      11       return 1;
    12     }                                12     }
    13   }                                  13   }
    14   read_unlock(&listmutex);           14   rcu_read_unlock();
    15   return 0;                          15   return 0;
    16 }                                    16 }

::

     1 int delete(long key)                  1 int delete(long key)
     2 {                                     2 {
     3   struct el *p;                       3   struct el *p;
     4                                       4
     5   write_lock(&listmutex);             5   spin_lock(&listmutex);
     6   list_for_each_entry(p, head, lp) {  6   list_for_each_entry(p, head, lp) {
     7     if (p->key == key) {              7     if (p->key == key) {
     8       list_del(&p->list);             8       list_del_rcu(&p->list);
     9       write_unlock(&listmutex);       9       spin_unlock(&listmutex);
                                            10       synchronize_rcu();
    10       kfree(p);                      11       kfree(p);
    11       return 1;                      12       return 1;
    12     }                                13     }
    13   }                                  14   }
    14   write_unlock(&listmutex);          15   spin_unlock(&listmutex);
    15   return 0;                          16   return 0;
    16 }                                    17 }

Either way, the differences are quite small.  Read-side locking moves
to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from
a reader-writer lock to a simple spinlock, and a synchronize_rcu()
precedes the kfree().

However, there is one potential catch: the read-side and update-side
critical sections can now run concurrently.  In many cases, this will
not be a problem, but it is necessary to check carefully regardless.
For example, if multiple independent list updates must be seen as
a single atomic update, converting to RCU will require special care.

Also, the presence of synchronize_rcu() means that the RCU version of
delete() can now block.  If this is a problem, there is a callback-based
mechanism that never blocks, namely call_rcu() or kfree_rcu(), that can
be used in place of synchronize_rcu().

.. _7_whatisRCU:

7.  FULL LIST OF RCU APIs
--------------------------

The RCU APIs are documented in docbook-format header comments in the
Linux-kernel source code, but it helps to have a full list of the
APIs, since there does not appear to be a way to categorize them
in docbook.  Here is the list, by category.

RCU list traversal::

    list_entry_rcu
    list_entry_lockless
    list_first_entry_rcu
    list_next_rcu
    list_for_each_entry_rcu
    list_for_each_entry_continue_rcu
    list_for_each_entry_from_rcu
    list_first_or_null_rcu
    list_next_or_null_rcu
    hlist_first_rcu
    hlist_next_rcu
    hlist_pprev_rcu
    hlist_for_each_entry_rcu
    hlist_for_each_entry_rcu_bh
    hlist_for_each_entry_from_rcu
    hlist_for_each_entry_continue_rcu
    hlist_for_each_entry_continue_rcu_bh
    hlist_nulls_first_rcu
    hlist_nulls_for_each_entry_rcu
    hlist_bl_first_rcu
    hlist_bl_for_each_entry_rcu

RCU pointer/list update::

    rcu_assign_pointer
    list_add_rcu
    list_add_tail_rcu
    list_del_rcu
    list_replace_rcu
    hlist_add_behind_rcu
    hlist_add_before_rcu
    hlist_add_head_rcu
    hlist_add_tail_rcu
    hlist_del_rcu
    hlist_del_init_rcu
    hlist_replace_rcu
    list_splice_init_rcu
    list_splice_tail_init_rcu
    hlist_nulls_del_init_rcu
    hlist_nulls_del_rcu
    hlist_nulls_add_head_rcu
    hlist_bl_add_head_rcu
    hlist_bl_del_init_rcu
    hlist_bl_del_rcu
    hlist_bl_set_first_rcu

RCU::

    Critical sections           Grace period                Barrier

    rcu_read_lock               synchronize_net             rcu_barrier
    rcu_read_unlock             synchronize_rcu
    rcu_dereference             synchronize_rcu_expedited
    rcu_read_lock_held          call_rcu
    rcu_dereference_check       kfree_rcu
    rcu_dereference_protected

bh::

    Critical sections               Grace period                Barrier

    rcu_read_lock_bh                call_rcu                    rcu_barrier
    rcu_read_unlock_bh              synchronize_rcu
    [local_bh_disable]              synchronize_rcu_expedited
    [and friends]
    rcu_dereference_bh
    rcu_dereference_bh_check
    rcu_dereference_bh_protected
    rcu_read_lock_bh_held

sched::

    Critical sections               Grace period                Barrier

    rcu_read_lock_sched             call_rcu                    rcu_barrier
    rcu_read_unlock_sched           synchronize_rcu
    [preempt_disable]               synchronize_rcu_expedited
    [and friends]
    rcu_read_lock_sched_notrace
    rcu_read_unlock_sched_notrace
    rcu_dereference_sched
    rcu_dereference_sched_check
    rcu_dereference_sched_protected
    rcu_read_lock_sched_held


SRCU::

    Critical sections               Grace period                Barrier

    srcu_read_lock                  call_srcu                   srcu_barrier
    srcu_read_unlock                synchronize_srcu
    srcu_dereference                synchronize_srcu_expedited
    srcu_dereference_check
    srcu_read_lock_held

SRCU: Initialization/cleanup::

    DEFINE_SRCU
    DEFINE_STATIC_SRCU
    init_srcu_struct
    cleanup_srcu_struct

All: lockdep-checked RCU-protected pointer access::

    rcu_access_pointer
    rcu_dereference_raw
    RCU_LOCKDEP_WARN
    rcu_sleep_check
    RCU_NONIDLE

See the comment headers in the source code (or the docbook generated
from them) for more information.

However, given that there are no fewer than four families of RCU APIs
in the Linux kernel, how do you choose which one to use?  The following
list can be helpful:

a.  Will readers need to block?  If so, you need SRCU.

b.  What about the -rt patchset?  If readers would need to block
    in a non-rt kernel, you need SRCU.  If readers would block
    in a -rt kernel, but not in a non-rt kernel, SRCU is not
    necessary.  (The -rt patchset turns spinlocks into sleeplocks,
    hence this distinction.)

c.  Do you need to treat NMI handlers, hardirq handlers,
    and code segments with preemption disabled (whether
    via preempt_disable(), local_irq_save(), local_bh_disable(),
    or some other mechanism) as if they were explicit RCU readers?
    If so, RCU-sched is the only choice that will work for you.

d.  Do you need RCU grace periods to complete even in the face
    of softirq monopolization of one or more of the CPUs?  For
    example, is your code subject to network-based denial-of-service
    attacks?  If so, you should disable softirq across your readers,
    for example, by using rcu_read_lock_bh().

e.  Is your workload too update-intensive for normal use of
    RCU, but inappropriate for other synchronization mechanisms?
    If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
    named SLAB_DESTROY_BY_RCU).  But please be careful!

f.  Do you need read-side critical sections that are respected
    even though they are in the middle of the idle loop, during
    user-mode execution, or on an offlined CPU?  If so, SRCU is the
    only choice that will work for you.

g.  Otherwise, use RCU.

Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.
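
For the SRCU case, the shape of the API can be illustrated by the
following fragments.  (The srcu_struct my_srcu, the pointer gbl_ptr, and
the surrounding names are hypothetical, invented only for this sketch.)
::

    DEFINE_SRCU(my_srcu);

    /* Reader, which is permitted to block: */
    idx = srcu_read_lock(&my_srcu);
    p = srcu_dereference(gbl_ptr, &my_srcu);
    if (p)
        do_something_that_might_sleep(p);
    srcu_read_unlock(&my_srcu, idx);

    /* Updater, waiting only for readers of my_srcu: */
    rcu_assign_pointer(gbl_ptr, new_p);
    synchronize_srcu(&my_srcu);
    kfree(old_p);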

.. _8_whatisRCU:

8. ANSWERS TO QUICK QUIZZES
----------------------------

Quick Quiz #1:
        Why is this argument naive?  How could a deadlock
        occur when using this algorithm in a real-world Linux
        kernel?  [Referring to the lock-based "toy" RCU
        algorithm.]

Answer:
        Consider the following sequence of events:

        1. CPU 0 acquires some unrelated lock, call it
           "problematic_lock", disabling irq via
           spin_lock_irqsave().

        2. CPU 1 enters synchronize_rcu(), write-acquiring
           rcu_gp_mutex.

        3. CPU 0 enters rcu_read_lock(), but must wait
           because CPU 1 holds rcu_gp_mutex.

        4. CPU 1 is interrupted, and the irq handler
           attempts to acquire problematic_lock.

        The system is now deadlocked.

        One way to avoid this deadlock is to use an approach like
        that of CONFIG_PREEMPT_RT, where all normal spinlocks
        become blocking locks, and all irq handlers execute in
        the context of special tasks.  In this case, in step 4
        above, the irq handler would block, allowing CPU 1 to
        release rcu_gp_mutex, avoiding the deadlock.

        Even in the absence of deadlock, this RCU implementation
        allows latency to "bleed" from readers to other
        readers through synchronize_rcu().  To see this,
        consider task A in an RCU read-side critical section
        (thus read-holding rcu_gp_mutex), task B blocked
        attempting to write-acquire rcu_gp_mutex, and
        task C blocked in rcu_read_lock() attempting to
        read-acquire rcu_gp_mutex.  Task A's RCU read-side
        latency is holding up task C, albeit indirectly via
        task B.

        Realtime RCU implementations therefore use a counter-based
        approach where tasks in RCU read-side critical sections
        cannot be blocked by tasks executing synchronize_rcu().
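
        For concreteness, the four deadlock steps above can be sketched
        as the two CPUs' code paths against the lock-based "toy"
        implementation's rcu_gp_mutex.  The lock name problematic_lock,
        the irq handler, and the function names are invented for this
        illustration::

          static DEFINE_SPINLOCK(problematic_lock);

          /* CPU 0, process context: steps 1 and 3. */
          static void cpu0_path(void)
          {
                  unsigned long flags;

                  spin_lock_irqsave(&problematic_lock, flags);  /* step 1 */
                  rcu_read_lock();        /* step 3: spins, because CPU 1
                                             write-holds rcu_gp_mutex */
                  /* never reached */
                  rcu_read_unlock();
                  spin_unlock_irqrestore(&problematic_lock, flags);
          }

          /* CPU 1, process context: step 2. */
          static void cpu1_path(void)
          {
                  synchronize_rcu();      /* step 2: write-acquires rcu_gp_mutex */
          }

          /* CPU 1, irq context, arriving during step 2: step 4. */
          static irqreturn_t problematic_irq(int irq, void *unused)
          {
                  spin_lock(&problematic_lock);   /* step 4: spins, CPU 0 holds it */
                  /* never reached */
                  spin_unlock(&problematic_lock);
                  return IRQ_HANDLED;
          }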

:ref:`Back to Quick Quiz #1 <quiz_1>`

Quick Quiz #2:
        Give an example where Classic RCU's read-side
        overhead is **negative**.

Answer:
        Imagine a single-CPU system with a non-CONFIG_PREEMPT
        kernel where a routing table is used by process-context
        code, but can be updated by irq-context code (for example,
        by an "ICMP REDIRECT" packet).  The usual way of handling
        this would be to have the process-context code disable
        interrupts while searching the routing table.  Use of
        RCU allows such interrupt-disabling to be dispensed with.
        Thus, without RCU, you pay the cost of disabling interrupts,
        and with RCU you don't.

        One can argue that the overhead of RCU in this
        case is negative with respect to the single-CPU
        interrupt-disabling approach.  Others might argue that
        the overhead of RCU is merely zero, and that replacing
        the positive overhead of the interrupt-disabling scheme
        with the zero-overhead RCU scheme does not constitute
        negative overhead.

        In real life, of course, things are more complex.  But
        even the theoretical possibility of negative overhead for
        a synchronization primitive is a bit unexpected.  ;-)
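
        The contrast can be sketched as follows, where
        __route_table_lookup() stands in for a hypothetical search of
        the routing table and only the read-side protection differs::

          extern int __route_table_lookup(int dst);   /* hypothetical search */

          /* Without RCU: lock out the irq-context updater on every lookup. */
          int route_next_hop_irqoff(int dst)
          {
                  unsigned long flags;
                  int hop;

                  local_irq_save(flags);          /* overhead paid on every read */
                  hop = __route_table_lookup(dst);
                  local_irq_restore(flags);
                  return hop;
          }

          /* With RCU: no interrupt disabling on the read side.  On a
             non-CONFIG_PREEMPT kernel these markers generate no code. */
          int route_next_hop_rcu(int dst)
          {
                  int hop;

                  rcu_read_lock();
                  hop = __route_table_lookup(dst);
                  rcu_read_unlock();
                  return hop;
          }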

:ref:`Back to Quick Quiz #2 <quiz_2>`

Quick Quiz #3:
        If it is illegal to block in an RCU read-side
        critical section, what the heck do you do in
        PREEMPT_RT, where normal spinlocks can block???

Answer:
        Just as PREEMPT_RT permits preemption of spinlock
        critical sections, it permits preemption of RCU
        read-side critical sections.  It also permits
        spinlocks blocking while in RCU read-side critical
        sections.

        Why the apparent inconsistency?  Because it is
        possible to use priority boosting to keep the RCU
        grace periods short if need be (for example, if running
        short of memory).  In contrast, if blocking waiting
        for (say) network reception, there is no way to know
        what should be boosted.  Especially given that the
        process we need to boost might well be a human being
        who just went out for a pizza or something.  And although
        a computer-operated cattle prod might arouse serious
        interest, it might also provoke serious objections.
        Besides, how does the computer know what pizza parlor
        the human being went to???

:ref:`Back to Quick Quiz #3 <quiz_3>`

ACKNOWLEDGEMENTS

My thanks to the people who helped make this human-readable, including
Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.


For more information, see http://www.rdrop.com/users/paulmck/RCU.