1*4882a593Smuzhiyun================================= 2*4882a593SmuzhiyunA Tour Through RCU's Requirements 3*4882a593Smuzhiyun================================= 4*4882a593Smuzhiyun 5*4882a593SmuzhiyunCopyright IBM Corporation, 2015 6*4882a593Smuzhiyun 7*4882a593SmuzhiyunAuthor: Paul E. McKenney 8*4882a593Smuzhiyun 9*4882a593SmuzhiyunThe initial version of this document appeared in the 10*4882a593Smuzhiyun`LWN <https://lwn.net/>`_ on those articles: 11*4882a593Smuzhiyun`part 1 <https://lwn.net/Articles/652156/>`_, 12*4882a593Smuzhiyun`part 2 <https://lwn.net/Articles/652677/>`_, and 13*4882a593Smuzhiyun`part 3 <https://lwn.net/Articles/653326/>`_. 14*4882a593Smuzhiyun 15*4882a593SmuzhiyunIntroduction 16*4882a593Smuzhiyun------------ 17*4882a593Smuzhiyun 18*4882a593SmuzhiyunRead-copy update (RCU) is a synchronization mechanism that is often used 19*4882a593Smuzhiyunas a replacement for reader-writer locking. RCU is unusual in that 20*4882a593Smuzhiyunupdaters do not block readers, which means that RCU's read-side 21*4882a593Smuzhiyunprimitives can be exceedingly fast and scalable. In addition, updaters 22*4882a593Smuzhiyuncan make useful forward progress concurrently with readers. However, all 23*4882a593Smuzhiyunthis concurrency between RCU readers and updaters does raise the 24*4882a593Smuzhiyunquestion of exactly what RCU readers are doing, which in turn raises the 25*4882a593Smuzhiyunquestion of exactly what RCU's requirements are. 26*4882a593Smuzhiyun 27*4882a593SmuzhiyunThis document therefore summarizes RCU's requirements, and can be 28*4882a593Smuzhiyunthought of as an informal, high-level specification for RCU. It is 29*4882a593Smuzhiyunimportant to understand that RCU's specification is primarily empirical 30*4882a593Smuzhiyunin nature; in fact, I learned about many of these requirements the hard 31*4882a593Smuzhiyunway. This situation might cause some consternation, however, not only 32*4882a593Smuzhiyunhas this learning process been a lot of fun, but it has also been a 33*4882a593Smuzhiyungreat privilege to work with so many people willing to apply 34*4882a593Smuzhiyuntechnologies in interesting new ways. 35*4882a593Smuzhiyun 36*4882a593SmuzhiyunAll that aside, here are the categories of currently known RCU 37*4882a593Smuzhiyunrequirements: 38*4882a593Smuzhiyun 39*4882a593Smuzhiyun#. `Fundamental Requirements`_ 40*4882a593Smuzhiyun#. `Fundamental Non-Requirements`_ 41*4882a593Smuzhiyun#. `Parallelism Facts of Life`_ 42*4882a593Smuzhiyun#. `Quality-of-Implementation Requirements`_ 43*4882a593Smuzhiyun#. `Linux Kernel Complications`_ 44*4882a593Smuzhiyun#. `Software-Engineering Requirements`_ 45*4882a593Smuzhiyun#. `Other RCU Flavors`_ 46*4882a593Smuzhiyun#. `Possible Future Changes`_ 47*4882a593Smuzhiyun 48*4882a593SmuzhiyunThis is followed by a `summary <#Summary>`__, however, the answers to 49*4882a593Smuzhiyuneach quick quiz immediately follows the quiz. Select the big white space 50*4882a593Smuzhiyunwith your mouse to see the answer. 51*4882a593Smuzhiyun 52*4882a593SmuzhiyunFundamental Requirements 53*4882a593Smuzhiyun------------------------ 54*4882a593Smuzhiyun 55*4882a593SmuzhiyunRCU's fundamental requirements are the closest thing RCU has to hard 56*4882a593Smuzhiyunmathematical requirements. These are: 57*4882a593Smuzhiyun 58*4882a593Smuzhiyun#. `Grace-Period Guarantee`_ 59*4882a593Smuzhiyun#. `Publish/Subscribe Guarantee`_ 60*4882a593Smuzhiyun#. `Memory-Barrier Guarantees`_ 61*4882a593Smuzhiyun#. 
`RCU Primitives Guaranteed to Execute Unconditionally`_ 62*4882a593Smuzhiyun#. `Guaranteed Read-to-Write Upgrade`_ 63*4882a593Smuzhiyun 64*4882a593SmuzhiyunGrace-Period Guarantee 65*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~ 66*4882a593Smuzhiyun 67*4882a593SmuzhiyunRCU's grace-period guarantee is unusual in being premeditated: Jack 68*4882a593SmuzhiyunSlingwine and I had this guarantee firmly in mind when we started work 69*4882a593Smuzhiyunon RCU (then called “rclock”) in the early 1990s. That said, the past 70*4882a593Smuzhiyuntwo decades of experience with RCU have produced a much more detailed 71*4882a593Smuzhiyununderstanding of this guarantee. 72*4882a593Smuzhiyun 73*4882a593SmuzhiyunRCU's grace-period guarantee allows updaters to wait for the completion 74*4882a593Smuzhiyunof all pre-existing RCU read-side critical sections. An RCU read-side 75*4882a593Smuzhiyuncritical section begins with the marker ``rcu_read_lock()`` and ends 76*4882a593Smuzhiyunwith the marker ``rcu_read_unlock()``. These markers may be nested, and 77*4882a593SmuzhiyunRCU treats a nested set as one big RCU read-side critical section. 78*4882a593SmuzhiyunProduction-quality implementations of ``rcu_read_lock()`` and 79*4882a593Smuzhiyun``rcu_read_unlock()`` are extremely lightweight, and in fact have 80*4882a593Smuzhiyunexactly zero overhead in Linux kernels built for production use with 81*4882a593Smuzhiyun``CONFIG_PREEMPT=n``. 82*4882a593Smuzhiyun 83*4882a593SmuzhiyunThis guarantee allows ordering to be enforced with extremely low 84*4882a593Smuzhiyunoverhead to readers, for example: 85*4882a593Smuzhiyun 86*4882a593Smuzhiyun :: 87*4882a593Smuzhiyun 88*4882a593Smuzhiyun 1 int x, y; 89*4882a593Smuzhiyun 2 90*4882a593Smuzhiyun 3 void thread0(void) 91*4882a593Smuzhiyun 4 { 92*4882a593Smuzhiyun 5 rcu_read_lock(); 93*4882a593Smuzhiyun 6 r1 = READ_ONCE(x); 94*4882a593Smuzhiyun 7 r2 = READ_ONCE(y); 95*4882a593Smuzhiyun 8 rcu_read_unlock(); 96*4882a593Smuzhiyun 9 } 97*4882a593Smuzhiyun 10 98*4882a593Smuzhiyun 11 void thread1(void) 99*4882a593Smuzhiyun 12 { 100*4882a593Smuzhiyun 13 WRITE_ONCE(x, 1); 101*4882a593Smuzhiyun 14 synchronize_rcu(); 102*4882a593Smuzhiyun 15 WRITE_ONCE(y, 1); 103*4882a593Smuzhiyun 16 } 104*4882a593Smuzhiyun 105*4882a593SmuzhiyunBecause the ``synchronize_rcu()`` on line 14 waits for all pre-existing 106*4882a593Smuzhiyunreaders, any instance of ``thread0()`` that loads a value of zero from 107*4882a593Smuzhiyun``x`` must complete before ``thread1()`` stores to ``y``, so that 108*4882a593Smuzhiyuninstance must also load a value of zero from ``y``. Similarly, any 109*4882a593Smuzhiyuninstance of ``thread0()`` that loads a value of one from ``y`` must have 110*4882a593Smuzhiyunstarted after the ``synchronize_rcu()`` started, and must therefore also 111*4882a593Smuzhiyunload a value of one from ``x``. Therefore, the outcome: 112*4882a593Smuzhiyun 113*4882a593Smuzhiyun :: 114*4882a593Smuzhiyun 115*4882a593Smuzhiyun (r1 == 0 && r2 == 1) 116*4882a593Smuzhiyun 117*4882a593Smuzhiyuncannot happen. 118*4882a593Smuzhiyun 119*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 120*4882a593Smuzhiyun| **Quick Quiz**: | 121*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 122*4882a593Smuzhiyun| Wait a minute! You said that updaters can make useful forward | 123*4882a593Smuzhiyun| progress concurrently with readers, but pre-existing readers will | 124*4882a593Smuzhiyun| block ``synchronize_rcu()``!!! 
| 125*4882a593Smuzhiyun| Just who are you trying to fool??? | 126*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 127*4882a593Smuzhiyun| **Answer**: | 128*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 129*4882a593Smuzhiyun| First, if updaters do not wish to be blocked by readers, they can use | 130*4882a593Smuzhiyun| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. | 131*4882a593Smuzhiyun| Second, even when using ``synchronize_rcu()``, the other update-side | 132*4882a593Smuzhiyun| code does run concurrently with readers, whether pre-existing or not. | 133*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 134*4882a593Smuzhiyun 135*4882a593SmuzhiyunThis scenario resembles one of the first uses of RCU in 136*4882a593Smuzhiyun`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a 137*4882a593Smuzhiyundistributed lock manager's transition into a state suitable for handling 138*4882a593Smuzhiyunrecovery from node failure, more or less as follows: 139*4882a593Smuzhiyun 140*4882a593Smuzhiyun :: 141*4882a593Smuzhiyun 142*4882a593Smuzhiyun 1 #define STATE_NORMAL 0 143*4882a593Smuzhiyun 2 #define STATE_WANT_RECOVERY 1 144*4882a593Smuzhiyun 3 #define STATE_RECOVERING 2 145*4882a593Smuzhiyun 4 #define STATE_WANT_NORMAL 3 146*4882a593Smuzhiyun 5 147*4882a593Smuzhiyun 6 int state = STATE_NORMAL; 148*4882a593Smuzhiyun 7 149*4882a593Smuzhiyun 8 void do_something_dlm(void) 150*4882a593Smuzhiyun 9 { 151*4882a593Smuzhiyun 10 int state_snap; 152*4882a593Smuzhiyun 11 153*4882a593Smuzhiyun 12 rcu_read_lock(); 154*4882a593Smuzhiyun 13 state_snap = READ_ONCE(state); 155*4882a593Smuzhiyun 14 if (state_snap == STATE_NORMAL) 156*4882a593Smuzhiyun 15 do_something(); 157*4882a593Smuzhiyun 16 else 158*4882a593Smuzhiyun 17 do_something_carefully(); 159*4882a593Smuzhiyun 18 rcu_read_unlock(); 160*4882a593Smuzhiyun 19 } 161*4882a593Smuzhiyun 20 162*4882a593Smuzhiyun 21 void start_recovery(void) 163*4882a593Smuzhiyun 22 { 164*4882a593Smuzhiyun 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); 165*4882a593Smuzhiyun 24 synchronize_rcu(); 166*4882a593Smuzhiyun 25 WRITE_ONCE(state, STATE_RECOVERING); 167*4882a593Smuzhiyun 26 recovery(); 168*4882a593Smuzhiyun 27 WRITE_ONCE(state, STATE_WANT_NORMAL); 169*4882a593Smuzhiyun 28 synchronize_rcu(); 170*4882a593Smuzhiyun 29 WRITE_ONCE(state, STATE_NORMAL); 171*4882a593Smuzhiyun 30 } 172*4882a593Smuzhiyun 173*4882a593SmuzhiyunThe RCU read-side critical section in ``do_something_dlm()`` works with 174*4882a593Smuzhiyunthe ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that 175*4882a593Smuzhiyun``do_something()`` never runs concurrently with ``recovery()``, but with 176*4882a593Smuzhiyunlittle or no synchronization overhead in ``do_something_dlm()``. 177*4882a593Smuzhiyun 178*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 179*4882a593Smuzhiyun| **Quick Quiz**: | 180*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 181*4882a593Smuzhiyun| Why is the ``synchronize_rcu()`` on line 28 needed? 
| 182*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 183*4882a593Smuzhiyun| **Answer**: | 184*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 185*4882a593Smuzhiyun| Without that extra grace period, memory reordering could result in | 186*4882a593Smuzhiyun| ``do_something_dlm()`` executing ``do_something()`` concurrently with | 187*4882a593Smuzhiyun| the last bits of ``recovery()``. | 188*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 189*4882a593Smuzhiyun 190*4882a593SmuzhiyunIn order to avoid fatal problems such as deadlocks, an RCU read-side 191*4882a593Smuzhiyuncritical section must not contain calls to ``synchronize_rcu()``. 192*4882a593SmuzhiyunSimilarly, an RCU read-side critical section must not contain anything 193*4882a593Smuzhiyunthat waits, directly or indirectly, on completion of an invocation of 194*4882a593Smuzhiyun``synchronize_rcu()``. 195*4882a593Smuzhiyun 196*4882a593SmuzhiyunAlthough RCU's grace-period guarantee is useful in and of itself, with 197*4882a593Smuzhiyun`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would 198*4882a593Smuzhiyunbe good to be able to use RCU to coordinate read-side access to linked 199*4882a593Smuzhiyundata structures. For this, the grace-period guarantee is not sufficient, 200*4882a593Smuzhiyunas can be seen in function ``add_gp_buggy()`` below. We will look at the 201*4882a593Smuzhiyunreader's code later, but in the meantime, just think of the reader as 202*4882a593Smuzhiyunlocklessly picking up the ``gp`` pointer, and, if the value loaded is 203*4882a593Smuzhiyunnon-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. 204*4882a593Smuzhiyun 205*4882a593Smuzhiyun :: 206*4882a593Smuzhiyun 207*4882a593Smuzhiyun 1 bool add_gp_buggy(int a, int b) 208*4882a593Smuzhiyun 2 { 209*4882a593Smuzhiyun 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 210*4882a593Smuzhiyun 4 if (!p) 211*4882a593Smuzhiyun 5 return -ENOMEM; 212*4882a593Smuzhiyun 6 spin_lock(&gp_lock); 213*4882a593Smuzhiyun 7 if (rcu_access_pointer(gp)) { 214*4882a593Smuzhiyun 8 spin_unlock(&gp_lock); 215*4882a593Smuzhiyun 9 return false; 216*4882a593Smuzhiyun 10 } 217*4882a593Smuzhiyun 11 p->a = a; 218*4882a593Smuzhiyun 12 p->b = a; 219*4882a593Smuzhiyun 13 gp = p; /* ORDERING BUG */ 220*4882a593Smuzhiyun 14 spin_unlock(&gp_lock); 221*4882a593Smuzhiyun 15 return true; 222*4882a593Smuzhiyun 16 } 223*4882a593Smuzhiyun 224*4882a593SmuzhiyunThe problem is that both the compiler and weakly ordered CPUs are within 225*4882a593Smuzhiyuntheir rights to reorder this code as follows: 226*4882a593Smuzhiyun 227*4882a593Smuzhiyun :: 228*4882a593Smuzhiyun 229*4882a593Smuzhiyun 1 bool add_gp_buggy_optimized(int a, int b) 230*4882a593Smuzhiyun 2 { 231*4882a593Smuzhiyun 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 232*4882a593Smuzhiyun 4 if (!p) 233*4882a593Smuzhiyun 5 return -ENOMEM; 234*4882a593Smuzhiyun 6 spin_lock(&gp_lock); 235*4882a593Smuzhiyun 7 if (rcu_access_pointer(gp)) { 236*4882a593Smuzhiyun 8 spin_unlock(&gp_lock); 237*4882a593Smuzhiyun 9 return false; 238*4882a593Smuzhiyun 10 } 239*4882a593Smuzhiyun 11 gp = p; /* ORDERING BUG */ 240*4882a593Smuzhiyun 12 p->a = a; 241*4882a593Smuzhiyun 13 p->b = a; 242*4882a593Smuzhiyun 14 spin_unlock(&gp_lock); 243*4882a593Smuzhiyun 15 return true; 244*4882a593Smuzhiyun 16 } 245*4882a593Smuzhiyun 246*4882a593SmuzhiyunIf an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized`` 247*4882a593Smuzhiyunexecutes line 
11, it will see garbage in the ``->a`` and ``->b`` fields. 248*4882a593SmuzhiyunAnd this is but one of many ways in which compiler and hardware 249*4882a593Smuzhiyunoptimizations could cause trouble. Therefore, we clearly need some way 250*4882a593Smuzhiyunto prevent the compiler and the CPU from reordering in this manner, 251*4882a593Smuzhiyunwhich brings us to the publish-subscribe guarantee discussed in the next 252*4882a593Smuzhiyunsection. 253*4882a593Smuzhiyun 254*4882a593SmuzhiyunPublish/Subscribe Guarantee 255*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~ 256*4882a593Smuzhiyun 257*4882a593SmuzhiyunRCU's publish-subscribe guarantee allows data to be inserted into a 258*4882a593Smuzhiyunlinked data structure without disrupting RCU readers. The updater uses 259*4882a593Smuzhiyun``rcu_assign_pointer()`` to insert the new data, and readers use 260*4882a593Smuzhiyun``rcu_dereference()`` to access data, whether new or old. The following 261*4882a593Smuzhiyunshows an example of insertion: 262*4882a593Smuzhiyun 263*4882a593Smuzhiyun :: 264*4882a593Smuzhiyun 265*4882a593Smuzhiyun 1 bool add_gp(int a, int b) 266*4882a593Smuzhiyun 2 { 267*4882a593Smuzhiyun 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 268*4882a593Smuzhiyun 4 if (!p) 269*4882a593Smuzhiyun 5 return -ENOMEM; 270*4882a593Smuzhiyun 6 spin_lock(&gp_lock); 271*4882a593Smuzhiyun 7 if (rcu_access_pointer(gp)) { 272*4882a593Smuzhiyun 8 spin_unlock(&gp_lock); 273*4882a593Smuzhiyun 9 return false; 274*4882a593Smuzhiyun 10 } 275*4882a593Smuzhiyun 11 p->a = a; 276*4882a593Smuzhiyun 12 p->b = a; 277*4882a593Smuzhiyun 13 rcu_assign_pointer(gp, p); 278*4882a593Smuzhiyun 14 spin_unlock(&gp_lock); 279*4882a593Smuzhiyun 15 return true; 280*4882a593Smuzhiyun 16 } 281*4882a593Smuzhiyun 282*4882a593SmuzhiyunThe ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a 283*4882a593Smuzhiyunsimple assignment statement, but also guarantees that its assignment 284*4882a593Smuzhiyunwill happen after the two assignments in lines 11 and 12, similar to the 285*4882a593SmuzhiyunC11 ``memory_order_release`` store operation. It also prevents any 286*4882a593Smuzhiyunnumber of “interesting” compiler optimizations, for example, the use of 287*4882a593Smuzhiyun``gp`` as a scratch location immediately preceding the assignment. 288*4882a593Smuzhiyun 289*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 290*4882a593Smuzhiyun| **Quick Quiz**: | 291*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 292*4882a593Smuzhiyun| But ``rcu_assign_pointer()`` does nothing to prevent the two | 293*4882a593Smuzhiyun| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that | 294*4882a593Smuzhiyun| also cause problems? | 295*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 296*4882a593Smuzhiyun| **Answer**: | 297*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 298*4882a593Smuzhiyun| No, it cannot. The readers cannot see either of these two fields | 299*4882a593Smuzhiyun| until the assignment to ``gp``, by which time both fields are fully | 300*4882a593Smuzhiyun| initialized. So reordering the assignments to ``p->a`` and ``p->b`` | 301*4882a593Smuzhiyun| cannot possibly cause any problems. 
| 302*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 303*4882a593Smuzhiyun 304*4882a593SmuzhiyunIt is tempting to assume that the reader need not do anything special to 305*4882a593Smuzhiyuncontrol its accesses to the RCU-protected data, as shown in 306*4882a593Smuzhiyun``do_something_gp_buggy()`` below: 307*4882a593Smuzhiyun 308*4882a593Smuzhiyun :: 309*4882a593Smuzhiyun 310*4882a593Smuzhiyun 1 bool do_something_gp_buggy(void) 311*4882a593Smuzhiyun 2 { 312*4882a593Smuzhiyun 3 rcu_read_lock(); 313*4882a593Smuzhiyun 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ 314*4882a593Smuzhiyun 5 if (p) { 315*4882a593Smuzhiyun 6 do_something(p->a, p->b); 316*4882a593Smuzhiyun 7 rcu_read_unlock(); 317*4882a593Smuzhiyun 8 return true; 318*4882a593Smuzhiyun 9 } 319*4882a593Smuzhiyun 10 rcu_read_unlock(); 320*4882a593Smuzhiyun 11 return false; 321*4882a593Smuzhiyun 12 } 322*4882a593Smuzhiyun 323*4882a593SmuzhiyunHowever, this temptation must be resisted because there are a 324*4882a593Smuzhiyunsurprisingly large number of ways that the compiler (to say nothing of 325*4882a593Smuzhiyun`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__) 326*4882a593Smuzhiyuncan trip this code up. For but one example, if the compiler were short 327*4882a593Smuzhiyunof registers, it might choose to refetch from ``gp`` rather than keeping 328*4882a593Smuzhiyuna separate copy in ``p`` as follows: 329*4882a593Smuzhiyun 330*4882a593Smuzhiyun :: 331*4882a593Smuzhiyun 332*4882a593Smuzhiyun 1 bool do_something_gp_buggy_optimized(void) 333*4882a593Smuzhiyun 2 { 334*4882a593Smuzhiyun 3 rcu_read_lock(); 335*4882a593Smuzhiyun 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ 336*4882a593Smuzhiyun 5 do_something(gp->a, gp->b); 337*4882a593Smuzhiyun 6 rcu_read_unlock(); 338*4882a593Smuzhiyun 7 return true; 339*4882a593Smuzhiyun 8 } 340*4882a593Smuzhiyun 9 rcu_read_unlock(); 341*4882a593Smuzhiyun 10 return false; 342*4882a593Smuzhiyun 11 } 343*4882a593Smuzhiyun 344*4882a593SmuzhiyunIf this function ran concurrently with a series of updates that replaced 345*4882a593Smuzhiyunthe current structure with a new one, the fetches of ``gp->a`` and 346*4882a593Smuzhiyun``gp->b`` might well come from two different structures, which could 347*4882a593Smuzhiyuncause serious confusion. To prevent this (and much else besides), 348*4882a593Smuzhiyun``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``: 349*4882a593Smuzhiyun 350*4882a593Smuzhiyun :: 351*4882a593Smuzhiyun 352*4882a593Smuzhiyun 1 bool do_something_gp(void) 353*4882a593Smuzhiyun 2 { 354*4882a593Smuzhiyun 3 rcu_read_lock(); 355*4882a593Smuzhiyun 4 p = rcu_dereference(gp); 356*4882a593Smuzhiyun 5 if (p) { 357*4882a593Smuzhiyun 6 do_something(p->a, p->b); 358*4882a593Smuzhiyun 7 rcu_read_unlock(); 359*4882a593Smuzhiyun 8 return true; 360*4882a593Smuzhiyun 9 } 361*4882a593Smuzhiyun 10 rcu_read_unlock(); 362*4882a593Smuzhiyun 11 return false; 363*4882a593Smuzhiyun 12 } 364*4882a593Smuzhiyun 365*4882a593SmuzhiyunThe ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory 366*4882a593Smuzhiyunbarriers in the Linux kernel. Should a `high-quality implementation of 367*4882a593SmuzhiyunC11 ``memory_order_consume`` 368*4882a593Smuzhiyun[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__ 369*4882a593Smuzhiyunever appear, then ``rcu_dereference()`` could be implemented as a 370*4882a593Smuzhiyun``memory_order_consume`` load. 
Regardless of the exact implementation, a 371*4882a593Smuzhiyunpointer fetched by ``rcu_dereference()`` may not be used outside of the 372*4882a593Smuzhiyunoutermost RCU read-side critical section containing that 373*4882a593Smuzhiyun``rcu_dereference()``, unless protection of the corresponding data 374*4882a593Smuzhiyunelement has been passed from RCU to some other synchronization 375*4882a593Smuzhiyunmechanism, most commonly locking or `reference 376*4882a593Smuzhiyuncounting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__. 377*4882a593Smuzhiyun 378*4882a593SmuzhiyunIn short, updaters use ``rcu_assign_pointer()`` and readers use 379*4882a593Smuzhiyun``rcu_dereference()``, and these two RCU API elements work together to 380*4882a593Smuzhiyunensure that readers have a consistent view of newly added data elements. 381*4882a593Smuzhiyun 382*4882a593SmuzhiyunOf course, it is also necessary to remove elements from RCU-protected 383*4882a593Smuzhiyundata structures, for example, using the following process: 384*4882a593Smuzhiyun 385*4882a593Smuzhiyun#. Remove the data element from the enclosing structure. 386*4882a593Smuzhiyun#. Wait for all pre-existing RCU read-side critical sections to complete 387*4882a593Smuzhiyun (because only pre-existing readers can possibly have a reference to 388*4882a593Smuzhiyun the newly removed data element). 389*4882a593Smuzhiyun#. At this point, only the updater has a reference to the newly removed 390*4882a593Smuzhiyun data element, so it can safely reclaim the data element, for example, 391*4882a593Smuzhiyun by passing it to ``kfree()``. 392*4882a593Smuzhiyun 393*4882a593SmuzhiyunThis process is implemented by ``remove_gp_synchronous()``: 394*4882a593Smuzhiyun 395*4882a593Smuzhiyun :: 396*4882a593Smuzhiyun 397*4882a593Smuzhiyun 1 bool remove_gp_synchronous(void) 398*4882a593Smuzhiyun 2 { 399*4882a593Smuzhiyun 3 struct foo *p; 400*4882a593Smuzhiyun 4 401*4882a593Smuzhiyun 5 spin_lock(&gp_lock); 402*4882a593Smuzhiyun 6 p = rcu_access_pointer(gp); 403*4882a593Smuzhiyun 7 if (!p) { 404*4882a593Smuzhiyun 8 spin_unlock(&gp_lock); 405*4882a593Smuzhiyun 9 return false; 406*4882a593Smuzhiyun 10 } 407*4882a593Smuzhiyun 11 rcu_assign_pointer(gp, NULL); 408*4882a593Smuzhiyun 12 spin_unlock(&gp_lock); 409*4882a593Smuzhiyun 13 synchronize_rcu(); 410*4882a593Smuzhiyun 14 kfree(p); 411*4882a593Smuzhiyun 15 return true; 412*4882a593Smuzhiyun 16 } 413*4882a593Smuzhiyun 414*4882a593SmuzhiyunThis function is straightforward, with line 13 waiting for a grace 415*4882a593Smuzhiyunperiod before line 14 frees the old data element. This waiting ensures 416*4882a593Smuzhiyunthat readers will reach line 7 of ``do_something_gp()`` before the data 417*4882a593Smuzhiyunelement referenced by ``p`` is freed. The ``rcu_access_pointer()`` on 418*4882a593Smuzhiyunline 6 is similar to ``rcu_dereference()``, except that: 419*4882a593Smuzhiyun 420*4882a593Smuzhiyun#. The value returned by ``rcu_access_pointer()`` cannot be 421*4882a593Smuzhiyun dereferenced. If you want to access the value pointed to as well as 422*4882a593Smuzhiyun the pointer itself, use ``rcu_dereference()`` instead of 423*4882a593Smuzhiyun ``rcu_access_pointer()``. 424*4882a593Smuzhiyun#. The call to ``rcu_access_pointer()`` need not be protected. 
In 425*4882a593Smuzhiyun contrast, ``rcu_dereference()`` must either be within an RCU 426*4882a593Smuzhiyun read-side critical section or in a code segment where the pointer 427*4882a593Smuzhiyun cannot change, for example, in code protected by the corresponding 428*4882a593Smuzhiyun update-side lock. 429*4882a593Smuzhiyun 430*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 431*4882a593Smuzhiyun| **Quick Quiz**: | 432*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 433*4882a593Smuzhiyun| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, | 434*4882a593Smuzhiyun| what destructive optimizations might the compiler make use of? | 435*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 436*4882a593Smuzhiyun| **Answer**: | 437*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 438*4882a593Smuzhiyun| Let's start with what happens to ``do_something_gp()`` if it fails to | 439*4882a593Smuzhiyun| use ``rcu_dereference()``. It could reuse a value formerly fetched | 440*4882a593Smuzhiyun| from this same pointer. It could also fetch the pointer from ``gp`` | 441*4882a593Smuzhiyun| in a byte-at-a-time manner, resulting in *load tearing*, in turn | 442*4882a593Smuzhiyun| resulting a bytewise mash-up of two distinct pointer values. It might | 443*4882a593Smuzhiyun| even use value-speculation optimizations, where it makes a wrong | 444*4882a593Smuzhiyun| guess, but by the time it gets around to checking the value, an | 445*4882a593Smuzhiyun| update has changed the pointer to match the wrong guess. Too bad | 446*4882a593Smuzhiyun| about any dereferences that returned pre-initialization garbage in | 447*4882a593Smuzhiyun| the meantime! | 448*4882a593Smuzhiyun| For ``remove_gp_synchronous()``, as long as all modifications to | 449*4882a593Smuzhiyun| ``gp`` are carried out while holding ``gp_lock``, the above | 450*4882a593Smuzhiyun| optimizations are harmless. However, ``sparse`` will complain if you | 451*4882a593Smuzhiyun| define ``gp`` with ``__rcu`` and then access it without using either | 452*4882a593Smuzhiyun| ``rcu_access_pointer()`` or ``rcu_dereference()``. | 453*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 454*4882a593Smuzhiyun 455*4882a593SmuzhiyunIn short, RCU's publish-subscribe guarantee is provided by the 456*4882a593Smuzhiyuncombination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This 457*4882a593Smuzhiyunguarantee allows data elements to be safely added to RCU-protected 458*4882a593Smuzhiyunlinked data structures without disrupting RCU readers. This guarantee 459*4882a593Smuzhiyuncan be used in combination with the grace-period guarantee to also allow 460*4882a593Smuzhiyundata elements to be removed from RCU-protected linked data structures, 461*4882a593Smuzhiyunagain without disrupting RCU readers. 462*4882a593Smuzhiyun 463*4882a593SmuzhiyunThis guarantee was only partially premeditated. DYNIX/ptx used an 464*4882a593Smuzhiyunexplicit memory barrier for publication, but had nothing resembling 465*4882a593Smuzhiyun``rcu_dereference()`` for subscription, nor did it have anything 466*4882a593Smuzhiyunresembling the dependency-ordering barrier that was later subsumed 467*4882a593Smuzhiyuninto ``rcu_dereference()`` and later still into ``READ_ONCE()``. 
The 468*4882a593Smuzhiyunneed for these operations made itself known quite suddenly at a 469*4882a593Smuzhiyunlate-1990s meeting with the DEC Alpha architects, back in the days when 470*4882a593SmuzhiyunDEC was still a free-standing company. It took the Alpha architects a 471*4882a593Smuzhiyungood hour to convince me that any sort of barrier would ever be needed, 472*4882a593Smuzhiyunand it then took me a good *two* hours to convince them that their 473*4882a593Smuzhiyundocumentation did not make this point clear. More recent work with the C 474*4882a593Smuzhiyunand C++ standards committees have provided much education on tricks and 475*4882a593Smuzhiyuntraps from the compiler. In short, compilers were much less tricky in 476*4882a593Smuzhiyunthe early 1990s, but in 2015, don't even think about omitting 477*4882a593Smuzhiyun``rcu_dereference()``! 478*4882a593Smuzhiyun 479*4882a593SmuzhiyunMemory-Barrier Guarantees 480*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~ 481*4882a593Smuzhiyun 482*4882a593SmuzhiyunThe previous section's simple linked-data-structure scenario clearly 483*4882a593Smuzhiyundemonstrates the need for RCU's stringent memory-ordering guarantees on 484*4882a593Smuzhiyunsystems with more than one CPU: 485*4882a593Smuzhiyun 486*4882a593Smuzhiyun#. Each CPU that has an RCU read-side critical section that begins 487*4882a593Smuzhiyun before ``synchronize_rcu()`` starts is guaranteed to execute a full 488*4882a593Smuzhiyun memory barrier between the time that the RCU read-side critical 489*4882a593Smuzhiyun section ends and the time that ``synchronize_rcu()`` returns. Without 490*4882a593Smuzhiyun this guarantee, a pre-existing RCU read-side critical section might 491*4882a593Smuzhiyun hold a reference to the newly removed ``struct foo`` after the 492*4882a593Smuzhiyun ``kfree()`` on line 14 of ``remove_gp_synchronous()``. 493*4882a593Smuzhiyun#. Each CPU that has an RCU read-side critical section that ends after 494*4882a593Smuzhiyun ``synchronize_rcu()`` returns is guaranteed to execute a full memory 495*4882a593Smuzhiyun barrier between the time that ``synchronize_rcu()`` begins and the 496*4882a593Smuzhiyun time that the RCU read-side critical section begins. Without this 497*4882a593Smuzhiyun guarantee, a later RCU read-side critical section running after the 498*4882a593Smuzhiyun ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run 499*4882a593Smuzhiyun ``do_something_gp()`` and find the newly deleted ``struct foo``. 500*4882a593Smuzhiyun#. If the task invoking ``synchronize_rcu()`` remains on a given CPU, 501*4882a593Smuzhiyun then that CPU is guaranteed to execute a full memory barrier sometime 502*4882a593Smuzhiyun during the execution of ``synchronize_rcu()``. This guarantee ensures 503*4882a593Smuzhiyun that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really 504*4882a593Smuzhiyun does execute after the removal on line 11. 505*4882a593Smuzhiyun#. If the task invoking ``synchronize_rcu()`` migrates among a group of 506*4882a593Smuzhiyun CPUs during that invocation, then each of the CPUs in that group is 507*4882a593Smuzhiyun guaranteed to execute a full memory barrier sometime during the 508*4882a593Smuzhiyun execution of ``synchronize_rcu()``. 
This guarantee also ensures that 509*4882a593Smuzhiyun the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does 510*4882a593Smuzhiyun execute after the removal on line 11, but also in the case where the 511*4882a593Smuzhiyun thread executing the ``synchronize_rcu()`` migrates in the meantime. 512*4882a593Smuzhiyun 513*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 514*4882a593Smuzhiyun| **Quick Quiz**: | 515*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 516*4882a593Smuzhiyun| Given that multiple CPUs can start RCU read-side critical sections at | 517*4882a593Smuzhiyun| any time without any ordering whatsoever, how can RCU possibly tell | 518*4882a593Smuzhiyun| whether or not a given RCU read-side critical section starts before a | 519*4882a593Smuzhiyun| given instance of ``synchronize_rcu()``? | 520*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 521*4882a593Smuzhiyun| **Answer**: | 522*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 523*4882a593Smuzhiyun| If RCU cannot tell whether or not a given RCU read-side critical | 524*4882a593Smuzhiyun| section starts before a given instance of ``synchronize_rcu()``, then | 525*4882a593Smuzhiyun| it must assume that the RCU read-side critical section started first. | 526*4882a593Smuzhiyun| In other words, a given instance of ``synchronize_rcu()`` can avoid | 527*4882a593Smuzhiyun| waiting on a given RCU read-side critical section only if it can | 528*4882a593Smuzhiyun| prove that ``synchronize_rcu()`` started first. | 529*4882a593Smuzhiyun| A related question is “When ``rcu_read_lock()`` doesn't generate any | 530*4882a593Smuzhiyun| code, why does it matter how it relates to a grace period?” The | 531*4882a593Smuzhiyun| answer is that it is not the relationship of ``rcu_read_lock()`` | 532*4882a593Smuzhiyun| itself that is important, but rather the relationship of the code | 533*4882a593Smuzhiyun| within the enclosed RCU read-side critical section to the code | 534*4882a593Smuzhiyun| preceding and following the grace period. If we take this viewpoint, | 535*4882a593Smuzhiyun| then a given RCU read-side critical section begins before a given | 536*4882a593Smuzhiyun| grace period when some access preceding the grace period observes the | 537*4882a593Smuzhiyun| effect of some access within the critical section, in which case none | 538*4882a593Smuzhiyun| of the accesses within the critical section may observe the effects | 539*4882a593Smuzhiyun| of any access following the grace period. | 540*4882a593Smuzhiyun| | 541*4882a593Smuzhiyun| As of late 2016, mathematical models of RCU take this viewpoint, for | 542*4882a593Smuzhiyun| example, see slides 62 and 63 of the `2016 LinuxCon | 543*4882a593Smuzhiyun| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 | 544*4882a593Smuzhiyun| 6.10.04c.LCE.pdf>`__ | 545*4882a593Smuzhiyun| presentation. | 546*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 547*4882a593Smuzhiyun 548*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 549*4882a593Smuzhiyun| **Quick Quiz**: | 550*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 551*4882a593Smuzhiyun| The first and second guarantees require unbelievably strict ordering! | 552*4882a593Smuzhiyun| Are all these memory barriers *really* required? 
| 553*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 554*4882a593Smuzhiyun| **Answer**: | 555*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 556*4882a593Smuzhiyun| Yes, they really are required. To see why the first guarantee is | 557*4882a593Smuzhiyun| required, consider the following sequence of events: | 558*4882a593Smuzhiyun| | 559*4882a593Smuzhiyun| #. CPU 1: ``rcu_read_lock()`` | 560*4882a593Smuzhiyun| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | 561*4882a593Smuzhiyun| #. CPU 0: ``list_del_rcu(p);`` | 562*4882a593Smuzhiyun| #. CPU 0: ``synchronize_rcu()`` starts. | 563*4882a593Smuzhiyun| #. CPU 1: ``do_something_with(q->a);`` | 564*4882a593Smuzhiyun| ``/* No smp_mb(), so might happen after kfree(). */`` | 565*4882a593Smuzhiyun| #. CPU 1: ``rcu_read_unlock()`` | 566*4882a593Smuzhiyun| #. CPU 0: ``synchronize_rcu()`` returns. | 567*4882a593Smuzhiyun| #. CPU 0: ``kfree(p);`` | 568*4882a593Smuzhiyun| | 569*4882a593Smuzhiyun| Therefore, there absolutely must be a full memory barrier between the | 570*4882a593Smuzhiyun| end of the RCU read-side critical section and the end of the grace | 571*4882a593Smuzhiyun| period. | 572*4882a593Smuzhiyun| | 573*4882a593Smuzhiyun| The sequence of events demonstrating the necessity of the second rule | 574*4882a593Smuzhiyun| is roughly similar: | 575*4882a593Smuzhiyun| | 576*4882a593Smuzhiyun| #. CPU 0: ``list_del_rcu(p);`` | 577*4882a593Smuzhiyun| #. CPU 0: ``synchronize_rcu()`` starts. | 578*4882a593Smuzhiyun| #. CPU 1: ``rcu_read_lock()`` | 579*4882a593Smuzhiyun| #. CPU 1: ``q = rcu_dereference(gp);`` | 580*4882a593Smuzhiyun| ``/* Might return p if no memory barrier. */`` | 581*4882a593Smuzhiyun| #. CPU 0: ``synchronize_rcu()`` returns. | 582*4882a593Smuzhiyun| #. CPU 0: ``kfree(p);`` | 583*4882a593Smuzhiyun| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | 584*4882a593Smuzhiyun| #. CPU 1: ``rcu_read_unlock()`` | 585*4882a593Smuzhiyun| | 586*4882a593Smuzhiyun| And similarly, without a memory barrier between the beginning of the | 587*4882a593Smuzhiyun| grace period and the beginning of the RCU read-side critical section, | 588*4882a593Smuzhiyun| CPU 1 might end up accessing the freelist. | 589*4882a593Smuzhiyun| | 590*4882a593Smuzhiyun| The “as if” rule of course applies, so that any implementation that | 591*4882a593Smuzhiyun| acts as if the appropriate memory barriers were in place is a correct | 592*4882a593Smuzhiyun| implementation. That said, it is much easier to fool yourself into | 593*4882a593Smuzhiyun| believing that you have adhered to the as-if rule than it is to | 594*4882a593Smuzhiyun| actually adhere to it! | 595*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 596*4882a593Smuzhiyun 597*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 598*4882a593Smuzhiyun| **Quick Quiz**: | 599*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 600*4882a593Smuzhiyun| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | 601*4882a593Smuzhiyun| absolutely no code in some kernel builds. This means that the | 602*4882a593Smuzhiyun| compiler might arbitrarily rearrange consecutive RCU read-side | 603*4882a593Smuzhiyun| critical sections. 
Given such rearrangement, if a given RCU read-side | 604*4882a593Smuzhiyun| critical section is done, how can you be sure that all prior RCU | 605*4882a593Smuzhiyun| read-side critical sections are done? Won't the compiler | 606*4882a593Smuzhiyun| rearrangements make that impossible to determine? | 607*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 608*4882a593Smuzhiyun| **Answer**: | 609*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 610*4882a593Smuzhiyun| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | 611*4882a593Smuzhiyun| absolutely no code, RCU infers quiescent states only at special | 612*4882a593Smuzhiyun| locations, for example, within the scheduler. Because calls to | 613*4882a593Smuzhiyun| ``schedule()`` had better prevent calling-code accesses to shared | 614*4882a593Smuzhiyun| variables from being rearranged across the call to ``schedule()``, if | 615*4882a593Smuzhiyun| RCU detects the end of a given RCU read-side critical section, it | 616*4882a593Smuzhiyun| will necessarily detect the end of all prior RCU read-side critical | 617*4882a593Smuzhiyun| sections, no matter how aggressively the compiler scrambles the code. | 618*4882a593Smuzhiyun| Again, this all assumes that the compiler cannot scramble code across | 619*4882a593Smuzhiyun| calls to the scheduler, out of interrupt handlers, into the idle | 620*4882a593Smuzhiyun| loop, into user-mode code, and so on. But if your kernel build allows | 621*4882a593Smuzhiyun| that sort of scrambling, you have broken far more than just RCU! | 622*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 623*4882a593Smuzhiyun 624*4882a593SmuzhiyunNote that these memory-barrier requirements do not replace the 625*4882a593Smuzhiyunfundamental RCU requirement that a grace period wait for all 626*4882a593Smuzhiyunpre-existing readers. On the contrary, the memory barriers called out in 627*4882a593Smuzhiyunthis section must operate in such a way as to *enforce* this fundamental 628*4882a593Smuzhiyunrequirement. Of course, different implementations enforce this 629*4882a593Smuzhiyunrequirement in different ways, but enforce it they must. 630*4882a593Smuzhiyun 631*4882a593SmuzhiyunRCU Primitives Guaranteed to Execute Unconditionally 632*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 633*4882a593Smuzhiyun 634*4882a593SmuzhiyunThe common-case RCU primitives are unconditional. They are invoked, they 635*4882a593Smuzhiyundo their job, and they return, with no possibility of error, and no need 636*4882a593Smuzhiyunto retry. This is a key RCU design philosophy. 637*4882a593Smuzhiyun 638*4882a593SmuzhiyunHowever, this philosophy is pragmatic rather than pigheaded. If someone 639*4882a593Smuzhiyuncomes up with a good justification for a particular conditional RCU 640*4882a593Smuzhiyunprimitive, it might well be implemented and added. After all, this 641*4882a593Smuzhiyunguarantee was reverse-engineered, not premeditated. The unconditional 642*4882a593Smuzhiyunnature of the RCU primitives was initially an accident of 643*4882a593Smuzhiyunimplementation, and later experience with synchronization primitives 644*4882a593Smuzhiyunwith conditional primitives caused me to elevate this accident to a 645*4882a593Smuzhiyunguarantee. 
Therefore, the justification for adding a conditional 646*4882a593Smuzhiyunprimitive to RCU would need to be based on detailed and compelling use 647*4882a593Smuzhiyuncases. 648*4882a593Smuzhiyun 649*4882a593SmuzhiyunGuaranteed Read-to-Write Upgrade 650*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 651*4882a593Smuzhiyun 652*4882a593SmuzhiyunAs far as RCU is concerned, it is always possible to carry out an update 653*4882a593Smuzhiyunwithin an RCU read-side critical section. For example, that RCU 654*4882a593Smuzhiyunread-side critical section might search for a given data element, and 655*4882a593Smuzhiyunthen might acquire the update-side spinlock in order to update that 656*4882a593Smuzhiyunelement, all while remaining in that RCU read-side critical section. Of 657*4882a593Smuzhiyuncourse, it is necessary to exit the RCU read-side critical section 658*4882a593Smuzhiyunbefore invoking ``synchronize_rcu()``, however, this inconvenience can 659*4882a593Smuzhiyunbe avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API 660*4882a593Smuzhiyunmembers described later in this document. 661*4882a593Smuzhiyun 662*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 663*4882a593Smuzhiyun| **Quick Quiz**: | 664*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 665*4882a593Smuzhiyun| But how does the upgrade-to-write operation exclude other readers? | 666*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 667*4882a593Smuzhiyun| **Answer**: | 668*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 669*4882a593Smuzhiyun| It doesn't, just like normal RCU updates, which also do not exclude | 670*4882a593Smuzhiyun| RCU readers. | 671*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 672*4882a593Smuzhiyun 673*4882a593SmuzhiyunThis guarantee allows lookup code to be shared between read-side and 674*4882a593Smuzhiyunupdate-side code, and was premeditated, appearing in the earliest 675*4882a593SmuzhiyunDYNIX/ptx RCU documentation. 676*4882a593Smuzhiyun 677*4882a593SmuzhiyunFundamental Non-Requirements 678*4882a593Smuzhiyun---------------------------- 679*4882a593Smuzhiyun 680*4882a593SmuzhiyunRCU provides extremely lightweight readers, and its read-side 681*4882a593Smuzhiyunguarantees, though quite useful, are correspondingly lightweight. It is 682*4882a593Smuzhiyuntherefore all too easy to assume that RCU is guaranteeing more than it 683*4882a593Smuzhiyunreally is. Of course, the list of things that RCU does not guarantee is 684*4882a593Smuzhiyuninfinitely long, however, the following sections list a few 685*4882a593Smuzhiyunnon-guarantees that have caused confusion. Except where otherwise noted, 686*4882a593Smuzhiyunthese non-guarantees were premeditated. 687*4882a593Smuzhiyun 688*4882a593Smuzhiyun#. `Readers Impose Minimal Ordering`_ 689*4882a593Smuzhiyun#. `Readers Do Not Exclude Updaters`_ 690*4882a593Smuzhiyun#. `Updaters Only Wait For Old Readers`_ 691*4882a593Smuzhiyun#. `Grace Periods Don't Partition Read-Side Critical Sections`_ 692*4882a593Smuzhiyun#. 
`Read-Side Critical Sections Don't Partition Grace Periods`_ 693*4882a593Smuzhiyun 694*4882a593SmuzhiyunReaders Impose Minimal Ordering 695*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 696*4882a593Smuzhiyun 697*4882a593SmuzhiyunReader-side markers such as ``rcu_read_lock()`` and 698*4882a593Smuzhiyun``rcu_read_unlock()`` provide absolutely no ordering guarantees except 699*4882a593Smuzhiyunthrough their interaction with the grace-period APIs such as 700*4882a593Smuzhiyun``synchronize_rcu()``. To see this, consider the following pair of 701*4882a593Smuzhiyunthreads: 702*4882a593Smuzhiyun 703*4882a593Smuzhiyun :: 704*4882a593Smuzhiyun 705*4882a593Smuzhiyun 1 void thread0(void) 706*4882a593Smuzhiyun 2 { 707*4882a593Smuzhiyun 3 rcu_read_lock(); 708*4882a593Smuzhiyun 4 WRITE_ONCE(x, 1); 709*4882a593Smuzhiyun 5 rcu_read_unlock(); 710*4882a593Smuzhiyun 6 rcu_read_lock(); 711*4882a593Smuzhiyun 7 WRITE_ONCE(y, 1); 712*4882a593Smuzhiyun 8 rcu_read_unlock(); 713*4882a593Smuzhiyun 9 } 714*4882a593Smuzhiyun 10 715*4882a593Smuzhiyun 11 void thread1(void) 716*4882a593Smuzhiyun 12 { 717*4882a593Smuzhiyun 13 rcu_read_lock(); 718*4882a593Smuzhiyun 14 r1 = READ_ONCE(y); 719*4882a593Smuzhiyun 15 rcu_read_unlock(); 720*4882a593Smuzhiyun 16 rcu_read_lock(); 721*4882a593Smuzhiyun 17 r2 = READ_ONCE(x); 722*4882a593Smuzhiyun 18 rcu_read_unlock(); 723*4882a593Smuzhiyun 19 } 724*4882a593Smuzhiyun 725*4882a593SmuzhiyunAfter ``thread0()`` and ``thread1()`` execute concurrently, it is quite 726*4882a593Smuzhiyunpossible to have 727*4882a593Smuzhiyun 728*4882a593Smuzhiyun :: 729*4882a593Smuzhiyun 730*4882a593Smuzhiyun (r1 == 1 && r2 == 0) 731*4882a593Smuzhiyun 732*4882a593Smuzhiyun(that is, ``y`` appears to have been assigned before ``x``), which would 733*4882a593Smuzhiyunnot be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had 734*4882a593Smuzhiyunmuch in the way of ordering properties. But they do not, so the CPU is 735*4882a593Smuzhiyunwithin its rights to do significant reordering. This is by design: Any 736*4882a593Smuzhiyunsignificant ordering constraints would slow down these fast-path APIs. 737*4882a593Smuzhiyun 738*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 739*4882a593Smuzhiyun| **Quick Quiz**: | 740*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 741*4882a593Smuzhiyun| Can't the compiler also reorder this code? | 742*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 743*4882a593Smuzhiyun| **Answer**: | 744*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 745*4882a593Smuzhiyun| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` | 746*4882a593Smuzhiyun| prevent the compiler from reordering in this particular case. | 747*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 748*4882a593Smuzhiyun 749*4882a593SmuzhiyunReaders Do Not Exclude Updaters 750*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 751*4882a593Smuzhiyun 752*4882a593SmuzhiyunNeither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates. 753*4882a593SmuzhiyunAll they do is to prevent grace periods from ending. 
The following 754*4882a593Smuzhiyunexample illustrates this: 755*4882a593Smuzhiyun 756*4882a593Smuzhiyun :: 757*4882a593Smuzhiyun 758*4882a593Smuzhiyun 1 void thread0(void) 759*4882a593Smuzhiyun 2 { 760*4882a593Smuzhiyun 3 rcu_read_lock(); 761*4882a593Smuzhiyun 4 r1 = READ_ONCE(y); 762*4882a593Smuzhiyun 5 if (r1) { 763*4882a593Smuzhiyun 6 do_something_with_nonzero_x(); 764*4882a593Smuzhiyun 7 r2 = READ_ONCE(x); 765*4882a593Smuzhiyun 8 WARN_ON(!r2); /* BUG!!! */ 766*4882a593Smuzhiyun 9 } 767*4882a593Smuzhiyun 10 rcu_read_unlock(); 768*4882a593Smuzhiyun 11 } 769*4882a593Smuzhiyun 12 770*4882a593Smuzhiyun 13 void thread1(void) 771*4882a593Smuzhiyun 14 { 772*4882a593Smuzhiyun 15 spin_lock(&my_lock); 773*4882a593Smuzhiyun 16 WRITE_ONCE(x, 1); 774*4882a593Smuzhiyun 17 WRITE_ONCE(y, 1); 775*4882a593Smuzhiyun 18 spin_unlock(&my_lock); 776*4882a593Smuzhiyun 19 } 777*4882a593Smuzhiyun 778*4882a593SmuzhiyunIf the ``thread0()`` function's ``rcu_read_lock()`` excluded the 779*4882a593Smuzhiyun``thread1()`` function's update, the ``WARN_ON()`` could never fire. But 780*4882a593Smuzhiyunthe fact is that ``rcu_read_lock()`` does not exclude much of anything 781*4882a593Smuzhiyunaside from subsequent grace periods, of which ``thread1()`` has none, so 782*4882a593Smuzhiyunthe ``WARN_ON()`` can and does fire. 783*4882a593Smuzhiyun 784*4882a593SmuzhiyunUpdaters Only Wait For Old Readers 785*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 786*4882a593Smuzhiyun 787*4882a593SmuzhiyunIt might be tempting to assume that after ``synchronize_rcu()`` 788*4882a593Smuzhiyuncompletes, there are no readers executing. This temptation must be 789*4882a593Smuzhiyunavoided because new readers can start immediately after 790*4882a593Smuzhiyun``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no 791*4882a593Smuzhiyunobligation to wait for these new readers. 792*4882a593Smuzhiyun 793*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 794*4882a593Smuzhiyun| **Quick Quiz**: | 795*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 796*4882a593Smuzhiyun| Suppose that synchronize_rcu() did wait until *all* readers had | 797*4882a593Smuzhiyun| completed instead of waiting only on pre-existing readers. For how | 798*4882a593Smuzhiyun| long would the updater be able to rely on there being no readers? | 799*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 800*4882a593Smuzhiyun| **Answer**: | 801*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 802*4882a593Smuzhiyun| For no time at all. Even if ``synchronize_rcu()`` were to wait until | 803*4882a593Smuzhiyun| all readers had completed, a new reader might start immediately after | 804*4882a593Smuzhiyun| ``synchronize_rcu()`` completed. Therefore, the code following | 805*4882a593Smuzhiyun| ``synchronize_rcu()`` can *never* rely on there being no readers. 
| 806*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 807*4882a593Smuzhiyun 808*4882a593SmuzhiyunGrace Periods Don't Partition Read-Side Critical Sections 809*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 810*4882a593Smuzhiyun 811*4882a593SmuzhiyunIt is tempting to assume that if any part of one RCU read-side critical 812*4882a593Smuzhiyunsection precedes a given grace period, and if any part of another RCU 813*4882a593Smuzhiyunread-side critical section follows that same grace period, then all of 814*4882a593Smuzhiyunthe first RCU read-side critical section must precede all of the second. 815*4882a593SmuzhiyunHowever, this just isn't the case: A single grace period does not 816*4882a593Smuzhiyunpartition the set of RCU read-side critical sections. An example of this 817*4882a593Smuzhiyunsituation can be illustrated as follows, where ``x``, ``y``, and ``z`` 818*4882a593Smuzhiyunare initially all zero: 819*4882a593Smuzhiyun 820*4882a593Smuzhiyun :: 821*4882a593Smuzhiyun 822*4882a593Smuzhiyun 1 void thread0(void) 823*4882a593Smuzhiyun 2 { 824*4882a593Smuzhiyun 3 rcu_read_lock(); 825*4882a593Smuzhiyun 4 WRITE_ONCE(a, 1); 826*4882a593Smuzhiyun 5 WRITE_ONCE(b, 1); 827*4882a593Smuzhiyun 6 rcu_read_unlock(); 828*4882a593Smuzhiyun 7 } 829*4882a593Smuzhiyun 8 830*4882a593Smuzhiyun 9 void thread1(void) 831*4882a593Smuzhiyun 10 { 832*4882a593Smuzhiyun 11 r1 = READ_ONCE(a); 833*4882a593Smuzhiyun 12 synchronize_rcu(); 834*4882a593Smuzhiyun 13 WRITE_ONCE(c, 1); 835*4882a593Smuzhiyun 14 } 836*4882a593Smuzhiyun 15 837*4882a593Smuzhiyun 16 void thread2(void) 838*4882a593Smuzhiyun 17 { 839*4882a593Smuzhiyun 18 rcu_read_lock(); 840*4882a593Smuzhiyun 19 r2 = READ_ONCE(b); 841*4882a593Smuzhiyun 20 r3 = READ_ONCE(c); 842*4882a593Smuzhiyun 21 rcu_read_unlock(); 843*4882a593Smuzhiyun 22 } 844*4882a593Smuzhiyun 845*4882a593SmuzhiyunIt turns out that the outcome: 846*4882a593Smuzhiyun 847*4882a593Smuzhiyun :: 848*4882a593Smuzhiyun 849*4882a593Smuzhiyun (r1 == 1 && r2 == 0 && r3 == 1) 850*4882a593Smuzhiyun 851*4882a593Smuzhiyunis entirely possible. The following figure show how this can happen, 852*4882a593Smuzhiyunwith each circled ``QS`` indicating the point at which RCU recorded a 853*4882a593Smuzhiyun*quiescent state* for each thread, that is, a state in which RCU knows 854*4882a593Smuzhiyunthat the thread cannot be in the midst of an RCU read-side critical 855*4882a593Smuzhiyunsection that started before the current grace period: 856*4882a593Smuzhiyun 857*4882a593Smuzhiyun.. 
kernel-figure:: GPpartitionReaders1.svg 858*4882a593Smuzhiyun 859*4882a593SmuzhiyunIf it is necessary to partition RCU read-side critical sections in this 860*4882a593Smuzhiyunmanner, it is necessary to use two grace periods, where the first grace 861*4882a593Smuzhiyunperiod is known to end before the second grace period starts: 862*4882a593Smuzhiyun 863*4882a593Smuzhiyun :: 864*4882a593Smuzhiyun 865*4882a593Smuzhiyun 1 void thread0(void) 866*4882a593Smuzhiyun 2 { 867*4882a593Smuzhiyun 3 rcu_read_lock(); 868*4882a593Smuzhiyun 4 WRITE_ONCE(a, 1); 869*4882a593Smuzhiyun 5 WRITE_ONCE(b, 1); 870*4882a593Smuzhiyun 6 rcu_read_unlock(); 871*4882a593Smuzhiyun 7 } 872*4882a593Smuzhiyun 8 873*4882a593Smuzhiyun 9 void thread1(void) 874*4882a593Smuzhiyun 10 { 875*4882a593Smuzhiyun 11 r1 = READ_ONCE(a); 876*4882a593Smuzhiyun 12 synchronize_rcu(); 877*4882a593Smuzhiyun 13 WRITE_ONCE(c, 1); 878*4882a593Smuzhiyun 14 } 879*4882a593Smuzhiyun 15 880*4882a593Smuzhiyun 16 void thread2(void) 881*4882a593Smuzhiyun 17 { 882*4882a593Smuzhiyun 18 r2 = READ_ONCE(c); 883*4882a593Smuzhiyun 19 synchronize_rcu(); 884*4882a593Smuzhiyun 20 WRITE_ONCE(d, 1); 885*4882a593Smuzhiyun 21 } 886*4882a593Smuzhiyun 22 887*4882a593Smuzhiyun 23 void thread3(void) 888*4882a593Smuzhiyun 24 { 889*4882a593Smuzhiyun 25 rcu_read_lock(); 890*4882a593Smuzhiyun 26 r3 = READ_ONCE(b); 891*4882a593Smuzhiyun 27 r4 = READ_ONCE(d); 892*4882a593Smuzhiyun 28 rcu_read_unlock(); 893*4882a593Smuzhiyun 29 } 894*4882a593Smuzhiyun 895*4882a593SmuzhiyunHere, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen 896*4882a593Smuzhiyunbefore the end of ``thread1()``'s grace period. If in addition 897*4882a593Smuzhiyun``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after 898*4882a593Smuzhiyunthe beginning of ``thread2()``'s grace period. If it is also the case 899*4882a593Smuzhiyunthat ``(r2 == 1)``, then the end of ``thread1()``'s grace period must 900*4882a593Smuzhiyunprecede the beginning of ``thread2()``'s grace period. This mean that 901*4882a593Smuzhiyunthe two RCU read-side critical sections cannot overlap, guaranteeing 902*4882a593Smuzhiyunthat ``(r3 == 1)``. As a result, the outcome: 903*4882a593Smuzhiyun 904*4882a593Smuzhiyun :: 905*4882a593Smuzhiyun 906*4882a593Smuzhiyun (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) 907*4882a593Smuzhiyun 908*4882a593Smuzhiyuncannot happen. 909*4882a593Smuzhiyun 910*4882a593SmuzhiyunThis non-requirement was also non-premeditated, but became apparent when 911*4882a593Smuzhiyunstudying RCU's interaction with memory ordering. 912*4882a593Smuzhiyun 913*4882a593SmuzhiyunRead-Side Critical Sections Don't Partition Grace Periods 914*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 915*4882a593Smuzhiyun 916*4882a593SmuzhiyunIt is also tempting to assume that if an RCU read-side critical section 917*4882a593Smuzhiyunhappens between a pair of grace periods, then those grace periods cannot 918*4882a593Smuzhiyunoverlap. 
However, this temptation leads nowhere good, as can be 919*4882a593Smuzhiyunillustrated by the following, with all variables initially zero: 920*4882a593Smuzhiyun 921*4882a593Smuzhiyun :: 922*4882a593Smuzhiyun 923*4882a593Smuzhiyun 1 void thread0(void) 924*4882a593Smuzhiyun 2 { 925*4882a593Smuzhiyun 3 rcu_read_lock(); 926*4882a593Smuzhiyun 4 WRITE_ONCE(a, 1); 927*4882a593Smuzhiyun 5 WRITE_ONCE(b, 1); 928*4882a593Smuzhiyun 6 rcu_read_unlock(); 929*4882a593Smuzhiyun 7 } 930*4882a593Smuzhiyun 8 931*4882a593Smuzhiyun 9 void thread1(void) 932*4882a593Smuzhiyun 10 { 933*4882a593Smuzhiyun 11 r1 = READ_ONCE(a); 934*4882a593Smuzhiyun 12 synchronize_rcu(); 935*4882a593Smuzhiyun 13 WRITE_ONCE(c, 1); 936*4882a593Smuzhiyun 14 } 937*4882a593Smuzhiyun 15 938*4882a593Smuzhiyun 16 void thread2(void) 939*4882a593Smuzhiyun 17 { 940*4882a593Smuzhiyun 18 rcu_read_lock(); 941*4882a593Smuzhiyun 19 WRITE_ONCE(d, 1); 942*4882a593Smuzhiyun 20 r2 = READ_ONCE(c); 943*4882a593Smuzhiyun 21 rcu_read_unlock(); 944*4882a593Smuzhiyun 22 } 945*4882a593Smuzhiyun 23 946*4882a593Smuzhiyun 24 void thread3(void) 947*4882a593Smuzhiyun 25 { 948*4882a593Smuzhiyun 26 r3 = READ_ONCE(d); 949*4882a593Smuzhiyun 27 synchronize_rcu(); 950*4882a593Smuzhiyun 28 WRITE_ONCE(e, 1); 951*4882a593Smuzhiyun 29 } 952*4882a593Smuzhiyun 30 953*4882a593Smuzhiyun 31 void thread4(void) 954*4882a593Smuzhiyun 32 { 955*4882a593Smuzhiyun 33 rcu_read_lock(); 956*4882a593Smuzhiyun 34 r4 = READ_ONCE(b); 957*4882a593Smuzhiyun 35 r5 = READ_ONCE(e); 958*4882a593Smuzhiyun 36 rcu_read_unlock(); 959*4882a593Smuzhiyun 37 } 960*4882a593Smuzhiyun 961*4882a593SmuzhiyunIn this case, the outcome: 962*4882a593Smuzhiyun 963*4882a593Smuzhiyun :: 964*4882a593Smuzhiyun 965*4882a593Smuzhiyun (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) 966*4882a593Smuzhiyun 967*4882a593Smuzhiyunis entirely possible, as illustrated below: 968*4882a593Smuzhiyun 969*4882a593Smuzhiyun.. kernel-figure:: ReadersPartitionGP1.svg 970*4882a593Smuzhiyun 971*4882a593SmuzhiyunAgain, an RCU read-side critical section can overlap almost all of a 972*4882a593Smuzhiyungiven grace period, just so long as it does not overlap the entire grace 973*4882a593Smuzhiyunperiod. As a result, an RCU read-side critical section cannot partition 974*4882a593Smuzhiyuna pair of RCU grace periods. 975*4882a593Smuzhiyun 976*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 977*4882a593Smuzhiyun| **Quick Quiz**: | 978*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 979*4882a593Smuzhiyun| How long a sequence of grace periods, each separated by an RCU | 980*4882a593Smuzhiyun| read-side critical section, would be required to partition the RCU | 981*4882a593Smuzhiyun| read-side critical sections at the beginning and end of the chain? | 982*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 983*4882a593Smuzhiyun| **Answer**: | 984*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 985*4882a593Smuzhiyun| In theory, an infinite number. In practice, an unknown number that is | 986*4882a593Smuzhiyun| sensitive to both implementation details and timing considerations. | 987*4882a593Smuzhiyun| Therefore, even in practice, RCU users must abide by the theoretical | 988*4882a593Smuzhiyun| rather than the practical answer. 
| 989*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 990*4882a593Smuzhiyun 991*4882a593SmuzhiyunParallelism Facts of Life 992*4882a593Smuzhiyun------------------------- 993*4882a593Smuzhiyun 994*4882a593SmuzhiyunThese parallelism facts of life are by no means specific to RCU, but the 995*4882a593SmuzhiyunRCU implementation must abide by them. They therefore bear repeating: 996*4882a593Smuzhiyun 997*4882a593Smuzhiyun#. Any CPU or task may be delayed at any time, and any attempts to avoid 998*4882a593Smuzhiyun these delays by disabling preemption, interrupts, or whatever are 999*4882a593Smuzhiyun completely futile. This is most obvious in preemptible user-level 1000*4882a593Smuzhiyun environments and in virtualized environments (where a given guest 1001*4882a593Smuzhiyun OS's VCPUs can be preempted at any time by the underlying 1002*4882a593Smuzhiyun hypervisor), but can also happen in bare-metal environments due to 1003*4882a593Smuzhiyun ECC errors, NMIs, and other hardware events. Although a delay of more 1004*4882a593Smuzhiyun than about 20 seconds can result in splats, the RCU implementation is 1005*4882a593Smuzhiyun obligated to use algorithms that can tolerate extremely long delays, 1006*4882a593Smuzhiyun but where “extremely long” is not long enough to allow wrap-around 1007*4882a593Smuzhiyun when incrementing a 64-bit counter. 1008*4882a593Smuzhiyun#. Both the compiler and the CPU can reorder memory accesses. Where it 1009*4882a593Smuzhiyun matters, RCU must use compiler directives and memory-barrier 1010*4882a593Smuzhiyun instructions to preserve ordering. 1011*4882a593Smuzhiyun#. Conflicting writes to memory locations in any given cache line will 1012*4882a593Smuzhiyun result in expensive cache misses. Greater numbers of concurrent 1013*4882a593Smuzhiyun writes and more-frequent concurrent writes will result in more 1014*4882a593Smuzhiyun dramatic slowdowns. RCU is therefore obligated to use algorithms that 1015*4882a593Smuzhiyun have sufficient locality to avoid significant performance and 1016*4882a593Smuzhiyun scalability problems. 1017*4882a593Smuzhiyun#. As a rough rule of thumb, only one CPU's worth of processing may be 1018*4882a593Smuzhiyun carried out under the protection of any given exclusive lock. RCU 1019*4882a593Smuzhiyun must therefore use scalable locking designs. 1020*4882a593Smuzhiyun#. Counters are finite, especially on 32-bit systems. RCU's use of 1021*4882a593Smuzhiyun counters must therefore tolerate counter wrap, or be designed such 1022*4882a593Smuzhiyun that counter wrap would take way more time than a single system is 1023*4882a593Smuzhiyun likely to run. An uptime of ten years is quite possible, a runtime of 1024*4882a593Smuzhiyun a century much less so. As an example of the latter, RCU's 1025*4882a593Smuzhiyun dyntick-idle nesting counter allows 54 bits for interrupt nesting 1026*4882a593Smuzhiyun level (this counter is 64 bits even on a 32-bit system). Overflowing 1027*4882a593Smuzhiyun this counter requires 2\ :sup:`54` half-interrupts on a given CPU 1028*4882a593Smuzhiyun without that CPU ever going idle. If a half-interrupt happened every 1029*4882a593Smuzhiyun microsecond, it would take 570 years of runtime to overflow this 1030*4882a593Smuzhiyun counter, which is currently believed to be an acceptably long time. 1031*4882a593Smuzhiyun#. Linux systems can have thousands of CPUs running a single Linux 1032*4882a593Smuzhiyun kernel in a single shared-memory environment. 
RCU must therefore pay
1033*4882a593Smuzhiyun close attention to high-end scalability.
1034*4882a593Smuzhiyun
1035*4882a593SmuzhiyunThis last parallelism fact of life means that RCU must pay special
1036*4882a593Smuzhiyunattention to the preceding facts of life. The idea that Linux might
1037*4882a593Smuzhiyunscale to systems with thousands of CPUs would have been met with some
1038*4882a593Smuzhiyunskepticism in the 1990s, but these requirements would otherwise
1039*4882a593Smuzhiyunhave been unsurprising, even in the early 1990s.
1040*4882a593Smuzhiyun
1041*4882a593SmuzhiyunQuality-of-Implementation Requirements
1042*4882a593Smuzhiyun--------------------------------------
1043*4882a593Smuzhiyun
1044*4882a593SmuzhiyunThese sections list quality-of-implementation requirements. Although an
1045*4882a593SmuzhiyunRCU implementation that ignores these requirements could still be used,
1046*4882a593Smuzhiyunit would likely be subject to limitations that would make it
1047*4882a593Smuzhiyuninappropriate for industrial-strength production use. Classes of
1048*4882a593Smuzhiyunquality-of-implementation requirements are as follows:
1049*4882a593Smuzhiyun
1050*4882a593Smuzhiyun#. `Specialization`_
1051*4882a593Smuzhiyun#. `Performance and Scalability`_
1052*4882a593Smuzhiyun#. `Forward Progress`_
1053*4882a593Smuzhiyun#. `Composability`_
1054*4882a593Smuzhiyun#. `Corner Cases`_
1055*4882a593Smuzhiyun
1056*4882a593SmuzhiyunThese classes are covered in the following sections.
1057*4882a593Smuzhiyun
1058*4882a593SmuzhiyunSpecialization
1059*4882a593Smuzhiyun~~~~~~~~~~~~~~
1060*4882a593Smuzhiyun
1061*4882a593SmuzhiyunRCU is and always has been intended primarily for read-mostly
1062*4882a593Smuzhiyunsituations, which means that RCU's read-side primitives are optimized,
1063*4882a593Smuzhiyunoften at the expense of its update-side primitives. Experience thus far
1064*4882a593Smuzhiyunis captured by the following list of situations:
1065*4882a593Smuzhiyun
1066*4882a593Smuzhiyun#. Read-mostly data, where stale and inconsistent data is not a problem:
1067*4882a593Smuzhiyun RCU works great!
1068*4882a593Smuzhiyun#. Read-mostly data, where data must be consistent: RCU works well.
1069*4882a593Smuzhiyun#. Read-write data, where data must be consistent: RCU *might* work OK.
1070*4882a593Smuzhiyun Or not.
1071*4882a593Smuzhiyun#. Write-mostly data, where data must be consistent: RCU is very
1072*4882a593Smuzhiyun unlikely to be the right tool for the job, with the following
1073*4882a593Smuzhiyun exceptions, where RCU can provide:
1074*4882a593Smuzhiyun
1075*4882a593Smuzhiyun a. Existence guarantees for update-friendly mechanisms.
1076*4882a593Smuzhiyun b. Wait-free read-side primitives for real-time use.
1077*4882a593Smuzhiyun
1078*4882a593SmuzhiyunThis focus on read-mostly situations means that RCU must interoperate
1079*4882a593Smuzhiyunwith other synchronization primitives. For example, the ``add_gp()`` and
1080*4882a593Smuzhiyun``remove_gp_synchronous()`` examples discussed earlier use RCU to
1081*4882a593Smuzhiyunprotect readers and locking to coordinate updaters. However, the need
1082*4882a593Smuzhiyunextends much farther, requiring that a variety of synchronization
1083*4882a593Smuzhiyunprimitives be legal within RCU read-side critical sections, including
1084*4882a593Smuzhiyunspinlocks, sequence locks, atomic operations, reference counters, and
1085*4882a593Smuzhiyunmemory barriers.
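
As a rough illustration of this interoperation, here is a minimal sketch
(the ``struct gp_elem``, ``gp_elem_ptr``, and ``update_gp_elem()`` names
are hypothetical and do not appear elsewhere in this document) in which a
per-element spinlock is acquired from within an RCU read-side critical
section so that the element can be updated consistently and a
just-removed element can be detected:

   ::

       1 struct gp_elem {
       2   int a;
       3   bool removed;
       4   spinlock_t lock;
       5 };
       6
       7 static struct gp_elem __rcu *gp_elem_ptr;
       8
       9 bool update_gp_elem(int newval)
      10 {
      11   struct gp_elem *p;
      12
      13   rcu_read_lock();
      14   p = rcu_dereference(gp_elem_ptr);
      15   if (!p) {
      16     rcu_read_unlock();
      17     return false;
      18   }
      19   spin_lock(&p->lock);  /* Spinlocks are legal within the critical section. */
      20   if (p->removed) {     /* Assumes updaters set ->removed before unlinking. */
      21     spin_unlock(&p->lock);
      22     rcu_read_unlock();
      23     return false;
      24   }
      25   p->a = newval;
      26   spin_unlock(&p->lock);
      27   rcu_read_unlock();
      28   return true;
      29 }

Because the element cannot be freed until a grace period elapses after it
has been removed, holding ``rcu_read_lock()`` across the ``spin_lock()``
guarantees that the lock acquisition never touches freed memory.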
1086*4882a593Smuzhiyun
1087*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1088*4882a593Smuzhiyun| **Quick Quiz**: |
1089*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1090*4882a593Smuzhiyun| What about sleeping locks? |
1091*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1092*4882a593Smuzhiyun| **Answer**: |
1093*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1094*4882a593Smuzhiyun| These are forbidden within Linux-kernel RCU read-side critical |
1095*4882a593Smuzhiyun| sections because it is not legal to place a quiescent state (in this |
1096*4882a593Smuzhiyun| case, voluntary context switch) within an RCU read-side critical |
1097*4882a593Smuzhiyun| section. However, sleeping locks may be used within userspace RCU |
1098*4882a593Smuzhiyun| read-side critical sections, and also within Linux-kernel sleepable |
1099*4882a593Smuzhiyun| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In |
1100*4882a593Smuzhiyun| addition, the -rt patchset turns spinlocks into sleeping locks so |
1101*4882a593Smuzhiyun| that the corresponding critical sections can be preempted, which also |
1102*4882a593Smuzhiyun| means that these sleeplockified spinlocks (but not other sleeping |
1103*4882a593Smuzhiyun| locks!) may be acquired within -rt-Linux-kernel RCU read-side critical |
1104*4882a593Smuzhiyun| sections. |
1105*4882a593Smuzhiyun| Note that it *is* legal for a normal RCU read-side critical section |
1106*4882a593Smuzhiyun| to conditionally acquire a sleeping lock (as in |
1107*4882a593Smuzhiyun| ``mutex_trylock()``), but only as long as it does not loop |
1108*4882a593Smuzhiyun| indefinitely attempting to conditionally acquire that sleeping lock. |
1109*4882a593Smuzhiyun| The key point is that things like ``mutex_trylock()`` either return |
1110*4882a593Smuzhiyun| with the mutex held, or return an error indication if the mutex was |
1111*4882a593Smuzhiyun| not immediately available. Either way, ``mutex_trylock()`` returns |
1112*4882a593Smuzhiyun| immediately without sleeping. |
1113*4882a593Smuzhiyun+-----------------------------------------------------------------------+
1114*4882a593Smuzhiyun
1115*4882a593SmuzhiyunIt often comes as a surprise that many algorithms do not require a
1116*4882a593Smuzhiyunconsistent view of data, but many can function in that mode, with
1117*4882a593Smuzhiyunnetwork routing being the poster child. Internet routing algorithms take
1118*4882a593Smuzhiyunsignificant time to propagate updates, so that by the time an update
1119*4882a593Smuzhiyunarrives at a given system, that system has been sending network traffic
1120*4882a593Smuzhiyunthe wrong way for a considerable length of time. Having a few threads
1121*4882a593Smuzhiyuncontinue to send traffic the wrong way for a few more milliseconds is
1122*4882a593Smuzhiyunclearly not a problem: In the worst case, TCP retransmissions will
1123*4882a593Smuzhiyuneventually get the data where it needs to go. In general, when tracking
1124*4882a593Smuzhiyunthe state of the universe outside of the computer, some level of
1125*4882a593Smuzhiyuninconsistency must be tolerated due to speed-of-light delays if nothing
1126*4882a593Smuzhiyunelse.
1127*4882a593Smuzhiyun
1128*4882a593SmuzhiyunFurthermore, uncertainty about external state is inherent in many cases.
1129*4882a593SmuzhiyunFor example, a pair of veterinarians might use heartbeat to determine 1130*4882a593Smuzhiyunwhether or not a given cat was alive. But how long should they wait 1131*4882a593Smuzhiyunafter the last heartbeat to decide that the cat is in fact dead? Waiting 1132*4882a593Smuzhiyunless than 400 milliseconds makes no sense because this would mean that a 1133*4882a593Smuzhiyunrelaxed cat would be considered to cycle between death and life more 1134*4882a593Smuzhiyunthan 100 times per minute. Moreover, just as with human beings, a cat's 1135*4882a593Smuzhiyunheart might stop for some period of time, so the exact wait period is a 1136*4882a593Smuzhiyunjudgment call. One of our pair of veterinarians might wait 30 seconds 1137*4882a593Smuzhiyunbefore pronouncing the cat dead, while the other might insist on waiting 1138*4882a593Smuzhiyuna full minute. The two veterinarians would then disagree on the state of 1139*4882a593Smuzhiyunthe cat during the final 30 seconds of the minute following the last 1140*4882a593Smuzhiyunheartbeat. 1141*4882a593Smuzhiyun 1142*4882a593SmuzhiyunInterestingly enough, this same situation applies to hardware. When push 1143*4882a593Smuzhiyuncomes to shove, how do we tell whether or not some external server has 1144*4882a593Smuzhiyunfailed? We send messages to it periodically, and declare it failed if we 1145*4882a593Smuzhiyundon't receive a response within a given period of time. Policy decisions 1146*4882a593Smuzhiyuncan usually tolerate short periods of inconsistency. The policy was 1147*4882a593Smuzhiyundecided some time ago, and is only now being put into effect, so a few 1148*4882a593Smuzhiyunmilliseconds of delay is normally inconsequential. 1149*4882a593Smuzhiyun 1150*4882a593SmuzhiyunHowever, there are algorithms that absolutely must see consistent data. 1151*4882a593SmuzhiyunFor example, the translation between a user-level SystemV semaphore ID 1152*4882a593Smuzhiyunto the corresponding in-kernel data structure is protected by RCU, but 1153*4882a593Smuzhiyunit is absolutely forbidden to update a semaphore that has just been 1154*4882a593Smuzhiyunremoved. In the Linux kernel, this need for consistency is accommodated 1155*4882a593Smuzhiyunby acquiring spinlocks located in the in-kernel data structure from 1156*4882a593Smuzhiyunwithin the RCU read-side critical section, and this is indicated by the 1157*4882a593Smuzhiyungreen box in the figure above. Many other techniques may be used, and 1158*4882a593Smuzhiyunare in fact used within the Linux kernel. 1159*4882a593Smuzhiyun 1160*4882a593SmuzhiyunIn short, RCU is not required to maintain consistency, and other 1161*4882a593Smuzhiyunmechanisms may be used in concert with RCU when consistency is required. 1162*4882a593SmuzhiyunRCU's specialization allows it to do its job extremely well, and its 1163*4882a593Smuzhiyunability to interoperate with other synchronization mechanisms allows the 1164*4882a593Smuzhiyunright mix of synchronization tools to be used for a given job. 1165*4882a593Smuzhiyun 1166*4882a593SmuzhiyunPerformance and Scalability 1167*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1168*4882a593Smuzhiyun 1169*4882a593SmuzhiyunEnergy efficiency is a critical component of performance today, and 1170*4882a593SmuzhiyunLinux-kernel RCU implementations must therefore avoid unnecessarily 1171*4882a593Smuzhiyunawakening idle CPUs. I cannot claim that this requirement was 1172*4882a593Smuzhiyunpremeditated. 
In fact, I learned of it during a telephone conversation
1173*4882a593Smuzhiyunin which I was given “frank and open” feedback on the importance of
1174*4882a593Smuzhiyunenergy efficiency in battery-powered systems and on specific
1175*4882a593Smuzhiyunenergy-efficiency shortcomings of the Linux-kernel RCU implementation.
1176*4882a593SmuzhiyunIn my experience, the battery-powered embedded community will consider
1177*4882a593Smuzhiyunany unnecessary wakeups to be extremely unfriendly acts. So much so that
1178*4882a593Smuzhiyunmere Linux-kernel-mailing-list posts are insufficient to vent their ire.
1179*4882a593Smuzhiyun
1180*4882a593SmuzhiyunMemory consumption is not particularly important in most situations,
1181*4882a593Smuzhiyunand has become decreasingly so as memory sizes have expanded and memory
1182*4882a593Smuzhiyuncosts have plummeted. However, as I learned from Matt Mackall's
1183*4882a593Smuzhiyun`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
1184*4882a593Smuzhiyunfootprint is critically important on single-CPU systems with
1185*4882a593Smuzhiyunnon-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
1186*4882a593SmuzhiyunRCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__
1187*4882a593Smuzhiyunwas born. Josh Triplett has since taken over the small-memory banner
1188*4882a593Smuzhiyunwith his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__
1189*4882a593Smuzhiyunproject, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional
1190*4882a593Smuzhiyunfor those kernels not needing it.
1191*4882a593Smuzhiyun
1192*4882a593SmuzhiyunThe remaining performance requirements are, for the most part,
1193*4882a593Smuzhiyununsurprising. For example, in keeping with RCU's read-side
1194*4882a593Smuzhiyunspecialization, ``rcu_dereference()`` should have negligible overhead
1195*4882a593Smuzhiyun(for example, suppression of a few minor compiler optimizations).
1196*4882a593SmuzhiyunSimilarly, in non-preemptible environments, ``rcu_read_lock()`` and
1197*4882a593Smuzhiyun``rcu_read_unlock()`` should have exactly zero overhead.
1198*4882a593Smuzhiyun
1199*4882a593SmuzhiyunIn preemptible environments, in the case where the RCU read-side
1200*4882a593Smuzhiyuncritical section was not preempted (as will be the case for the
1201*4882a593Smuzhiyunhighest-priority real-time process), ``rcu_read_lock()`` and
1202*4882a593Smuzhiyun``rcu_read_unlock()`` should have minimal overhead. In particular, they
1203*4882a593Smuzhiyunshould not contain atomic read-modify-write operations, memory-barrier
1204*4882a593Smuzhiyuninstructions, preemption disabling, interrupt disabling, or backwards
1205*4882a593Smuzhiyunbranches. However, in the case where the RCU read-side critical section
1206*4882a593Smuzhiyunwas preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable
1207*4882a593Smuzhiyuninterrupts. This is why it is better to nest an RCU read-side critical
1208*4882a593Smuzhiyunsection within a preempt-disable region than vice versa, at least in
1209*4882a593Smuzhiyuncases where that critical section is short enough to avoid unduly
1210*4882a593Smuzhiyundegrading real-time latencies.
1211*4882a593Smuzhiyun
1212*4882a593SmuzhiyunThe ``synchronize_rcu()`` grace-period-wait primitive is optimized for
1213*4882a593Smuzhiyunthroughput. It may therefore incur several milliseconds of latency in
1214*4882a593Smuzhiyunaddition to the duration of the longest RCU read-side critical section.
1215*4882a593SmuzhiyunOn the other hand, multiple concurrent invocations of 1216*4882a593Smuzhiyun``synchronize_rcu()`` are required to use batching optimizations so that 1217*4882a593Smuzhiyunthey can be satisfied by a single underlying grace-period-wait 1218*4882a593Smuzhiyunoperation. For example, in the Linux kernel, it is not unusual for a 1219*4882a593Smuzhiyunsingle grace-period-wait operation to serve more than `1,000 separate 1220*4882a593Smuzhiyuninvocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__ 1221*4882a593Smuzhiyunof ``synchronize_rcu()``, thus amortizing the per-invocation overhead 1222*4882a593Smuzhiyundown to nearly zero. However, the grace-period optimization is also 1223*4882a593Smuzhiyunrequired to avoid measurable degradation of real-time scheduling and 1224*4882a593Smuzhiyuninterrupt latencies. 1225*4882a593Smuzhiyun 1226*4882a593SmuzhiyunIn some cases, the multi-millisecond ``synchronize_rcu()`` latencies are 1227*4882a593Smuzhiyununacceptable. In these cases, ``synchronize_rcu_expedited()`` may be 1228*4882a593Smuzhiyunused instead, reducing the grace-period latency down to a few tens of 1229*4882a593Smuzhiyunmicroseconds on small systems, at least in cases where the RCU read-side 1230*4882a593Smuzhiyuncritical sections are short. There are currently no special latency 1231*4882a593Smuzhiyunrequirements for ``synchronize_rcu_expedited()`` on large systems, but, 1232*4882a593Smuzhiyunconsistent with the empirical nature of the RCU specification, that is 1233*4882a593Smuzhiyunsubject to change. However, there most definitely are scalability 1234*4882a593Smuzhiyunrequirements: A storm of ``synchronize_rcu_expedited()`` invocations on 1235*4882a593Smuzhiyun4096 CPUs should at least make reasonable forward progress. In return 1236*4882a593Smuzhiyunfor its shorter latencies, ``synchronize_rcu_expedited()`` is permitted 1237*4882a593Smuzhiyunto impose modest degradation of real-time latency on non-idle online 1238*4882a593SmuzhiyunCPUs. Here, “modest” means roughly the same latency degradation as a 1239*4882a593Smuzhiyunscheduling-clock interrupt. 1240*4882a593Smuzhiyun 1241*4882a593SmuzhiyunThere are a number of situations where even 1242*4882a593Smuzhiyun``synchronize_rcu_expedited()``'s reduced grace-period latency is 1243*4882a593Smuzhiyununacceptable. 
In these situations, the asynchronous ``call_rcu()`` can 1244*4882a593Smuzhiyunbe used in place of ``synchronize_rcu()`` as follows: 1245*4882a593Smuzhiyun 1246*4882a593Smuzhiyun :: 1247*4882a593Smuzhiyun 1248*4882a593Smuzhiyun 1 struct foo { 1249*4882a593Smuzhiyun 2 int a; 1250*4882a593Smuzhiyun 3 int b; 1251*4882a593Smuzhiyun 4 struct rcu_head rh; 1252*4882a593Smuzhiyun 5 }; 1253*4882a593Smuzhiyun 6 1254*4882a593Smuzhiyun 7 static void remove_gp_cb(struct rcu_head *rhp) 1255*4882a593Smuzhiyun 8 { 1256*4882a593Smuzhiyun 9 struct foo *p = container_of(rhp, struct foo, rh); 1257*4882a593Smuzhiyun 10 1258*4882a593Smuzhiyun 11 kfree(p); 1259*4882a593Smuzhiyun 12 } 1260*4882a593Smuzhiyun 13 1261*4882a593Smuzhiyun 14 bool remove_gp_asynchronous(void) 1262*4882a593Smuzhiyun 15 { 1263*4882a593Smuzhiyun 16 struct foo *p; 1264*4882a593Smuzhiyun 17 1265*4882a593Smuzhiyun 18 spin_lock(&gp_lock); 1266*4882a593Smuzhiyun 19 p = rcu_access_pointer(gp); 1267*4882a593Smuzhiyun 20 if (!p) { 1268*4882a593Smuzhiyun 21 spin_unlock(&gp_lock); 1269*4882a593Smuzhiyun 22 return false; 1270*4882a593Smuzhiyun 23 } 1271*4882a593Smuzhiyun 24 rcu_assign_pointer(gp, NULL); 1272*4882a593Smuzhiyun 25 call_rcu(&p->rh, remove_gp_cb); 1273*4882a593Smuzhiyun 26 spin_unlock(&gp_lock); 1274*4882a593Smuzhiyun 27 return true; 1275*4882a593Smuzhiyun 28 } 1276*4882a593Smuzhiyun 1277*4882a593SmuzhiyunA definition of ``struct foo`` is finally needed, and appears on 1278*4882a593Smuzhiyunlines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()`` 1279*4882a593Smuzhiyunon line 25, and will be invoked after the end of a subsequent grace 1280*4882a593Smuzhiyunperiod. This gets the same effect as ``remove_gp_synchronous()``, but 1281*4882a593Smuzhiyunwithout forcing the updater to wait for a grace period to elapse. The 1282*4882a593Smuzhiyun``call_rcu()`` function may be used in a number of situations where 1283*4882a593Smuzhiyunneither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would 1284*4882a593Smuzhiyunbe legal, including within preempt-disable code, ``local_bh_disable()`` 1285*4882a593Smuzhiyuncode, interrupt-disable code, and interrupt handlers. However, even 1286*4882a593Smuzhiyun``call_rcu()`` is illegal within NMI handlers and from idle and offline 1287*4882a593SmuzhiyunCPUs. The callback function (``remove_gp_cb()`` in this case) will be 1288*4882a593Smuzhiyunexecuted within softirq (software interrupt) environment within the 1289*4882a593SmuzhiyunLinux kernel, either within a real softirq handler or under the 1290*4882a593Smuzhiyunprotection of ``local_bh_disable()``. In both the Linux kernel and in 1291*4882a593Smuzhiyunuserspace, it is bad practice to write an RCU callback function that 1292*4882a593Smuzhiyuntakes too long. Long-running operations should be relegated to separate 1293*4882a593Smuzhiyunthreads or (in the Linux kernel) workqueues. 1294*4882a593Smuzhiyun 1295*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1296*4882a593Smuzhiyun| **Quick Quiz**: | 1297*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1298*4882a593Smuzhiyun| Why does line 19 use ``rcu_access_pointer()``? After all, | 1299*4882a593Smuzhiyun| ``call_rcu()`` on line 25 stores into the structure, which would | 1300*4882a593Smuzhiyun| interact badly with concurrent insertions. Doesn't this mean that | 1301*4882a593Smuzhiyun| ``rcu_dereference()`` is required? 
| 1302*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1303*4882a593Smuzhiyun| **Answer**: | 1304*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1305*4882a593Smuzhiyun| Presumably the ``->gp_lock`` acquired on line 18 excludes any | 1306*4882a593Smuzhiyun| changes, including any insertions that ``rcu_dereference()`` would | 1307*4882a593Smuzhiyun| protect against. Therefore, any insertions will be delayed until | 1308*4882a593Smuzhiyun| after ``->gp_lock`` is released on line 25, which in turn means that | 1309*4882a593Smuzhiyun| ``rcu_access_pointer()`` suffices. | 1310*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1311*4882a593Smuzhiyun 1312*4882a593SmuzhiyunHowever, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on 1313*4882a593Smuzhiyunthe data element. This is a common idiom, and is supported by 1314*4882a593Smuzhiyun``kfree_rcu()``, which allows “fire and forget” operation as shown 1315*4882a593Smuzhiyunbelow: 1316*4882a593Smuzhiyun 1317*4882a593Smuzhiyun :: 1318*4882a593Smuzhiyun 1319*4882a593Smuzhiyun 1 struct foo { 1320*4882a593Smuzhiyun 2 int a; 1321*4882a593Smuzhiyun 3 int b; 1322*4882a593Smuzhiyun 4 struct rcu_head rh; 1323*4882a593Smuzhiyun 5 }; 1324*4882a593Smuzhiyun 6 1325*4882a593Smuzhiyun 7 bool remove_gp_faf(void) 1326*4882a593Smuzhiyun 8 { 1327*4882a593Smuzhiyun 9 struct foo *p; 1328*4882a593Smuzhiyun 10 1329*4882a593Smuzhiyun 11 spin_lock(&gp_lock); 1330*4882a593Smuzhiyun 12 p = rcu_dereference(gp); 1331*4882a593Smuzhiyun 13 if (!p) { 1332*4882a593Smuzhiyun 14 spin_unlock(&gp_lock); 1333*4882a593Smuzhiyun 15 return false; 1334*4882a593Smuzhiyun 16 } 1335*4882a593Smuzhiyun 17 rcu_assign_pointer(gp, NULL); 1336*4882a593Smuzhiyun 18 kfree_rcu(p, rh); 1337*4882a593Smuzhiyun 19 spin_unlock(&gp_lock); 1338*4882a593Smuzhiyun 20 return true; 1339*4882a593Smuzhiyun 21 } 1340*4882a593Smuzhiyun 1341*4882a593SmuzhiyunNote that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and 1342*4882a593Smuzhiyunproceeds, without any need to pay any further attention to the 1343*4882a593Smuzhiyunsubsequent grace period and ``kfree()``. It is permissible to invoke 1344*4882a593Smuzhiyun``kfree_rcu()`` from the same environments as for ``call_rcu()``. 1345*4882a593SmuzhiyunInterestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()`` 1346*4882a593Smuzhiyunand ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the 1347*4882a593Smuzhiyunfact that RCU was not heavily used within DYNIX/ptx, so the very few 1348*4882a593Smuzhiyunplaces that needed something like ``synchronize_rcu()`` simply 1349*4882a593Smuzhiyunopen-coded it. 1350*4882a593Smuzhiyun 1351*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1352*4882a593Smuzhiyun| **Quick Quiz**: | 1353*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1354*4882a593Smuzhiyun| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` | 1355*4882a593Smuzhiyun| allowed updaters to avoid being blocked by readers. But how can that | 1356*4882a593Smuzhiyun| be correct, given that the invocation of the callback and the freeing | 1357*4882a593Smuzhiyun| of the memory (respectively) must still wait for a grace period to | 1358*4882a593Smuzhiyun| elapse? 
| 1359*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1360*4882a593Smuzhiyun| **Answer**: | 1361*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1362*4882a593Smuzhiyun| We could define things this way, but keep in mind that this sort of | 1363*4882a593Smuzhiyun| definition would say that updates in garbage-collected languages | 1364*4882a593Smuzhiyun| cannot complete until the next time the garbage collector runs, which | 1365*4882a593Smuzhiyun| does not seem at all reasonable. The key point is that in most cases, | 1366*4882a593Smuzhiyun| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed | 1367*4882a593Smuzhiyun| to the next update as soon as it has invoked ``call_rcu()`` or | 1368*4882a593Smuzhiyun| ``kfree_rcu()``, without having to wait for a subsequent grace | 1369*4882a593Smuzhiyun| period. | 1370*4882a593Smuzhiyun+-----------------------------------------------------------------------+ 1371*4882a593Smuzhiyun 1372*4882a593SmuzhiyunBut what if the updater must wait for the completion of code to be 1373*4882a593Smuzhiyunexecuted after the end of the grace period, but has other tasks that can 1374*4882a593Smuzhiyunbe carried out in the meantime? The polling-style 1375*4882a593Smuzhiyun``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions 1376*4882a593Smuzhiyunmay be used for this purpose, as shown below: 1377*4882a593Smuzhiyun 1378*4882a593Smuzhiyun :: 1379*4882a593Smuzhiyun 1380*4882a593Smuzhiyun 1 bool remove_gp_poll(void) 1381*4882a593Smuzhiyun 2 { 1382*4882a593Smuzhiyun 3 struct foo *p; 1383*4882a593Smuzhiyun 4 unsigned long s; 1384*4882a593Smuzhiyun 5 1385*4882a593Smuzhiyun 6 spin_lock(&gp_lock); 1386*4882a593Smuzhiyun 7 p = rcu_access_pointer(gp); 1387*4882a593Smuzhiyun 8 if (!p) { 1388*4882a593Smuzhiyun 9 spin_unlock(&gp_lock); 1389*4882a593Smuzhiyun 10 return false; 1390*4882a593Smuzhiyun 11 } 1391*4882a593Smuzhiyun 12 rcu_assign_pointer(gp, NULL); 1392*4882a593Smuzhiyun 13 spin_unlock(&gp_lock); 1393*4882a593Smuzhiyun 14 s = get_state_synchronize_rcu(); 1394*4882a593Smuzhiyun 15 do_something_while_waiting(); 1395*4882a593Smuzhiyun 16 cond_synchronize_rcu(s); 1396*4882a593Smuzhiyun 17 kfree(p); 1397*4882a593Smuzhiyun 18 return true; 1398*4882a593Smuzhiyun 19 } 1399*4882a593Smuzhiyun 1400*4882a593SmuzhiyunOn line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU, 1401*4882a593Smuzhiyunthen line 15 carries out other tasks, and finally, line 16 returns 1402*4882a593Smuzhiyunimmediately if a grace period has elapsed in the meantime, but otherwise 1403*4882a593Smuzhiyunwaits as required. The need for ``get_state_synchronize_rcu`` and 1404*4882a593Smuzhiyun``cond_synchronize_rcu()`` has appeared quite recently, so it is too 1405*4882a593Smuzhiyunearly to tell whether they will stand the test of time. 1406*4882a593Smuzhiyun 1407*4882a593SmuzhiyunRCU thus provides a range of tools to allow updaters to strike the 1408*4882a593Smuzhiyunrequired tradeoff between latency, flexibility and CPU overhead. 1409*4882a593Smuzhiyun 1410*4882a593SmuzhiyunForward Progress 1411*4882a593Smuzhiyun~~~~~~~~~~~~~~~~ 1412*4882a593Smuzhiyun 1413*4882a593SmuzhiyunIn theory, delaying grace-period completion and callback invocation is 1414*4882a593Smuzhiyunharmless. 
In practice, not only are memory sizes finite but also
1415*4882a593Smuzhiyuncallbacks sometimes do wakeups, and sufficiently deferred wakeups can be
1416*4882a593Smuzhiyundifficult to distinguish from system hangs. Therefore, RCU must provide
1417*4882a593Smuzhiyuna number of mechanisms to promote forward progress.
1418*4882a593Smuzhiyun
1419*4882a593SmuzhiyunThese mechanisms are not foolproof, nor can they be. For one simple
1420*4882a593Smuzhiyunexample, an infinite loop in an RCU read-side critical section must by
1421*4882a593Smuzhiyundefinition prevent later grace periods from ever completing. For a more
1422*4882a593Smuzhiyuninvolved example, consider a 64-CPU system built with
1423*4882a593Smuzhiyun``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
1424*4882a593SmuzhiyunCPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even
1425*4882a593Smuzhiyunif these tight loops also contain calls to ``cond_resched()`` (thus
1426*4882a593Smuzhiyunallowing grace periods to complete), CPU 0 simply will not be able to
1427*4882a593Smuzhiyuninvoke callbacks as fast as the other 63 CPUs can register them, at
1428*4882a593Smuzhiyunleast not until the system runs out of memory. In both of these
1429*4882a593Smuzhiyunexamples, the Spiderman principle applies: With great power comes great
1430*4882a593Smuzhiyunresponsibility. However, short of this level of abuse, RCU is required
1431*4882a593Smuzhiyunto ensure timely completion of grace periods and timely invocation of
1432*4882a593Smuzhiyuncallbacks.
1433*4882a593Smuzhiyun
1434*4882a593SmuzhiyunRCU takes the following steps to encourage timely completion of grace
1435*4882a593Smuzhiyunperiods:
1436*4882a593Smuzhiyun
1437*4882a593Smuzhiyun#. If a grace period fails to complete within 100 milliseconds, RCU
1438*4882a593Smuzhiyun causes future invocations of ``cond_resched()`` on the holdout CPUs
1439*4882a593Smuzhiyun to provide an RCU quiescent state. RCU also causes those CPUs'
1440*4882a593Smuzhiyun ``need_resched()`` invocations to return ``true``, but only after the
1441*4882a593Smuzhiyun corresponding CPU's next scheduling-clock interrupt.
1442*4882a593Smuzhiyun#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run
1443*4882a593Smuzhiyun indefinitely in the kernel without scheduling-clock interrupts, which
1444*4882a593Smuzhiyun defeats the above ``need_resched()`` stratagem. RCU will therefore
1445*4882a593Smuzhiyun invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out
1446*4882a593Smuzhiyun after 109 milliseconds.
1447*4882a593Smuzhiyun#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
1448*4882a593Smuzhiyun has been preempted within an RCU read-side critical section is
1449*4882a593Smuzhiyun holding out for more than 500 milliseconds, RCU will resort to
1450*4882a593Smuzhiyun priority boosting.
1451*4882a593Smuzhiyun#. If a CPU is still holding out 10 seconds into the grace period, RCU
1452*4882a593Smuzhiyun will invoke ``resched_cpu()`` on it regardless of its ``nohz_full``
1453*4882a593Smuzhiyun state.
1454*4882a593Smuzhiyun
1455*4882a593SmuzhiyunThe above values are defaults for systems running with ``HZ=1000``. They
1456*4882a593Smuzhiyunwill vary as the value of ``HZ`` varies, and can also be changed using
1457*4882a593Smuzhiyunthe relevant Kconfig options and kernel boot parameters. RCU currently
1458*4882a593Smuzhiyundoes not do much sanity checking of these parameters, so please use
1459*4882a593Smuzhiyuncaution when changing them.
Note that these forward-progress measures 1460*4882a593Smuzhiyunare provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks 1461*4882a593SmuzhiyunRCU <#Tasks%20RCU>`__. 1462*4882a593Smuzhiyun 1463*4882a593SmuzhiyunRCU takes the following steps in ``call_rcu()`` to encourage timely 1464*4882a593Smuzhiyuninvocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has 1465*4882a593Smuzhiyun10,000 callbacks, or has 10,000 more callbacks than it had the last time 1466*4882a593Smuzhiyunencouragement was provided: 1467*4882a593Smuzhiyun 1468*4882a593Smuzhiyun#. Starts a grace period, if one is not already in progress. 1469*4882a593Smuzhiyun#. Forces immediate checking for quiescent states, rather than waiting 1470*4882a593Smuzhiyun for three milliseconds to have elapsed since the beginning of the 1471*4882a593Smuzhiyun grace period. 1472*4882a593Smuzhiyun#. Immediately tags the CPU's callbacks with their grace period 1473*4882a593Smuzhiyun completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` 1474*4882a593Smuzhiyun handler to get around to it. 1475*4882a593Smuzhiyun#. Lifts callback-execution batch limits, which speeds up callback 1476*4882a593Smuzhiyun invocation at the expense of degrading realtime response. 1477*4882a593Smuzhiyun 1478*4882a593SmuzhiyunAgain, these are default values when running at ``HZ=1000``, and can be 1479*4882a593Smuzhiyunoverridden. Again, these forward-progress measures are provided only for 1480*4882a593SmuzhiyunRCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks 1481*4882a593SmuzhiyunRCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward 1482*4882a593Smuzhiyunprogress for ``rcu_nocbs`` CPUs is much less well-developed, in part 1483*4882a593Smuzhiyunbecause workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke 1484*4882a593Smuzhiyun``call_rcu()`` relatively infrequently. If workloads emerge that need 1485*4882a593Smuzhiyunboth ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then 1486*4882a593Smuzhiyunadditional forward-progress work will be required. 1487*4882a593Smuzhiyun 1488*4882a593SmuzhiyunComposability 1489*4882a593Smuzhiyun~~~~~~~~~~~~~ 1490*4882a593Smuzhiyun 1491*4882a593SmuzhiyunComposability has received much attention in recent years, perhaps in 1492*4882a593Smuzhiyunpart due to the collision of multicore hardware with object-oriented 1493*4882a593Smuzhiyuntechniques designed in single-threaded environments for single-threaded 1494*4882a593Smuzhiyunuse. And in theory, RCU read-side critical sections may be composed, and 1495*4882a593Smuzhiyunin fact may be nested arbitrarily deeply. In practice, as with all 1496*4882a593Smuzhiyunreal-world implementations of composable constructs, there are 1497*4882a593Smuzhiyunlimitations. 1498*4882a593Smuzhiyun 1499*4882a593SmuzhiyunImplementations of RCU for which ``rcu_read_lock()`` and 1500*4882a593Smuzhiyun``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when 1501*4882a593Smuzhiyun``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there 1502*4882a593Smuzhiyunis no overhead. Except that if all these instances of 1503*4882a593Smuzhiyun``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the 1504*4882a593Smuzhiyuncompiler, compilation will eventually fail due to exhausting memory, 1505*4882a593Smuzhiyunmass storage, or user patience, whichever comes first. 
If the nesting is 1506*4882a593Smuzhiyunnot visible to the compiler, as is the case with mutually recursive 1507*4882a593Smuzhiyunfunctions each in its own translation unit, stack overflow will result. 1508*4882a593SmuzhiyunIf the nesting takes the form of loops, perhaps in the guise of tail 1509*4882a593Smuzhiyunrecursion, either the control variable will overflow or (in the Linux 1510*4882a593Smuzhiyunkernel) you will get an RCU CPU stall warning. Nevertheless, this class 1511*4882a593Smuzhiyunof RCU implementations is one of the most composable constructs in 1512*4882a593Smuzhiyunexistence. 1513*4882a593Smuzhiyun 1514*4882a593SmuzhiyunRCU implementations that explicitly track nesting depth are limited by 1515*4882a593Smuzhiyunthe nesting-depth counter. For example, the Linux kernel's preemptible 1516*4882a593SmuzhiyunRCU limits nesting to ``INT_MAX``. This should suffice for almost all 1517*4882a593Smuzhiyunpractical purposes. That said, a consecutive pair of RCU read-side 1518*4882a593Smuzhiyuncritical sections between which there is an operation that waits for a 1519*4882a593Smuzhiyungrace period cannot be enclosed in another RCU read-side critical 1520*4882a593Smuzhiyunsection. This is because it is not legal to wait for a grace period 1521*4882a593Smuzhiyunwithin an RCU read-side critical section: To do so would result either 1522*4882a593Smuzhiyunin deadlock or in RCU implicitly splitting the enclosing RCU read-side 1523*4882a593Smuzhiyuncritical section, neither of which is conducive to a long-lived and 1524*4882a593Smuzhiyunprosperous kernel. 1525*4882a593Smuzhiyun 1526*4882a593SmuzhiyunIt is worth noting that RCU is not alone in limiting composability. For 1527*4882a593Smuzhiyunexample, many transactional-memory implementations prohibit composing a 1528*4882a593Smuzhiyunpair of transactions separated by an irrevocable operation (for example, 1529*4882a593Smuzhiyuna network receive operation). For another example, lock-based critical 1530*4882a593Smuzhiyunsections can be composed surprisingly freely, but only if deadlock is 1531*4882a593Smuzhiyunavoided. 1532*4882a593Smuzhiyun 1533*4882a593SmuzhiyunIn short, although RCU read-side critical sections are highly 1534*4882a593Smuzhiyuncomposable, care is required in some situations, just as is the case for 1535*4882a593Smuzhiyunany other composable synchronization mechanism. 1536*4882a593Smuzhiyun 1537*4882a593SmuzhiyunCorner Cases 1538*4882a593Smuzhiyun~~~~~~~~~~~~ 1539*4882a593Smuzhiyun 1540*4882a593SmuzhiyunA given RCU workload might have an endless and intense stream of RCU 1541*4882a593Smuzhiyunread-side critical sections, perhaps even so intense that there was 1542*4882a593Smuzhiyunnever a point in time during which there was not at least one RCU 1543*4882a593Smuzhiyunread-side critical section in flight. RCU cannot allow this situation to 1544*4882a593Smuzhiyunblock grace periods: As long as all the RCU read-side critical sections 1545*4882a593Smuzhiyunare finite, grace periods must also be finite. 1546*4882a593Smuzhiyun 1547*4882a593SmuzhiyunThat said, preemptible RCU implementations could potentially result in 1548*4882a593SmuzhiyunRCU read-side critical sections being preempted for long durations, 1549*4882a593Smuzhiyunwhich has the effect of creating a long-duration RCU read-side critical 1550*4882a593Smuzhiyunsection. This situation can arise only in heavily loaded systems, but 1551*4882a593Smuzhiyunsystems using real-time priorities are of course more vulnerable. 
1552*4882a593SmuzhiyunTherefore, RCU priority boosting is provided to help deal with this 1553*4882a593Smuzhiyuncase. That said, the exact requirements on RCU priority boosting will 1554*4882a593Smuzhiyunlikely evolve as more experience accumulates. 1555*4882a593Smuzhiyun 1556*4882a593SmuzhiyunOther workloads might have very high update rates. Although one can 1557*4882a593Smuzhiyunargue that such workloads should instead use something other than RCU, 1558*4882a593Smuzhiyunthe fact remains that RCU must handle such workloads gracefully. This 1559*4882a593Smuzhiyunrequirement is another factor driving batching of grace periods, but it 1560*4882a593Smuzhiyunis also the driving force behind the checks for large numbers of queued 1561*4882a593SmuzhiyunRCU callbacks in the ``call_rcu()`` code path. Finally, high update 1562*4882a593Smuzhiyunrates should not delay RCU read-side critical sections, although some 1563*4882a593Smuzhiyunsmall read-side delays can occur when using 1564*4882a593Smuzhiyun``synchronize_rcu_expedited()``, courtesy of this function's use of 1565*4882a593Smuzhiyun``smp_call_function_single()``. 1566*4882a593Smuzhiyun 1567*4882a593SmuzhiyunAlthough all three of these corner cases were understood in the early 1568*4882a593Smuzhiyun1990s, a simple user-level test consisting of ``close(open(path))`` in a 1569*4882a593Smuzhiyuntight loop in the early 2000s suddenly provided a much deeper 1570*4882a593Smuzhiyunappreciation of the high-update-rate corner case. This test also 1571*4882a593Smuzhiyunmotivated addition of some RCU code to react to high update rates, for 1572*4882a593Smuzhiyunexample, if a given CPU finds itself with more than 10,000 RCU callbacks 1573*4882a593Smuzhiyunqueued, it will cause RCU to take evasive action by more aggressively 1574*4882a593Smuzhiyunstarting grace periods and more aggressively forcing completion of 1575*4882a593Smuzhiyungrace-period processing. This evasive action causes the grace period to 1576*4882a593Smuzhiyuncomplete more quickly, but at the cost of restricting RCU's batching 1577*4882a593Smuzhiyunoptimizations, thus increasing the CPU overhead incurred by that grace 1578*4882a593Smuzhiyunperiod. 1579*4882a593Smuzhiyun 1580*4882a593SmuzhiyunSoftware-Engineering Requirements 1581*4882a593Smuzhiyun--------------------------------- 1582*4882a593Smuzhiyun 1583*4882a593SmuzhiyunBetween Murphy's Law and “To err is human”, it is necessary to guard 1584*4882a593Smuzhiyunagainst mishaps and misuse: 1585*4882a593Smuzhiyun 1586*4882a593Smuzhiyun#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere 1587*4882a593Smuzhiyun that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will 1588*4882a593Smuzhiyun splat if ``rcu_dereference()`` is used outside of an RCU read-side 1589*4882a593Smuzhiyun critical section. Update-side code can use 1590*4882a593Smuzhiyun ``rcu_dereference_protected()``, which takes a `lockdep 1591*4882a593Smuzhiyun expression <https://lwn.net/Articles/371986/>`__ to indicate what is 1592*4882a593Smuzhiyun providing the protection. If the indicated protection is not 1593*4882a593Smuzhiyun provided, a lockdep splat is emitted. 1594*4882a593Smuzhiyun Code shared between readers and updaters can use 1595*4882a593Smuzhiyun ``rcu_dereference_check()``, which also takes a lockdep expression, 1596*4882a593Smuzhiyun and emits a lockdep splat if neither ``rcu_read_lock()`` nor the 1597*4882a593Smuzhiyun indicated protection is in place. 
In addition, 1598*4882a593Smuzhiyun ``rcu_dereference_raw()`` is used in those (hopefully rare) cases 1599*4882a593Smuzhiyun where the required protection cannot be easily described. Finally, 1600*4882a593Smuzhiyun ``rcu_read_lock_held()`` is provided to allow a function to verify 1601*4882a593Smuzhiyun that it has been invoked within an RCU read-side critical section. I 1602*4882a593Smuzhiyun was made aware of this set of requirements shortly after Thomas 1603*4882a593Smuzhiyun Gleixner audited a number of RCU uses. 1604*4882a593Smuzhiyun#. A given function might wish to check for RCU-related preconditions 1605*4882a593Smuzhiyun upon entry, before using any other RCU API. The 1606*4882a593Smuzhiyun ``rcu_lockdep_assert()`` does this job, asserting the expression in 1607*4882a593Smuzhiyun kernels having lockdep enabled and doing nothing otherwise. 1608*4882a593Smuzhiyun#. It is also easy to forget to use ``rcu_assign_pointer()`` and 1609*4882a593Smuzhiyun ``rcu_dereference()``, perhaps (incorrectly) substituting a simple 1610*4882a593Smuzhiyun assignment. To catch this sort of error, a given RCU-protected 1611*4882a593Smuzhiyun pointer may be tagged with ``__rcu``, after which sparse will 1612*4882a593Smuzhiyun complain about simple-assignment accesses to that pointer. Arnd 1613*4882a593Smuzhiyun Bergmann made me aware of this requirement, and also supplied the 1614*4882a593Smuzhiyun needed `patch series <https://lwn.net/Articles/376011/>`__. 1615*4882a593Smuzhiyun#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if 1616*4882a593Smuzhiyun a data element is passed to ``call_rcu()`` twice in a row, without a 1617*4882a593Smuzhiyun grace period in between. (This error is similar to a double free.) 1618*4882a593Smuzhiyun The corresponding ``rcu_head`` structures that are dynamically 1619*4882a593Smuzhiyun allocated are automatically tracked, but ``rcu_head`` structures 1620*4882a593Smuzhiyun allocated on the stack must be initialized with 1621*4882a593Smuzhiyun ``init_rcu_head_on_stack()`` and cleaned up with 1622*4882a593Smuzhiyun ``destroy_rcu_head_on_stack()``. Similarly, statically allocated 1623*4882a593Smuzhiyun non-stack ``rcu_head`` structures must be initialized with 1624*4882a593Smuzhiyun ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``. 1625*4882a593Smuzhiyun Mathieu Desnoyers made me aware of this requirement, and also 1626*4882a593Smuzhiyun supplied the needed 1627*4882a593Smuzhiyun `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__. 1628*4882a593Smuzhiyun#. An infinite loop in an RCU read-side critical section will eventually 1629*4882a593Smuzhiyun trigger an RCU CPU stall warning splat, with the duration of 1630*4882a593Smuzhiyun “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` 1631*4882a593Smuzhiyun ``Kconfig`` option, or, alternatively, by the 1632*4882a593Smuzhiyun ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU 1633*4882a593Smuzhiyun is not obligated to produce this splat unless there is a grace period 1634*4882a593Smuzhiyun waiting on that particular RCU read-side critical section. 1635*4882a593Smuzhiyun 1636*4882a593Smuzhiyun Some extreme workloads might intentionally delay RCU grace periods, 1637*4882a593Smuzhiyun and systems running those workloads can be booted with 1638*4882a593Smuzhiyun ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This 1639*4882a593Smuzhiyun kernel parameter may also be set via ``sysfs``. 
Furthermore, RCU CPU
1640*4882a593Smuzhiyun stall warnings are counter-productive during sysrq dumps and during
1641*4882a593Smuzhiyun panics. RCU therefore supplies the ``rcu_sysrq_start()`` and
1642*4882a593Smuzhiyun ``rcu_sysrq_end()`` API members to be called before and after long
1643*4882a593Smuzhiyun sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is
1644*4882a593Smuzhiyun automatically invoked at the beginning of a panic to suppress further
1645*4882a593Smuzhiyun RCU CPU stall warnings.
1646*4882a593Smuzhiyun
1647*4882a593Smuzhiyun This requirement made itself known in the early 1990s, pretty much
1648*4882a593Smuzhiyun the first time that it was necessary to debug a CPU stall. That said,
1649*4882a593Smuzhiyun the initial implementation in DYNIX/ptx was quite generic in
1650*4882a593Smuzhiyun comparison with that of Linux.
1651*4882a593Smuzhiyun
1652*4882a593Smuzhiyun#. Although it would be very good to detect pointers leaking out of RCU
1653*4882a593Smuzhiyun read-side critical sections, there is currently no good way of doing
1654*4882a593Smuzhiyun this. One complication is the need to distinguish between pointers
1655*4882a593Smuzhiyun leaking and pointers that have been handed off from RCU to some other
1656*4882a593Smuzhiyun synchronization mechanism, for example, reference counting.
1657*4882a593Smuzhiyun#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
1658*4882a593Smuzhiyun is provided via event tracing.
1659*4882a593Smuzhiyun#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
1660*4882a593Smuzhiyun to create typical linked data structures can be surprisingly
1661*4882a593Smuzhiyun error-prone. Therefore, RCU-protected `linked
1662*4882a593Smuzhiyun lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
1663*4882a593Smuzhiyun more recently, RCU-protected `hash
1664*4882a593Smuzhiyun tables <https://lwn.net/Articles/612100/>`__ are available. Many
1665*4882a593Smuzhiyun other special-purpose RCU-protected data structures are available in
1666*4882a593Smuzhiyun the Linux kernel and the userspace RCU library.
1667*4882a593Smuzhiyun#. Some linked structures are created at compile time, but still require
1668*4882a593Smuzhiyun ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves
1669*4882a593Smuzhiyun this purpose.
1670*4882a593Smuzhiyun#. It is not necessary to use ``rcu_assign_pointer()`` when creating
1671*4882a593Smuzhiyun linked structures that are to be published via a single external
1672*4882a593Smuzhiyun pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task
1673*4882a593Smuzhiyun and also for assigning ``NULL`` pointers at runtime.
1674*4882a593Smuzhiyun
1675*4882a593SmuzhiyunThis is not a hard-and-fast list: RCU's diagnostic capabilities will
1676*4882a593Smuzhiyuncontinue to be guided by the number and type of usage bugs found in
1677*4882a593Smuzhiyunreal-world RCU usage.
1678*4882a593Smuzhiyun
1679*4882a593SmuzhiyunLinux Kernel Complications
1680*4882a593Smuzhiyun--------------------------
1681*4882a593Smuzhiyun
1682*4882a593SmuzhiyunThe Linux kernel provides an interesting environment for all kinds of
1683*4882a593Smuzhiyunsoftware, including RCU. Some of the relevant points of interest are as
1684*4882a593Smuzhiyunfollows:
1685*4882a593Smuzhiyun
1686*4882a593Smuzhiyun#. `Configuration`_
1687*4882a593Smuzhiyun#. `Firmware Interface`_
1688*4882a593Smuzhiyun#. `Early Boot`_
1689*4882a593Smuzhiyun#. `Interrupts and NMIs`_
1690*4882a593Smuzhiyun#.
`Loadable Modules`_
#. `Hotplug CPU`_
#. `Scheduler and RCU`_
#. `Tracing and RCU`_
#. `Accesses to User Memory and RCU`_
#. `Energy Efficiency`_
#. `Scheduling-Clock Interrupts and RCU`_
#. `Memory Efficiency`_
#. `Performance, Scalability, Response Time, and Reliability`_

This list is probably incomplete, but it does give a feel for the most
notable Linux-kernel complications. Each of the following sections
covers one of the above topics.

Configuration
~~~~~~~~~~~~~

RCU's goal is automatic configuration, so that almost nobody needs to
worry about RCU's ``Kconfig`` options. And for almost all users, RCU
does in fact work well “out of the box.”

However, there are specialized use cases that are handled by kernel boot
parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
system will explicitly ask users about new ``Kconfig`` options, which
requires that almost all of them be hidden behind a ``CONFIG_RCU_EXPERT``
``Kconfig`` option.

This all should be quite obvious, but the fact remains that Linus
Torvalds recently had to
`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__
me of this requirement.

Firmware Interface
~~~~~~~~~~~~~~~~~~

In many cases, the kernel obtains information about the system from the
firmware, and sometimes things are lost in translation. Or the
translation is accurate, but the original message is bogus.

For example, some systems' firmware overreports the number of CPUs,
sometimes by a large factor. If RCU naively believed the firmware, as it
used to do, it would create too many per-CPU kthreads. Although the
resulting system will still run correctly, the extra kthreads needlessly
consume memory and can cause confusion when they show up in ``ps``
listings.

RCU must therefore wait for a given CPU to actually come online before
it can allow itself to believe that the CPU actually exists. The
resulting “ghost CPUs” (which are never going to come online) cause a
number of `interesting
complications <https://paulmck.livejournal.com/37494.html>`__.

Early Boot
~~~~~~~~~~

The Linux kernel's boot sequence is an interesting process, and RCU is
used early, even before ``rcu_init()`` is invoked. In fact, a number of
RCU's primitives can be used as soon as the initial task's
``task_struct`` is available and the boot CPU's per-CPU variables are
set up. The read-side primitives (``rcu_read_lock()``,
``rcu_read_unlock()``, ``rcu_dereference()``, and
``rcu_access_pointer()``) will operate normally very early on, as will
``rcu_assign_pointer()``.

Although ``call_rcu()`` may be invoked at any time during boot,
callbacks are not guaranteed to be invoked until after all of RCU's
kthreads have been spawned, which occurs at ``early_initcall()`` time.
This delay in callback invocation is due to the fact that RCU does not
invoke callbacks until it is fully initialized, and this full
initialization cannot occur until after the scheduler has initialized
itself to the point where RCU can spawn and run its kthreads. In theory,
it would be possible to invoke callbacks earlier; however, this is not a
panacea because there would be severe restrictions on what operations
those callbacks could invoke.

Perhaps surprisingly, ``synchronize_rcu()`` and
``synchronize_rcu_expedited()`` will operate normally during very early
boot, the reason being that there is only one CPU and preemption is
disabled. This means that the call to ``synchronize_rcu()`` (or friends)
itself is a quiescent state and thus a grace period, so the early-boot
implementation can be a no-op.

However, once the scheduler has spawned its first kthread, this early
boot trick fails for ``synchronize_rcu()`` (as well as for
``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The
reason is that an RCU read-side critical section might be preempted,
which means that a subsequent ``synchronize_rcu()`` really does have to
wait for something, as opposed to simply returning immediately.
Unfortunately, ``synchronize_rcu()`` can't do this until all of its
kthreads are spawned, which doesn't happen until some time during
``early_initcall()`` time. But this is no excuse: RCU is nevertheless
required to correctly handle synchronous grace periods during this time
period. Once all of its kthreads are up and running, RCU starts running
normally.

+-----------------------------------------------------------------------+
| **Quick Quiz**: |
+-----------------------------------------------------------------------+
| How can RCU possibly handle grace periods before all of its kthreads |
| have been spawned??? |
+-----------------------------------------------------------------------+
| **Answer**: |
+-----------------------------------------------------------------------+
| Very carefully! |
| During the “dead zone” between the time that the scheduler spawns the |
| first task and the time that all of RCU's kthreads have been spawned, |
| all synchronous grace periods are handled by the expedited |
| grace-period mechanism. At runtime, this expedited mechanism relies |
| on workqueues, but during the dead zone the requesting task itself |
| drives the desired expedited grace period. Because dead-zone |
| execution takes place within task context, everything works. Once the |
| dead zone ends, expedited grace periods go back to using workqueues, |
| as is required to avoid problems that would otherwise occur when a |
| user task received a POSIX signal while driving an expedited grace |
| period. |
| |
| And yes, this does mean that it is unhelpful to send POSIX signals to |
| random tasks between the time that the scheduler spawns its first |
| kthread and the time that RCU's kthreads have all been spawned. If |
| there ever turns out to be a good reason for sending POSIX signals |
| during that time, appropriate adjustments will be made. (If it turns |
| out that POSIX signals are sent during this time for no good reason, |
| other adjustments will be made, appropriate or otherwise.) |
+-----------------------------------------------------------------------+

I learned of these boot-time requirements as a result of a series of
system hangs.

Interrupts and NMIs
~~~~~~~~~~~~~~~~~~~

The Linux kernel has interrupts, and RCU read-side critical sections are
legal within interrupt handlers and within interrupt-disabled regions of
code, as are invocations of ``call_rcu()``.

Some Linux-kernel architectures can enter an interrupt handler from
non-idle process context, and then just never leave it, instead
stealthily transitioning back to process context. This trick is
sometimes used to invoke system calls from inside the kernel. These
“half-interrupts” mean that RCU has to be very careful about how it
counts interrupt nesting levels. I learned of this requirement the hard
way during a rewrite of RCU's dyntick-idle code.

The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
critical sections are legal within NMI handlers. Thankfully, RCU
update-side primitives, including ``call_rcu()``, are prohibited within
NMI handlers.
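
To make the interrupt-handler case concrete, here is a minimal sketch of
a hard-irq handler that relies on an RCU read-side critical section;
``call_rcu()`` could equally be invoked here. The ``struct foo``,
``foo_gp`` pointer, and ``foo_handle()`` helper are hypothetical, not
part of any real driver:

 ::

      /* All foo_* names below are hypothetical. */
      static irqreturn_t foo_interrupt(int irq, void *dev_id)
      {
              struct foo *p;

              rcu_read_lock();           /* RCU readers are legal in hard-irq context. */
              p = rcu_dereference(foo_gp);
              if (p)
                      foo_handle(p);     /* Hypothetical device work. */
              rcu_read_unlock();
              return IRQ_HANDLED;
      }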

The name notwithstanding, some Linux-kernel architectures can have
nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised
me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
with this requirement; he also kindly surprised me with `an
algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__
that meets this requirement.

Furthermore, NMI handlers can be interrupted by what appear to RCU to be
normal interrupts. One way that this can happen is for code that
directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called
from an NMI handler. This astonishing fact of life prompted the current
code structure, which has ``rcu_irq_enter()`` invoking
``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``.
And yes, I also learned of this requirement the hard way.

Loadable Modules
~~~~~~~~~~~~~~~~

The Linux kernel has loadable modules, and these modules can also be
unloaded. After a given module has been unloaded, any attempt to call
one of its functions results in a segmentation fault. The module-unload
functions must therefore cancel any delayed calls to loadable-module
functions, for example, any outstanding ``mod_timer()`` must be dealt
with via ``del_timer_sync()`` or similar.

Unfortunately, there is no way to cancel an RCU callback; once you
invoke ``call_rcu()``, the callback function is eventually going to be
invoked, unless the system goes down first. Because it is normally
considered socially irresponsible to crash the system in response to a
module unload request, we need some other way to deal with in-flight RCU
callbacks.

RCU therefore provides ``rcu_barrier()``, which waits until all
in-flight RCU callbacks have been invoked. If a module uses
``call_rcu()``, its exit function should therefore prevent any future
invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory,
the underlying module-unload code could invoke ``rcu_barrier()``
unconditionally, but in practice this would incur unacceptable
latencies.

Nikita Danilov noted this requirement for an analogous
filesystem-unmount situation, and Dipankar Sarma incorporated
``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module
unloading became apparent later.

.. important::

   The ``rcu_barrier()`` function is not, repeat,
   *not*, obligated to wait for a grace period. It is instead only required
   to wait for RCU callbacks that have already been posted. Therefore, if
   there are no RCU callbacks posted anywhere in the system,
   ``rcu_barrier()`` is within its rights to return immediately. Even if
   there are callbacks posted, ``rcu_barrier()`` does not necessarily need
   to wait for a grace period.

+-----------------------------------------------------------------------+
| **Quick Quiz**: |
+-----------------------------------------------------------------------+
| Wait a minute! Each RCU callback must wait for a grace period to |
| complete, and ``rcu_barrier()`` must wait for each pre-existing |
| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to |
| wait for a full grace period if there is even one callback posted |
| anywhere in the system? |
+-----------------------------------------------------------------------+
| **Answer**: |
+-----------------------------------------------------------------------+
| Absolutely not!!! |
| Yes, each RCU callback must wait for a grace period to complete, but |
| it might well be partly (or even completely) finished waiting by the |
| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` |
| need only wait for the remaining portion of the grace period to |
| elapse. So even if there are quite a few callbacks posted, |
| ``rcu_barrier()`` might well return quite quickly. |
| |
| So if you need to wait for a grace period as well as for all |
| pre-existing callbacks, you will need to invoke both |
| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, |
| you can always use workqueues to invoke them concurrently. |
+-----------------------------------------------------------------------+

Hotplug CPU
~~~~~~~~~~~

The Linux kernel supports CPU hotplug, which means that CPUs can come
and go. It is of course illegal to use any RCU API member from an
offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
critical sections. This requirement was present from day one in
DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
implementation is “interesting.”

The Linux-kernel CPU-hotplug implementation has notifiers that are used
to allow the various kernel subsystems (including RCU) to respond
appropriately to a given CPU-hotplug operation. Most RCU operations may
be invoked from CPU-hotplug notifiers, including even synchronous
grace-period operations such as ``synchronize_rcu()`` and
``synchronize_rcu_expedited()``.
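
For example, a subsystem might register CPU-hotplug callbacks along the
following lines, with the teardown handler using a synchronous grace
period to wait out readers that might still be referencing per-CPU
state. This is only a sketch: the ``foo_*`` helpers are hypothetical,
while ``cpuhp_setup_state()`` and ``CPUHP_AP_ONLINE_DYN`` are the
standard hotplug-callback registration interface:

 ::

      static int foo_cpu_online(unsigned int cpu)
      {
              foo_publish_cpu(cpu);      /* Hypothetical: make per-CPU data visible. */
              return 0;
      }

      static int foo_cpu_offline(unsigned int cpu)
      {
              foo_unpublish_cpu(cpu);    /* New readers can no longer find this CPU's data. */
              synchronize_rcu();         /* Legal from a CPU-hotplug callback. */
              foo_free_cpu(cpu);         /* Hypothetical: now safe to free. */
              return 0;
      }

      static int __init foo_init(void)
      {
              int ret;

              ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "foo:online",
                                      foo_cpu_online, foo_cpu_offline);
              return ret < 0 ? ret : 0;
      }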

However, all-callback-wait operations such as ``rcu_barrier()`` are also
not supported, due to the fact that there are phases of CPU-hotplug
operations where the outgoing CPU's callbacks will not be invoked until
after the CPU-hotplug operation ends, which could also result in
deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
during its execution, which results in another type of deadlock when
invoked from a CPU-hotplug notifier.

Scheduler and RCU
~~~~~~~~~~~~~~~~~

RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
accumulation by these kthreads. This requirement was no surprise, but
RCU's violation of it while running context-switch-heavy workloads on
kernels built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
RCU has made good progress towards meeting this requirement, even for
context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
room for further improvement.

There is no longer any prohibition against holding any of the
scheduler's runqueue or priority-inheritance spinlocks across an
``rcu_read_unlock()``, even if interrupts and preemption were enabled
somewhere within the corresponding RCU read-side critical section.
Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
with preemption enabled, acquire one of the scheduler locks, and hold
that lock across the matching ``rcu_read_unlock()``.

Similarly, the RCU flavor consolidation has removed the need for negative
nesting. The fact that interrupt-disabled regions of code act as RCU
read-side critical sections implicitly avoids earlier issues that used
to result in destructive recursion via interrupt handlers' use of RCU.

Tracing and RCU
~~~~~~~~~~~~~~~

It is possible to use tracing on RCU code, but tracing itself uses RCU.
For this reason, ``rcu_dereference_raw_check()`` is provided for use
by tracing, which avoids the destructive recursion that could otherwise
ensue. This API is also used by virtualization in some architectures,
where RCU readers execute in environments in which tracing cannot be
used. The tracing folks both located the requirement and provided the
needed fix, so this surprise requirement was relatively painless.
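
As an illustration only (the ``trace_gp`` pointer and ``trace_ctx``
structure are hypothetical), a helper intended to be callable from
tracing might use ``rcu_dereference_raw_check()``, whose lockdep check
merely requires that *some* style of RCU reader be in effect:

 ::

      /* Hypothetical RCU-protected pointer used by a tracer. */
      static struct trace_ctx __rcu *trace_gp;

      static struct trace_ctx *trace_ctx_get(void)
      {
              /*
               * Unlike plain rcu_dereference(), the lockdep check here
               * accepts any flavor of reader (rcu, rcu_bh, rcu_sched, or
               * simply preemption/interrupts disabled), which is what
               * tracing call sites need.
               */
              return rcu_dereference_raw_check(trace_gp);
      }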

Accesses to User Memory and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel needs to access user-space memory, for example, to access data
referenced by system-call parameters. The ``get_user()`` macro does this job.

However, user-space memory might well be paged out, which means that
``get_user()`` might well page-fault and thus block while waiting for the
resulting I/O to complete. It would be a very bad thing for the compiler to
reorder a ``get_user()`` invocation into an RCU read-side critical section.

For example, suppose that the source code looked like this:

 ::

    1 rcu_read_lock();
    2 p = rcu_dereference(gp);
    3 v = p->value;
    4 rcu_read_unlock();
    5 get_user(user_v, user_p);
    6 do_something_with(v, user_v);

The compiler must not be permitted to transform this source code into
the following:

 ::

    1 rcu_read_lock();
    2 p = rcu_dereference(gp);
    3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
    4 v = p->value;
    5 rcu_read_unlock();
    6 do_something_with(v, user_v);

If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel
build, and if ``get_user()`` did page fault, the result would be a quiescent
state in the middle of an RCU read-side critical section. This misplaced
quiescent state could result in line 4 being a use-after-free access,
which could be bad for your kernel's actuarial statistics. Similar examples
can be constructed with the call to ``get_user()`` preceding the
``rcu_read_lock()``.

Unfortunately, ``get_user()`` doesn't have any particular ordering properties,
and in some architectures the underlying ``asm`` isn't even marked
``volatile``. And even if it were marked ``volatile``, the above access to
``p->value`` is not volatile, so the compiler would not have any reason to keep
those two accesses in order.

Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
``rcu_read_unlock()`` must act as compiler barriers, at least for outermost
instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set
of RCU read-side critical sections.

Energy Efficiency
~~~~~~~~~~~~~~~~~

Interrupting idle CPUs is considered socially unacceptable, especially
by people with battery-powered embedded systems. RCU therefore conserves
energy by detecting which CPUs are idle, including tracking CPUs that
have been interrupted from idle. This is a large part of the
energy-efficiency requirement, so I learned of this via an irate phone
call.

Because RCU avoids interrupting idle CPUs, it is illegal to execute an
RCU read-side critical section on an idle CPU. (Kernels built with
``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()``
macro and ``_rcuidle`` event tracing are provided to work around this
restriction. In addition, ``rcu_is_watching()`` may be used to test
whether or not it is currently legal to run RCU read-side critical
sections on this CPU. I learned of the need for diagnostics on the one
hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.
Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite
heavily in the idle loop. However, there are some restrictions on the
code placed within ``RCU_NONIDLE()``:

#. Blocking is prohibited. In practice, this is not a serious
   restriction given that idle tasks are prohibited from blocking to
   begin with.
#. Although ``RCU_NONIDLE()`` invocations may be nested, they cannot nest
   indefinitely deeply. However, given that they can be nested on the
   order of a million deep, even on 32-bit systems, this should not be a
   serious restriction. This nesting limit would probably be reached
   long after the compiler OOMed or the stack overflowed.
#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that
   same ``RCU_NONIDLE()``. For example, the following is grossly
   illegal:

   ::

      1 RCU_NONIDLE({
      2   do_something();
      3   goto bad_idea; /* BUG!!! */
      4   do_something_else();});
      5 bad_idea:

   It is just as illegal to transfer control into the middle of
   ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in
   as long as you also transferred out, but in practice you could also
   expect to get sharply worded review comments.

It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU
running in userspace. RCU must therefore track ``nohz_full`` userspace
execution. RCU must also be able to sample state at two points in
time, and be able to determine whether or not some other CPU spent any
time idle and/or executing in userspace.
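
As a minimal sketch of how the APIs described above fit together (the
``do_idle_event()`` helper is hypothetical), idle-path code that needs a
momentary RCU reader can either check ``rcu_is_watching()`` or wrap the
access in ``RCU_NONIDLE()``:

 ::

      static void foo_idle_hook(void)
      {
              if (rcu_is_watching()) {
                      do_idle_event();           /* RCU readers are already legal here. */
              } else {
                      /* Have RCU watch for just the duration of the statement. */
                      RCU_NONIDLE(do_idle_event());
              }
      }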

These energy-efficiency requirements have proven quite difficult to
understand and to meet. For example, there have been more than five
clean-sheet rewrites of RCU's energy-efficiency code, the last of which
was finally able to demonstrate `real energy savings running on real
hardware
[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__.
As noted earlier, I learned of many of these requirements via angry
phone calls: Flaming me on the Linux-kernel mailing list was apparently
not sufficient to fully vent their ire at RCU's energy-efficiency bugs!

Scheduling-Clock Interrupts and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel transitions between in-kernel non-idle execution, userspace
execution, and the idle loop. Depending on kernel configuration, RCU
handles these states differently:

+-----------------+------------------+------------------+-----------------+
| ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
+=================+==================+==================+=================+
| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on |
| | scheduling-clock | scheduling-clock | RCU's |
| | interrupt. | interrupt and | dyntick-idle |
| | | its detection | detection. |
| | | of interrupt | |
| | | from usermode. | |
+-----------------+------------------+------------------+-----------------+
| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on |
| | scheduling-clock | scheduling-clock | RCU's |
| | interrupt. | interrupt and | dyntick-idle |
| | | its detection | detection. | |
| | | of interrupt | |
| | | from usermode. | |
+-----------------+------------------+------------------+-----------------+
| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on |
| | sometimes rely | RCU's | RCU's |
| | on | dyntick-idle | dyntick-idle |
| | scheduling-clock | detection. | detection. |
| | interrupt. In | | |
| | other cases, it | | |
| | is necessary to | | |
| | bound kernel | | |
| | execution times | | |
| | and/or use | | |
| | IPIs. | | |
+-----------------+------------------+------------------+-----------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**: |
+-----------------------------------------------------------------------+
| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
| ``NO_HZ_IDLE`` do? |
+-----------------------------------------------------------------------+
| **Answer**: |
+-----------------------------------------------------------------------+
| Because, as a performance optimization, ``NO_HZ_FULL`` does not |
| necessarily re-enable the scheduling-clock interrupt on entry to each |
| and every system call. |
+-----------------------------------------------------------------------+

However, RCU must be reliably informed as to whether any given CPU is
currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that
CPU is executing in usermode, as discussed
`earlier <#Energy%20Efficiency>`__. It also requires that the
scheduling-clock interrupt be enabled when RCU needs it to be:

#. If a CPU is either idle or executing in usermode, and RCU believes it
   is non-idle, the scheduling-clock tick had better be running.
   Otherwise, you will get RCU CPU stall warnings. Or at best, very long
   (11-second) grace periods, with a pointless IPI waking the CPU from
   time to time.
#. If a CPU is in a portion of the kernel that executes RCU read-side
   critical sections, and RCU believes this CPU to be idle, you will get
   random memory corruption. **DON'T DO THIS!!!**
   This is one reason to test with lockdep, which will complain about
   this sort of thing.
#. If a CPU is in a portion of the kernel that is absolutely positively
   no-joking guaranteed to never execute any RCU read-side critical
   sections, and RCU believes this CPU to be idle, no problem. This
   sort of thing is used by some architectures for light-weight
   exception handlers, which can then avoid the overhead of
   ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and
   exit, respectively. Some go further and avoid the entireties of
   ``irq_enter()`` and ``irq_exit()``.
   Just make very sure you are running some of your tests with
   ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in
   fact joking about not doing RCU read-side critical sections.
#. If a CPU is executing in the kernel with the scheduling-clock
   interrupt disabled and RCU believes this CPU to be non-idle, and if
   the CPU goes idle (from an RCU perspective) every few jiffies, no
   problem. It is usually OK for there to be the occasional gap between
   idle periods of up to a second or so.
   If the gap grows too long, you get RCU CPU stall warnings.
#. If a CPU is either idle or executing in usermode, and RCU believes it
   to be idle, of course no problem.
#. If a CPU is executing in the kernel, the kernel code path is passing
   through quiescent states at a reasonable frequency (preferably about
   once per few jiffies, but the occasional excursion to a second or so
   is usually OK) and the scheduling-clock interrupt is enabled, of
   course no problem.
   If the gap between a successive pair of quiescent states grows too
   long, you get RCU CPU stall warnings.

+-----------------------------------------------------------------------+
| **Quick Quiz**: |
+-----------------------------------------------------------------------+
| But what if my driver has a hardware interrupt handler that can run |
| for many seconds? I cannot invoke ``schedule()`` from a hardware |
| interrupt handler, after all! |
+-----------------------------------------------------------------------+
| **Answer**: |
+-----------------------------------------------------------------------+
| One approach is to do ``rcu_irq_exit(); rcu_irq_enter();`` every so |
| often. But given that long-running interrupt handlers can cause other |
| problems, not least for response time, shouldn't you work to keep |
| your interrupt handler's runtime within reasonable bounds? |
+-----------------------------------------------------------------------+

But as long as RCU is properly informed of kernel state transitions
between in-kernel execution, usermode execution, and idle, and as long
as the scheduling-clock interrupt is enabled when RCU needs it to be,
you can rest assured that the bugs you encounter will be in some other
part of RCU or some other part of the kernel!

Memory Efficiency
~~~~~~~~~~~~~~~~~

Although small-memory non-realtime systems can simply use Tiny RCU, code
size is only one aspect of memory efficiency. Another aspect is the size
of the ``rcu_head`` structure used by ``call_rcu()`` and
``kfree_rcu()``. Although this structure contains nothing more than a
pair of pointers, it does appear in many RCU-protected data structures,
including some that are size critical. The ``page`` structure is a case
in point, as evidenced by the many occurrences of the ``union`` keyword
within that structure.

This need for memory efficiency is one reason that RCU uses hand-crafted
singly linked lists to track the ``rcu_head`` structures that are
waiting for a grace period to elapse. It is also the reason why
``rcu_head`` structures do not contain debug information, such as fields
tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that
posted them. Although this information might appear in debug-only kernel
builds at some point, in the meantime, the ``->func`` field will often
provide the needed debug information.

However, in some cases, the need for memory efficiency leads to even
more extreme measures. Returning to the ``page`` structure, the
``rcu_head`` field shares storage with a great many other structures
that are used at various points in the corresponding page's lifetime. In
order to correctly resolve certain `race
conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__,
the Linux kernel's memory-management subsystem needs a particular bit to
remain zero during all phases of grace-period processing, and that bit
happens to map to the bottom bit of the ``rcu_head`` structure's
``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
used to post the callback, as opposed to ``kfree_rcu()`` or some future
“lazy” variant of ``call_rcu()`` that might one day be created for
energy-efficiency purposes.

That said, there are limits. RCU requires that the ``rcu_head``
structure be aligned to a two-byte boundary, and passing a misaligned
``rcu_head`` structure to one of the ``call_rcu()`` family of functions
will result in a splat. It is therefore necessary to exercise caution
when packing structures containing fields of type ``rcu_head``. Why not
a four-byte or even eight-byte alignment requirement? Because the m68k
architecture provides only two-byte alignment, and thus acts as
alignment's least common denominator.

The reason for reserving the bottom bit of pointers to ``rcu_head``
structures is to leave the door open to “lazy” callbacks whose
invocations can safely be deferred. Deferring invocation could
potentially have energy-efficiency benefits, but only if the rate of
non-lazy callbacks decreases significantly for some important workload.
In the meantime, reserving the bottom bit keeps this option open in case
it one day becomes useful.
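
For reference, here is the usual way an ``rcu_head`` is embedded and
used. The ``struct foo`` shown is hypothetical, but the pattern is the
standard one for ``call_rcu()`` and ``kfree_rcu()``:

 ::

      struct foo {
              struct list_head list;
              int key;
              struct rcu_head rh;        /* Just two pointers' worth of storage. */
      };

      static void foo_reclaim(struct rcu_head *rhp)
      {
              struct foo *fp = container_of(rhp, struct foo, rh);

              kfree(fp);
      }

      static void foo_release(struct foo *fp)
      {
              call_rcu(&fp->rh, foo_reclaim);
              /* Or, when the callback would only kfree(): kfree_rcu(fp, rh); */
      }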

Performance, Scalability, Response Time, and Reliability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expanding on the `earlier
discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by
hot code paths in performance-critical portions of the Linux kernel's
networking, security, virtualization, and scheduling code paths. RCU
must therefore use efficient implementations, especially in its
read-side primitives. To that end, it would be good if preemptible RCU's
implementation of ``rcu_read_lock()`` could be inlined; however, doing
this requires resolving ``#include`` issues with the ``task_struct``
structure.

The Linux kernel supports hardware configurations with up to 4096 CPUs,
which means that RCU must be extremely scalable. Algorithms that involve
frequent acquisitions of global locks or frequent atomic operations on
global variables simply cannot be tolerated within the RCU
implementation. RCU therefore makes heavy use of a combining tree based
on the ``rcu_node`` structure. RCU is required to tolerate all CPUs
continuously invoking any combination of RCU's runtime primitives with
minimal per-operation overhead. In fact, in many cases, increasing load
must *decrease* the per-operation overhead, witness the batching
optimizations for ``synchronize_rcu()``, ``call_rcu()``,
``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general
rule, RCU must cheerfully accept whatever the rest of the Linux kernel
decides to throw at it.

The Linux kernel is used for real-time workloads, especially in
conjunction with the `-rt
patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The
real-time-latency response requirements are such that the traditional
approach of disabling preemption across RCU read-side critical sections
is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
an RCU implementation that allows RCU read-side critical sections to be
preempted. This requirement made its presence known after users made it
clear that an earlier `real-time
patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in
conjunction with some `RCU
issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__
encountered by a very early version of the -rt patchset.

In addition, RCU must make do with a sub-100-microsecond real-time
latency budget. In fact, on smaller systems with the -rt patchset, the
Linux kernel provides sub-20-microsecond real-time latencies for the
whole kernel, including RCU. RCU's scalability and latency must
therefore be sufficient for these sorts of configurations. To my
surprise, the sub-100-microsecond real-time latency budget `applies to
even the largest systems
[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__,
up to and including systems with 4096 CPUs. This real-time requirement
motivated the grace-period kthread, which also simplified handling of a
number of race conditions.

RCU must avoid degrading real-time response for CPU-bound threads,
whether executing in usermode (which is one use case for
``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
the kernel must execute ``cond_resched()`` at least once per few tens of
milliseconds in order to avoid receiving an IPI from RCU.

Finally, RCU's status as a synchronization primitive means that any RCU
failure can result in arbitrary memory corruption that can be extremely
difficult to debug. This means that RCU must be extremely reliable,
which in practice also means that RCU must have an aggressive
stress-test suite. This stress-test suite is called ``rcutorture``.

Although the need for ``rcutorture`` was no surprise, the current
immense popularity of the Linux kernel is posing interesting—and perhaps
unprecedented—validation challenges. To see this, keep in mind that
there are well over one billion instances of the Linux kernel running
today, given Android smartphones, Linux-powered televisions, and
servers. This number can be expected to increase sharply with the advent
of the celebrated Internet of Things.

Suppose that RCU contains a race condition that manifests on average
once per million years of runtime. This bug will be occurring about
three times per *day* across the installed base. RCU could simply hide
behind hardware error rates, given that no one should really expect
their smartphone to last for a million years. However, anyone taking too
much comfort from this thought should consider the fact that in most
jurisdictions, a successful multi-year test of a given mechanism, which
might include a Linux kernel, suffices for a number of types of
safety-critical certifications. In fact, rumor has it that the Linux
kernel is already being used in production for safety-critical
applications. I don't know about you, but I would feel quite bad if a
bug in RCU killed someone. Which might explain my recent focus on
validation and verification.
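
Returning to the ``cond_resched()`` rule for CPU-bound kernel loops
mentioned above, the usual pattern looks something like the following
sketch (the ``foo_work`` structure and ``do_one_unit()`` helper are
hypothetical):

 ::

      struct foo_work {
              struct list_head list;
              /* ... hypothetical payload ... */
      };

      static void foo_process_all(struct list_head *head)
      {
              struct foo_work *wp;

              list_for_each_entry(wp, head, list) {
                      do_one_unit(wp);   /* Hypothetical unit of CPU-bound work. */
                      cond_resched();    /* Supplies a quiescent state, avoiding RCU's IPI. */
              }
      }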

Other RCU Flavors
-----------------

One of the more surprising things about RCU is that there are now no
fewer than five *flavors*, or API families. In addition, the primary
flavor that has been the sole focus up to this point has two different
implementations, non-preemptible and preemptible. The other four flavors
are listed below, with requirements for each described in a separate
section.

#. `Bottom-Half Flavor (Historical)`_
#. `Sched Flavor (Historical)`_
#. `Sleepable RCU`_
#. `Tasks RCU`_

Bottom-Half Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The RCU-bh flavor of RCU has since been expressed in terms of the other
RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
softirq and to be accounted for by lockdep. Much of the material in this
section is therefore strictly historical in nature.

The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
flavor of RCU that could withstand the network-based denial-of-service
attacks researched by Robert Olsson. These attacks placed so much
networking load on the system that some of the CPUs never exited softirq
execution, which in turn prevented those CPUs from ever executing a
context switch, which, in the RCU implementation of that time, prevented
grace periods from ever ending. The result was an out-of-memory
condition and a system hang.

The solution was the creation of RCU-bh, which does
``local_bh_disable()`` across its read-side critical sections, and which
uses the transition from one type of softirq processing to another as a
quiescent state in addition to context switch, idle, user mode, and
offline. This means that RCU-bh grace periods can complete even when
some of the CPUs execute in softirq indefinitely, thus allowing
algorithms based on RCU-bh to withstand network-based denial-of-service
attacks.

Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
re-enable softirq handlers, any attempt to start a softirq handler
during the RCU-bh read-side critical section will be deferred. In this
case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
take considerable time. One can of course argue that this softirq
overhead should be associated with the code following the RCU-bh
read-side critical section rather than ``rcu_read_unlock_bh()``, but the
fact is that most profiling tools cannot be expected to make this sort
of fine distinction. For example, suppose that a three-millisecond-long
RCU-bh read-side critical section executes during a time of heavy
networking load. There will very likely be an attempt to invoke at least
one softirq handler during that three milliseconds, but any such
invocation will be delayed until the time of the
``rcu_read_unlock_bh()``. This can of course make it appear at first
glance as if ``rcu_read_unlock_bh()`` was executing very slowly.

The `RCU-bh
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
``call_rcu_bh()``, ``rcu_barrier_bh()``, and
``rcu_read_lock_bh_held()``. However, the update-side APIs are now
simple wrappers for other RCU flavors, namely RCU-sched in
``CONFIG_PREEMPT=n`` kernels and RCU-preempt otherwise.

Sched Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~

The RCU-sched flavor of RCU has since been expressed in terms of the
other RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
preemption and to be accounted for by lockdep. Much of the material in
this section is therefore strictly historical in nature.

Before preemptible RCU, waiting for an RCU grace period had the side
effect of also waiting for all pre-existing interrupt and NMI handlers.
However, there are legitimate preemptible-RCU implementations that do
not have this property, given that any point in the code outside of an
RCU read-side critical section can be a quiescent state. Therefore,
*RCU-sched* was created, which follows “classic” RCU in that an
RCU-sched grace period waits for pre-existing interrupt and NMI
handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
RCU-sched APIs have identical implementations, while kernels built with
``CONFIG_PREEMPT=y`` provide a separate implementation for each.

Note well that in ``CONFIG_PREEMPT=y`` kernels,
``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and
re-enable preemption, respectively. This means that if there was a
preemption attempt during the RCU-sched read-side critical section,
``rcu_read_unlock_sched()`` will enter the scheduler, with all the
latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
this can make it look as if ``rcu_read_unlock_sched()`` was executing
very slowly. However, the highest-priority task won't be preempted, so
that task will enjoy low-overhead ``rcu_read_unlock_sched()``
invocations.

The `RCU-sched
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``,
``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``,
``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``,
``call_rcu_sched()``, ``rcu_barrier_sched()``, and
``rcu_read_lock_sched_held()``. However, anything that disables
preemption also marks an RCU-sched read-side critical section, including
``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
``local_irq_restore()``, and so on.

Sleepable RCU
~~~~~~~~~~~~~

For well over a decade, someone saying “I need to block within an RCU
read-side critical section” was a reliable indication that this someone
did not understand RCU. After all, if you are always blocking in an RCU
read-side critical section, you can probably afford to use a
higher-overhead synchronization mechanism. However, that changed with
the advent of the Linux kernel's notifiers, whose RCU read-side critical
sections almost never sleep, but sometimes need to. This resulted in the
introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or
*SRCU*.

SRCU allows different domains to be defined, with each such domain
defined by an instance of an ``srcu_struct`` structure. A pointer to
this structure must be passed in to each SRCU function, for example,
``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct``
structure. The key benefit of these domains is that a slow SRCU reader
in one domain does not delay an SRCU grace period in some other domain.
That said, one consequence of these domains is that read-side code must
pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for
example, as follows:

 ::

    1 int idx;
    2
    3 idx = srcu_read_lock(&ss);
    4 do_something();
    5 srcu_read_unlock(&ss, idx);

As noted above, it is legal to block within SRCU read-side critical
sections; however, with great power comes great responsibility. If you
block forever in one of a given domain's SRCU read-side critical
sections, then that domain's grace periods will also be blocked forever.
Of course, one good way to block forever is to deadlock, which can
happen if any operation in a given domain's SRCU read-side critical
section can wait, either directly or indirectly, for that domain's grace
period to elapse. For example, this results in a self-deadlock:

 ::

    1 int idx;
    2
    3 idx = srcu_read_lock(&ss);
    4 do_something();
    5 synchronize_srcu(&ss);
    6 srcu_read_unlock(&ss, idx);

However, if line 5 acquired a mutex that was held across a
``synchronize_srcu()`` for domain ``ss``, deadlock would still be
possible. Furthermore, if line 5 acquired a mutex that was held across a
``synchronize_srcu()`` for some other domain ``ss1``, and if an
``ss1``-domain SRCU read-side critical section acquired another mutex
that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock
would again be possible. Such a deadlock cycle could extend across an
arbitrarily large number of different SRCU domains. Again, with great
power comes great responsibility.

Unlike the other RCU flavors, SRCU read-side critical sections can run
on idle and even offline CPUs. This ability requires that
``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
which means that SRCU readers will run a bit slower than would RCU
readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
which, in combination with ``srcu_read_unlock()``, guarantees a full
memory barrier.

Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
periods make use of timers and the possibility of timers being
temporarily “stranded” on the outgoing CPU. This stranding of timers
means that timers posted to the outgoing CPU will not fire until late in
the CPU-hotplug process. The problem is that if a notifier is waiting on
an SRCU grace period, that grace period is waiting on a timer, and that
timer is stranded on the outgoing CPU, then the notifier will never be
awakened, in other words, deadlock has occurred. This same situation of
course also prohibits ``srcu_barrier()`` from being invoked from
CPU-hotplug notifiers.

SRCU also differs from other RCU flavors in that SRCU's expedited and
non-expedited grace periods are implemented by the same mechanism. This
means that in the current SRCU implementation, expediting a future grace
period has the side effect of expediting all prior grace periods that
have not yet completed. (But please note that this is a property of the
current implementation, not necessarily of future implementations.) In
addition, if SRCU has been idle for longer than the interval specified
by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
by default), and if a ``synchronize_srcu()`` invocation ends this idle
period, that invocation will be automatically expedited.

As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
locking bottleneck present in prior kernel versions. Although this will
allow users to put much heavier stress on ``call_srcu()``, it is
important to note that SRCU does not yet take any special steps to deal
with callback flooding. So if you are posting (say) 10,000 SRCU
callbacks per second per CPU, you are probably totally OK, but if you
intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
run some tests first. SRCU just might need a few adjustments to deal with
that sort of load. Of course, your mileage may vary based on the speed
of your CPUs and the size of your memory.

The `SRCU
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
``srcu_dereference()``, ``srcu_dereference_check()``,
``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
``init_srcu_struct()`` APIs for defining and initializing
``srcu_struct`` structures.

Tasks RCU
~~~~~~~~~

Some forms of tracing use “trampolines” to handle the binary rewriting
required to install different types of probes. It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU. However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
not work to have these markers in the trampoline itself, because there
would need to be instructions following ``rcu_read_unlock()``. Although
``synchronize_rcu()`` would guarantee that execution reached the
``rcu_read_unlock()``, it would not be able to guarantee that execution
had completely left the trampoline. Worse yet, in some situations
the trampoline's protection must extend a few instructions *prior* to
execution reaching the trampoline. For example, these few instructions
might calculate the address of the trampoline, so that entering the
trampoline would be pre-ordained a surprisingly long time before execution
actually reached the trampoline itself.

The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
critical sections that are delimited by voluntary context switches, that
is, calls to ``schedule()``, ``cond_resched()``, and
``synchronize_rcu_tasks()``. In addition, transitions to and from
userspace execution also delimit tasks-RCU read-side critical sections.

The tasks-RCU API is quite compact, consisting only of
``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines
cannot be preempted, so these APIs map to ``call_rcu()``,
``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In
``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these
three APIs are therefore implemented by separate functions that check
for voluntary context switches.
The desire is of course for zero grace-period latency as 2627*4882a593Smuzhiyunwell as zero interprocessor interrupts undertaken during an expedited 2628*4882a593Smuzhiyungrace period operation. While this ideal is unlikely to be achievable, 2629*4882a593Smuzhiyunit is quite possible that further improvements can be made. 2630*4882a593Smuzhiyun 2631*4882a593SmuzhiyunThe multiprocessor implementations of RCU use a combining tree that 2632*4882a593Smuzhiyungroups CPUs so as to reduce lock contention and increase cache locality. 2633*4882a593SmuzhiyunHowever, this combining tree does not spread its memory across NUMA 2634*4882a593Smuzhiyunnodes nor does it align the CPU groups with hardware features such as 2635*4882a593Smuzhiyunsockets or cores. Such spreading and alignment is currently believed to 2636*4882a593Smuzhiyunbe unnecessary because the hotpath read-side primitives do not access 2637*4882a593Smuzhiyunthe combining tree, nor does ``call_rcu()`` in the common case. If you 2638*4882a593Smuzhiyunbelieve that your architecture needs such spreading and alignment, then 2639*4882a593Smuzhiyunyour architecture should also benefit from the 2640*4882a593Smuzhiyun``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the 2641*4882a593Smuzhiyunnumber of CPUs in a socket, NUMA node, or whatever. If the number of 2642*4882a593SmuzhiyunCPUs is too large, use a fraction of the number of CPUs. If the number 2643*4882a593Smuzhiyunof CPUs is a large prime number, well, that certainly is an 2644*4882a593Smuzhiyun“interesting” architectural choice! More flexible arrangements might be 2645*4882a593Smuzhiyunconsidered, but only if ``rcutree.rcu_fanout_leaf`` has proven 2646*4882a593Smuzhiyuninadequate, and only if the inadequacy has been demonstrated by a 2647*4882a593Smuzhiyuncarefully run and realistic system-level workload. 2648*4882a593Smuzhiyun 2649*4882a593SmuzhiyunPlease note that arrangements that require RCU to remap CPU numbers will 2650*4882a593Smuzhiyunrequire extremely good demonstration of need and full exploration of 2651*4882a593Smuzhiyunalternatives. 2652*4882a593Smuzhiyun 2653*4882a593SmuzhiyunRCU's various kthreads are reasonably recent additions. It is quite 2654*4882a593Smuzhiyunlikely that adjustments will be required to more gracefully handle 2655*4882a593Smuzhiyunextreme loads. It might also be necessary to be able to relate CPU 2656*4882a593Smuzhiyunutilization by RCU's kthreads and softirq handlers to the code that 2657*4882a593Smuzhiyuninstigated this CPU utilization. For example, RCU callback overhead 2658*4882a593Smuzhiyunmight be charged back to the originating ``call_rcu()`` instance, though 2659*4882a593Smuzhiyunprobably not in production kernels. 2660*4882a593Smuzhiyun 2661*4882a593SmuzhiyunAdditional work may be required to provide reasonable forward-progress 2662*4882a593Smuzhiyunguarantees under heavy load for grace periods and for callback 2663*4882a593Smuzhiyuninvocation. 2664*4882a593Smuzhiyun 2665*4882a593SmuzhiyunSummary 2666*4882a593Smuzhiyun------- 2667*4882a593Smuzhiyun 2668*4882a593SmuzhiyunThis document has presented more than two decade's worth of RCU 2669*4882a593Smuzhiyunrequirements. Given that the requirements keep changing, this will not 2670*4882a593Smuzhiyunbe the last word on this subject, but at least it serves to get an 2671*4882a593Smuzhiyunimportant subset of the requirements set forth. 
2672*4882a593Smuzhiyun 2673*4882a593SmuzhiyunAcknowledgments 2674*4882a593Smuzhiyun--------------- 2675*4882a593Smuzhiyun 2676*4882a593SmuzhiyunI am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg 2677*4882a593SmuzhiyunNesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy 2678*4882a593SmuzhiyunLutomirski for their help in rendering this article human readable, and 2679*4882a593Smuzhiyunto Michelle Rankin for her support of this effort. Other contributions 2680*4882a593Smuzhiyunare acknowledged in the Linux kernel's git archive. 2681