.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the best applications of RCU is to protect read-mostly linked lists
(``struct list_head`` in list.h). One big advantage of this approach
is that all of the required memory barriers are included for you in
the list macros. This document describes several applications of RCU,
with the best fits first.


Example 1: Read-mostly list: Deferred Destruction
--------------------------------------------------

A widely used use case for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node that
links all the processes. The list can be traversed in parallel to any list
additions or removals.

The traversal of the list is done using ``for_each_process()``, which is defined
by these two macros::

        #define next_task(p) \
                list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

        #define for_each_process(p) \
                for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

        rcu_read_lock();
        for_each_process(p) {
                /* Do something with p */
        }
        rcu_read_unlock();

The simplified code for removing a process from a task list is::

        void release_task(struct task_struct *p)
        {
                write_lock(&tasklist_lock);
                list_del_rcu(&p->tasks);
                write_unlock(&tasklist_lock);
                call_rcu(&p->rcu, delayed_put_task_struct);
        }

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under
``tasklist_lock`` writer lock protection, to remove the task from the list of
all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals
from corrupting the list. Readers using ``for_each_process()`` are not protected
with the ``tasklist_lock``. To prevent readers from noticing changes in the list
pointers, the ``task_struct`` object is freed only after one or more grace
periods elapse (with the help of call_rcu()). This deferring of destruction
ensures that any readers traversing the list will see valid ``p->tasks.next``
pointers and deletion/freeing can happen in parallel with traversal of the list.
This pattern is also called an **existence lock**, since RCU pins the object in
memory until all existing readers finish.
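
In ``release_task()`` above, call_rcu() queues ``delayed_put_task_struct()`` to
run only after a grace period has elapsed, that is, after all pre-existing
readers have finished. The general shape of such a deferred-destruction
callback is to recover the enclosing object with container_of() and release it.
A minimal sketch of the pattern, using a hypothetical ``struct foo``,
``foo_list``, and ``foo_lock`` rather than the task-list code itself, might
look like::

        struct foo {
                struct list_head list;
                struct rcu_head rcu;
                int data;
        };

        static LIST_HEAD(foo_list);             /* RCU-protected list of foo objects */
        static DEFINE_SPINLOCK(foo_lock);       /* serializes updaters */

        /* Runs after a grace period, so no reader can still be traversing
         * a list that reaches this object. */
        static void foo_reclaim(struct rcu_head *rp)
        {
                struct foo *fp = container_of(rp, struct foo, rcu);

                kfree(fp);
        }

        void foo_add(struct foo *fp)
        {
                spin_lock(&foo_lock);
                list_add_rcu(&fp->list, &foo_list);
                spin_unlock(&foo_lock);
        }

        void foo_del(struct foo *fp)
        {
                /* Exclude other updaters, unlink, then defer the free. */
                spin_lock(&foo_lock);
                list_del_rcu(&fp->list);
                spin_unlock(&foo_lock);
                call_rcu(&fp->rcu, foo_reclaim);
        }

When the callback does nothing but free the object, kfree_rcu(fp, rcu) can be
used instead of an explicit call_rcu() callback.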


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

The best applications are cases where, if reader-writer locking were
used, the read-side lock would be dropped before taking any action
based on the results of the search. The most celebrated example is
the routing table. Because the routing table is tracking the state of
equipment outside of the computer, it will at times contain stale data.
Therefore, once the route has been computed, there is no need to hold
the routing table static during transmission of the packet. After all,
you can hold the routing table static all you want, but that won't keep
the external Internet from changing, and it is the state of the external
Internet that really matters. In addition, routing entries are typically
added or deleted, rather than being modified in place.

A straightforward example of this use of RCU may be found in the
system-call auditing support. For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state state;

                read_lock(&auditsc_lock);
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                read_unlock(&auditsc_lock);
                                return state;
                        }
                }
                read_unlock(&auditsc_lock);
                return AUDIT_BUILD_CONTEXT;
        }

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned. By the time that this value is acted
on, the list may well have been modified. This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state state;

                rcu_read_lock();
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                rcu_read_unlock();
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives
insert the read-side memory barriers that are required on DEC Alpha CPUs.

The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                write_lock(&auditsc_lock);
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                list_del(&e->list);
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT;         /* No matching rule */
        }

        static inline int audit_add_rule(struct audit_entry *entry,
                                         struct list_head *list)
        {
                write_lock(&auditsc_lock);
                if (entry->rule.flags & AUDIT_PREPEND) {
                        entry->rule.flags &= ~AUDIT_PREPEND;
                        list_add(&entry->list, list);
                } else {
                        list_add_tail(&entry->list, list);
                }
                write_unlock(&auditsc_lock);
                return 0;
        }

Following are the RCU equivalents for these two functions::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                /* No need to use the _rcu iterator here, since this is the only
                 * deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                list_del_rcu(&e->list);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

        static inline int audit_add_rule(struct audit_entry *entry,
                                         struct list_head *list)
        {
                if (entry->rule.flags & AUDIT_PREPEND) {
                        entry->rule.flags &= ~AUDIT_PREPEND;
                        list_add_rcu(&entry->list, list);
                } else {
                        list_add_tail_rcu(&entry->list, list);
                }
                return 0;
        }
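
The ``audit_free_rule()`` callback handed to call_rcu() above is not shown in
this document. A minimal sketch, assuming (as the call_rcu() invocation above
implies) that ``struct audit_entry`` embeds a ``struct rcu_head`` named
``rcu``, might look like::

        static void audit_free_rule(struct rcu_head *head)
        {
                struct audit_entry *e = container_of(head, struct audit_entry, rcu);

                /* Invoked after a grace period, so no reader can still be
                 * referencing this entry via the RCU-protected list. */
                kfree(e);
        }

A real implementation would also release any memory owned by ``e->rule``
before freeing the entry itself.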

Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are needed on
weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.

So, when readers can tolerate stale data and when entries are either added or
deleted, without in-place modification, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place. However,
if it did, the reader-writer-locked code to do so might look as follows
(assuming only ``field_count`` is updated; otherwise, the added fields would
need to be filled in)::

        static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry *e;
                struct audit_entry *ne;

                write_lock(&auditsc_lock);
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                e->rule.action = newaction;
                                e->rule.field_count = newfield_count;
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT;         /* No matching rule */
        }

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry. This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name. The RCU code is as follows::

        static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry *e;
                struct audit_entry *ne;

                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
                                if (ne == NULL)
                                        return -ENOMEM;
                                audit_copy_rule(&ne->rule, &e->rule);
                                ne->rule.action = newaction;
                                ne->rule.field_count = newfield_count;
                                list_replace_rcu(&e->list, &ne->list);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
writer lock would become a spinlock in this sort of code.

Another use of this pattern can be found in the Open vSwitch (``openvswitch``)
driver's *connection tracking table* code in ``ct_limit_set()``. The table
holds connection tracking entries and has a limit on the maximum entries.
There is one such table per zone and hence one *limit* per zone. The zones are
mapped to their limits through a hashtable using an RCU-managed hlist for the
hash chains. When a new limit is set, a new limit object is allocated and
``ct_limit_set()`` is called to replace the old limit object with the new one
using list_replace_rcu(). The old limit object is then freed after a grace
period using kfree_rcu().
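
The update in ``ct_limit_set()`` has the same copy-and-replace shape as
``audit_upd_rule()`` above. A condensed sketch of it, with structure and field
names chosen for illustration rather than copied from the driver, and assuming
the caller already holds the mutex that serializes updates to the hash table,
might look like::

        struct ct_limit {
                struct hlist_node hlist_node;   /* chained into an RCU-managed hlist */
                struct rcu_head rcu;
                u16 zone;
                u32 limit;
        };

        static void ct_limit_set(struct hlist_head *head, struct ct_limit *new_limit)
        {
                struct ct_limit *old;

                /* Updates are serialized by the caller's mutex, so the plain
                 * (non-_rcu) iterator is sufficient here. */
                hlist_for_each_entry(old, head, hlist_node) {
                        if (old->zone == new_limit->zone) {
                                /* hlist counterpart of list_replace_rcu(). */
                                hlist_replace_rcu(&old->hlist_node,
                                                  &new_limit->hlist_node);
                                /* Free the old object after a grace period. */
                                kfree_rcu(old, rcu);
                                return;
                        }
                }
                /* No limit yet for this zone: publish the new one. */
                hlist_add_head_rcu(&new_limit->hlist_node, head);
        }

Readers look up the per-zone limit with hlist_for_each_entry_rcu() under
rcu_read_lock(), just as in the list examples above.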


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state. Because there is a delay from the
time the external state changes before Linux becomes aware of the change,
additional RCU-induced staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC (see the shm_lock()
function in ipc/shm.c). This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist. For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.

.. _quick_quiz:

Quick Quiz:
        For the deleted-flag technique to be helpful, why is it necessary
        to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
audit_entry structure, and modify ``audit_filter_task()`` as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state state;

                rcu_read_lock();
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                spin_lock(&e->lock);
                                if (e->deleted) {
                                        spin_unlock(&e->lock);
                                        rcu_read_unlock();
                                        return AUDIT_BUILD_CONTEXT;
                                }
                                rcu_read_unlock();
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

Note that this example assumes that entries are only added and deleted.
Additional mechanism is required to deal correctly with the update-in-place
performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would
need additional memory barriers to ensure that the list_add_rcu() was really
executed before the list_del_rcu().

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                /* No need to use the _rcu iterator here, since this
                 * is the only deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                spin_lock(&e->lock);
                                list_del_rcu(&e->list);
                                e->deleted = 1;
                                spin_unlock(&e->lock);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

This too assumes that the caller holds ``audit_filter_mutex``.
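
For reference, the ``audit_entry`` layout that the two functions above assume,
a hypothetical extension of the structure for this example rather than code
taken from the kernel, might look like::

        struct audit_entry {
                struct list_head        list;
                struct rcu_head         rcu;
                struct audit_rule       rule;
                spinlock_t              lock;           /* protects ->deleted */
                int                     deleted;        /* set under ->lock when unlinked */
        };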


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping stale
objects during read-side list traversal, where an object is considered stale
if it is pending destruction after one or more grace periods. One such example
can be found in the timerfd subsystem. When the ``CLOCK_REALTIME`` clock is
reprogrammed, for example due to setting of the system time, all programmed
timerfds that depend on this clock get triggered and processes waiting on them
are woken up in advance of their scheduled expiry. To facilitate this, all such
timers are added to an RCU-managed ``cancel_list`` when they are set up in
``timerfd_setup_cancel()``::

        static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
        {
                spin_lock(&ctx->cancel_lock);
                if (ctx->clockid == CLOCK_REALTIME &&
                    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
                        if (!ctx->might_cancel) {
                                ctx->might_cancel = true;
                                spin_lock(&cancel_lock);
                                list_add_rcu(&ctx->clist, &cancel_list);
                                spin_unlock(&cancel_lock);
                        }
                }
                spin_unlock(&ctx->cancel_lock);
        }

When a timerfd is freed (its fd is closed), the ``might_cancel`` flag of the
timerfd object is cleared, and the object is removed from the ``cancel_list``
and destroyed::

        int timerfd_release(struct inode *inode, struct file *file)
        {
                struct timerfd_ctx *ctx = file->private_data;

                spin_lock(&ctx->cancel_lock);
                if (ctx->might_cancel) {
                        ctx->might_cancel = false;
                        spin_lock(&cancel_lock);
                        list_del_rcu(&ctx->clist);
                        spin_unlock(&cancel_lock);
                }
                spin_unlock(&ctx->cancel_lock);

                hrtimer_cancel(&ctx->t.tmr);
                kfree_rcu(ctx, rcu);
                return 0;
        }

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
which walks the 413*4882a593Smuzhiyun``cancel_list`` and wakes up processes waiting on the timerfd. While iterating 414*4882a593Smuzhiyunthe ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale 415*4882a593Smuzhiyunobjects:: 416*4882a593Smuzhiyun 417*4882a593Smuzhiyun void timerfd_clock_was_set(void) 418*4882a593Smuzhiyun { 419*4882a593Smuzhiyun struct timerfd_ctx *ctx; 420*4882a593Smuzhiyun unsigned long flags; 421*4882a593Smuzhiyun 422*4882a593Smuzhiyun rcu_read_lock(); 423*4882a593Smuzhiyun list_for_each_entry_rcu(ctx, &cancel_list, clist) { 424*4882a593Smuzhiyun if (!ctx->might_cancel) 425*4882a593Smuzhiyun continue; 426*4882a593Smuzhiyun spin_lock_irqsave(&ctx->wqh.lock, flags); 427*4882a593Smuzhiyun if (ctx->moffs != ktime_mono_to_real(0)) { 428*4882a593Smuzhiyun ctx->moffs = KTIME_MAX; 429*4882a593Smuzhiyun ctx->ticks++; 430*4882a593Smuzhiyun wake_up_locked_poll(&ctx->wqh, EPOLLIN); 431*4882a593Smuzhiyun } 432*4882a593Smuzhiyun spin_unlock_irqrestore(&ctx->wqh.lock, flags); 433*4882a593Smuzhiyun } 434*4882a593Smuzhiyun rcu_read_unlock(); 435*4882a593Smuzhiyun } 436*4882a593Smuzhiyun 437*4882a593SmuzhiyunThe key point here is, because RCU-traversal of the ``cancel_list`` happens 438*4882a593Smuzhiyunwhile objects are being added and removed to the list, sometimes the traversal 439*4882a593Smuzhiyuncan step on an object that has been removed from the list. In this example, it 440*4882a593Smuzhiyunis seen that it is better to skip such objects using a flag. 441*4882a593Smuzhiyun 442*4882a593Smuzhiyun 443*4882a593SmuzhiyunSummary 444*4882a593Smuzhiyun------- 445*4882a593Smuzhiyun 446*4882a593SmuzhiyunRead-mostly list-based data structures that can tolerate stale data are 447*4882a593Smuzhiyunthe most amenable to use of RCU. The simplest case is where entries are 448*4882a593Smuzhiyuneither added or deleted from the data structure (or atomically modified 449*4882a593Smuzhiyunin place), but non-atomic in-place modifications can be handled by making 450*4882a593Smuzhiyuna copy, updating the copy, then replacing the original with the copy. 451*4882a593SmuzhiyunIf stale data cannot be tolerated, then a *deleted* flag may be used 452*4882a593Smuzhiyunin conjunction with a per-entry spinlock in order to allow the search 453*4882a593Smuzhiyunfunction to reject newly deleted data. 454*4882a593Smuzhiyun 455*4882a593Smuzhiyun.. _quick_quiz_answer: 456*4882a593Smuzhiyun 457*4882a593SmuzhiyunAnswer to Quick Quiz: 458*4882a593Smuzhiyun For the deleted-flag technique to be helpful, why is it necessary 459*4882a593Smuzhiyun to hold the per-entry lock while returning from the search function? 460*4882a593Smuzhiyun 461*4882a593Smuzhiyun If the search function drops the per-entry lock before returning, 462*4882a593Smuzhiyun then the caller will be processing stale data in any case. If it 463*4882a593Smuzhiyun is really OK to be processing stale data, then you don't need a 464*4882a593Smuzhiyun *deleted* flag. If processing stale data really is a problem, 465*4882a593Smuzhiyun then you need to hold the per-entry lock across all of the code 466*4882a593Smuzhiyun that uses the value that was returned. 467*4882a593Smuzhiyun 468*4882a593Smuzhiyun:ref:`Back to Quick Quiz <quick_quiz>` 469