.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the best applications of RCU is to protect read-mostly linked lists
(``struct list_head`` in list.h).  One big advantage of this approach
is that all of the required memory barriers are included for you in
the list macros.  This document describes several applications of RCU,
with the best fits first.


Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

A widely used use case for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node that
links all the processes. The list can be traversed in parallel with any list
additions or removals.

The traversal of the list is done using ``for_each_process()``, which is defined
by the following two macros::

	#define next_task(p) \
		list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

	#define for_each_process(p) \
		for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

	rcu_read_lock();
	for_each_process(p) {
		/* Do something with p */
	}
	rcu_read_unlock();

The simplified code for removing a process from a task list is::

	void release_task(struct task_struct *p)
	{
		write_lock(&tasklist_lock);
		list_del_rcu(&p->tasks);
		write_unlock(&tasklist_lock);
		call_rcu(&p->rcu, delayed_put_task_struct);
	}

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
while holding the ``tasklist_lock`` writer lock, in order to remove the task
from the list of all tasks. The ``tasklist_lock`` prevents concurrent list
additions/removals from corrupting the list. Readers using
``for_each_process()`` are not protected by the ``tasklist_lock``. To prevent
readers from noticing changes in the list pointers, the ``task_struct`` object
is freed only after one or more grace periods elapse (with the help of
call_rcu()). This deferring of destruction ensures that any readers traversing
the list will see valid ``p->tasks.next`` pointers, and that deletion/freeing
can happen in parallel with traversal of the list. This pattern is also called
an **existence lock**, since RCU pins the object in memory until all existing
readers finish.
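
The RCU callback passed to call_rcu() runs only after a grace period has
elapsed, at which point it drops the reference that was keeping the
``task_struct`` around. A simplified sketch of ``delayed_put_task_struct()``
follows; the actual callback in kernel/exit.c also contains perf and tracing
hooks::

	static void delayed_put_task_struct(struct rcu_head *rhp)
	{
		struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

		/* Runs after a grace period, so no task-list reader can still
		 * be referencing this task_struct. */
		put_task_struct(tsk);
	}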


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

The best applications are cases where, if reader-writer locking were
used, the read-side lock would be dropped before taking any action
based on the results of the search.  The most celebrated example is
the routing table.  Because the routing table is tracking the state of
equipment outside of the computer, it will at times contain stale data.
Therefore, once the route has been computed, there is no need to hold
the routing table static during transmission of the packet.  After all,
you can hold the routing table static all you want, but that won't keep
the external Internet from changing, and it is the state of the external
Internet that really matters.  In addition, routing entries are typically
added or deleted, rather than being modified in place.

A straightforward example of this use of RCU may be found in the
system-call auditing support.  For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		read_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				read_unlock(&auditsc_lock);
				return state;
			}
		}
		read_unlock(&auditsc_lock);
		return AUDIT_BUILD_CONTEXT;
	}

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned.  By the time that this value is acted
on, the list may well have been modified.  This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		rcu_read_lock();
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
become list_for_each_entry_rcu().  The **_rcu()** list-traversal primitives
insert the read-side memory barriers that are required on DEC Alpha CPUs.
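
For illustration, the RCU read side above is roughly equivalent to the
following open-coded traversal, in which each ``next`` pointer is fetched with
rcu_dereference(); this is a conceptual sketch, not the actual macro expansion
from include/linux/rculist.h::

	struct audit_entry *e;
	struct list_head *pos;

	rcu_read_lock();
	for (pos = rcu_dereference(audit_tsklist.next); pos != &audit_tsklist;
	     pos = rcu_dereference(pos->next)) {
		e = list_entry(pos, struct audit_entry, list);
		if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
			rcu_read_unlock();
			return state;
		}
	}
	rcu_read_unlock();
	return AUDIT_BUILD_CONTEXT;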

The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		write_lock(&auditsc_lock);
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del(&e->list);
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;		/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		write_lock(&auditsc_lock);
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add(&entry->list, list);
		} else {
			list_add_tail(&entry->list, list);
		}
		write_unlock(&auditsc_lock);
		return 0;
	}

Following are the RCU equivalents for these two functions::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this is the only
		 * deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del_rcu(&e->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add_rcu(&entry->list, list);
		} else {
			list_add_tail_rcu(&entry->list, list);
		}
		return 0;
	}

Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.
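
For illustration only, if the callers did not already hold
``audit_filter_mutex``, updaters could be serialized with a spinlock while
readers still take no lock at all. The ``audit_list_lock`` below is
hypothetical, not part of the actual audit code::

	static DEFINE_SPINLOCK(audit_list_lock);	/* hypothetical */

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		spin_lock(&audit_list_lock);
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del_rcu(&e->list);
				spin_unlock(&audit_list_lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		spin_unlock(&audit_list_lock);
		return -EFAULT;		/* No matching rule */
	}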

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are needed on
weakly ordered CPUs (most of them!).  The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.
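
To see roughly where those barriers come from, the following is a simplified
sketch of the helper behind list_add_rcu(); the real __list_add_rcu() also
contains debug checks and sparse annotations via list_next_rcu()::

	static inline void __list_add_rcu(struct list_head *new,
					  struct list_head *prev,
					  struct list_head *next)
	{
		new->next = next;
		new->prev = prev;
		/* rcu_assign_pointer() publishes the new element only after it
		 * has been fully initialized, so a concurrent reader either
		 * misses it entirely or sees it fully formed. */
		rcu_assign_pointer(prev->next, new);
		next->prev = new;
	}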

So, when readers can tolerate stale data and when entries are either added or
deleted without in-place modification, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place.  However,
if it did, the reader-writer-locked code to do so might look as follows
(assuming that only ``field_count`` is updated; otherwise, the added fields
would need to be filled in)::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		write_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				e->rule.action = newaction;
				e->rule.field_count = newfield_count;
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;		/* No matching rule */
	}

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry.  This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name.  The RCU code is as follows::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
				if (ne == NULL)
					return -ENOMEM;
				audit_copy_rule(&ne->rule, &e->rule);
				ne->rule.action = newaction;
				ne->rule.field_count = newfield_count;
				list_replace_rcu(&e->list, &ne->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

Again, this assumes that the caller holds ``audit_filter_mutex``.  Normally, the
writer lock would become a spinlock in this sort of code.

Another use of this pattern can be found in the Open vSwitch driver's
*connection tracking table* code in ``ct_limit_set()``.  The table holds
connection tracking entries and has a limit on the maximum number of entries.
There is one such table per zone and hence one *limit* per zone.  The zones are
mapped to their limits through a hashtable using an RCU-managed hlist for the
hash chains. When a new limit is set, a new limit object is allocated and
``ct_limit_set()`` is called to replace the old limit object with the new one
using hlist_replace_rcu().  The old limit object is then freed after a grace
period using kfree_rcu().
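
A minimal, self-contained sketch of that replace-then-free pattern is shown
below. The ``zone_limit`` structure and ``zone_limit_set()`` function are
hypothetical stand-ins for the Open vSwitch code, and updaters are assumed to
be serialized by the caller::

	struct zone_limit {
		struct hlist_node hlist_node;
		struct rcu_head rcu;
		u16 zone;
		u32 limit;
	};

	/* Caller serializes all updaters; readers use RCU only. */
	static void zone_limit_set(struct hlist_head *head,
				   struct zone_limit *new)
	{
		struct zone_limit *old;

		hlist_for_each_entry_rcu(old, head, hlist_node) {
			if (old->zone == new->zone) {
				/* Publish the new limit, then free the old one
				 * only after a grace period has elapsed. */
				hlist_replace_rcu(&old->hlist_node,
						  &new->hlist_node);
				kfree_rcu(old, rcu);
				return;
			}
		}
		hlist_add_head_rcu(&new->hlist_node, head);
	}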


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state.  Because there is a delay between the
time the external state changes and the time Linux becomes aware of the
change, additional RCU-induced staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC code (see the shm_lock()
function in ipc/shm.c).  This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist.  For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.

.. _quick_quiz:

Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were ever to need to reject stale data, one
way to accomplish this would be to add a ``deleted`` flag and a ``lock``
spinlock to the audit_entry structure, and modify ``audit_filter_task()``
as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		rcu_read_lock();
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				spin_lock(&e->lock);
				if (e->deleted) {
					spin_unlock(&e->lock);
					rcu_read_unlock();
					return AUDIT_BUILD_CONTEXT;
				}
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

Note that this example assumes that entries are only added and deleted.
An additional mechanism is required to deal correctly with the update-in-place
performed by ``audit_upd_rule()``.  For one thing, ``audit_upd_rule()`` would
need additional memory barriers to ensure that the list_add_rcu() was really
executed before the list_del_rcu().
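
One way to avoid explicit barriers, sketched here rather than taken from the
actual audit code, is to reuse Example 3's copy-and-replace approach and also
mark the old entry as deleted under its per-entry lock. list_replace_rcu()
publishes the new entry with the required ordering, and readers that have
already found the old entry will see its ``deleted`` flag::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		/* Caller holds audit_filter_mutex, serializing all updaters. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
				if (ne == NULL)
					return -ENOMEM;
				audit_copy_rule(&ne->rule, &e->rule);
				ne->rule.action = newaction;
				ne->rule.field_count = newfield_count;
				spin_lock_init(&ne->lock);
				ne->deleted = 0;
				spin_lock(&e->lock);
				list_replace_rcu(&e->list, &ne->list);
				e->deleted = 1;
				spin_unlock(&e->lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}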

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this
		 * is the only deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				spin_lock(&e->lock);
				list_del_rcu(&e->list);
				e->deleted = 1;
				spin_unlock(&e->lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

This too assumes that the caller holds ``audit_filter_mutex``.


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping stale
objects during read-side list traversal, where an object is stale if it is
pending destruction after one or more grace periods. One such example can be
found in the timerfd subsystem. When a ``CLOCK_REALTIME`` clock is
reprogrammed, for example due to setting of the system time, all programmed
timerfds that depend on this clock get triggered and processes waiting on them
to expire are woken up in advance of their scheduled expiry. To facilitate
this, all such timers are added to an RCU-managed ``cancel_list`` when they are
set up in ``timerfd_setup_cancel()``::

	static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
	{
		spin_lock(&ctx->cancel_lock);
		if (ctx->clockid == CLOCK_REALTIME &&
		    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
			if (!ctx->might_cancel) {
				ctx->might_cancel = true;
				spin_lock(&cancel_lock);
				list_add_rcu(&ctx->clist, &cancel_list);
				spin_unlock(&cancel_lock);
			}
		}
		spin_unlock(&ctx->cancel_lock);
	}

When a timerfd is freed (its fd is closed), the ``might_cancel`` flag of the
timerfd object is cleared, and the object is removed from the ``cancel_list``
and destroyed::

	int timerfd_release(struct inode *inode, struct file *file)
	{
		struct timerfd_ctx *ctx = file->private_data;

		spin_lock(&ctx->cancel_lock);
		if (ctx->might_cancel) {
			ctx->might_cancel = false;
			spin_lock(&cancel_lock);
			list_del_rcu(&ctx->clist);
			spin_unlock(&cancel_lock);
		}
		spin_unlock(&ctx->cancel_lock);

		hrtimer_cancel(&ctx->t.tmr);
		kfree_rcu(ctx, rcu);
		return 0;
	}

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
hrtimer framework calls ``timerfd_clock_was_set()``, which walks the
``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
objects::

	void timerfd_clock_was_set(void)
	{
		struct timerfd_ctx *ctx;
		unsigned long flags;

		rcu_read_lock();
		list_for_each_entry_rcu(ctx, &cancel_list, clist) {
			if (!ctx->might_cancel)
				continue;
			spin_lock_irqsave(&ctx->wqh.lock, flags);
			if (ctx->moffs != ktime_mono_to_real(0)) {
				ctx->moffs = KTIME_MAX;
				ctx->ticks++;
				wake_up_locked_poll(&ctx->wqh, EPOLLIN);
			}
			spin_unlock_irqrestore(&ctx->wqh.lock, flags);
		}
		rcu_read_unlock();
	}

The key point here is that, because the RCU traversal of the ``cancel_list``
happens concurrently with additions to and removals from the list, the
traversal can sometimes encounter an object that has already been removed from
the list. In this example, such stale objects are simply skipped with the help
of the ``might_cancel`` flag.


Summary
-------

Read-mostly list-based data structures that can tolerate stale data are
the most amenable to use of RCU.  The simplest case is where entries are
either added to or deleted from the data structure (or atomically modified
in place), but non-atomic in-place modifications can be handled by making
a copy, updating the copy, then replacing the original with the copy.
If stale data cannot be tolerated, then a *deleted* flag may be used
in conjunction with a per-entry spinlock in order to allow the search
function to reject newly deleted data.

.. _quick_quiz_answer:

Answer to Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

	If the search function drops the per-entry lock before returning,
	then the caller will be processing stale data in any case.  If it
	is really OK to be processing stale data, then you don't need a
	*deleted* flag.  If processing stale data really is a problem,
	then you need to hold the per-entry lock across all of the code
	that uses the value that was returned.

:ref:`Back to Quick Quiz <quick_quiz>`