=======================================================
Semantics and Behavior of Atomic and Bitmask Operations
=======================================================

:Author: David S. Miller

This document is intended to serve as a guide to Linux port
maintainers on how to implement atomic counter, bitops, and spinlock
interfaces properly.

Atomic Type And Operations
==========================

The atomic_t type should be defined as a signed integer and
the atomic_long_t type as a signed long integer.  Also, they should
be made opaque such that any kind of cast to a normal C integer type
will fail.  Something like the following should suffice::

	typedef struct { int counter; } atomic_t;
	typedef struct { long counter; } atomic_long_t;

Historically, counter has been declared volatile.  This is now discouraged.
See :ref:`Documentation/process/volatile-considered-harmful.rst
<volatile_considered_harmful>` for the complete rationale.

local_t is very similar to atomic_t. If the counter is per CPU and only
updated by one CPU, local_t is probably more appropriate. Please see
:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of
local_t.

The first operations to implement for atomic_t's are the initializers and
plain writes. ::

	#define ATOMIC_INIT(i)		{ (i) }
	#define atomic_set(v, i)	((v)->counter = (i))

The first macro is used in definitions, such as::

	static atomic_t my_counter = ATOMIC_INIT(1);
The initializer is atomic in that the return values of the atomic
operations are guaranteed to correctly reflect the initialized value if
the initializer is used before runtime.  If the initializer is used at
runtime, a proper implicit or explicit read memory barrier is needed
before reading the value with atomic_read from another thread.

As with all of the ``atomic_`` interfaces, replace the leading ``atomic_``
with ``atomic_long_`` to operate on atomic_long_t.

The second interface can be used at runtime, as in::

	struct foo { atomic_t counter; };
	...

	struct foo *k;

	k = kmalloc(sizeof(*k), GFP_KERNEL);
	if (!k)
		return -ENOMEM;
	atomic_set(&k->counter, 0);

The setting is atomic in that the return values of the atomic operations by
all threads are guaranteed to correctly reflect either the value that has
been set with this operation or set with another operation.  A proper implicit
or explicit memory barrier is needed before the value set with the operation
is guaranteed to be readable with atomic_read from another thread.

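For illustration, a minimal sketch of such a hand-off between two CPUs might
look like the following (the ``ready`` flag and the field names are purely
illustrative, not part of any existing interface)::

	/* CPU 0: initialize, then publish */
	atomic_set(&k->counter, 5);
	smp_mb();			/* order the set before the flag */
	WRITE_ONCE(k->ready, 1);

	/* CPU 1: wait for publication, then read */
	while (!READ_ONCE(k->ready))
		cpu_relax();
	smp_mb();			/* pairs with the barrier on CPU 0 */
	val = atomic_read(&k->counter);	/* guaranteed to observe 5 */
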
Next, we have::

	#define atomic_read(v)	((v)->counter)

which simply reads the counter value currently visible to the calling thread.
The read is atomic in that the return value is guaranteed to be one of the
values initialized or modified with the interface operations, provided that a
proper implicit or explicit memory barrier is used after any possible runtime
initialization by another thread and that the value is modified only with the
interface operations.  atomic_read does not itself guarantee that a runtime
initialization by another thread is visible yet, so the user of the
interface must take care of that with a proper implicit or explicit memory
barrier.

.. warning::

	``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS!

	Some architectures may choose to use the volatile keyword, barriers, or
	inline assembly to guarantee some degree of immediacy for atomic_read()
	and atomic_set().  This is not uniformly guaranteed, and may change in
	the future, so all users of atomic_t should treat atomic_read() and
	atomic_set() as simple C statements that may be reordered or optimized
	away entirely by the compiler or processor, and explicitly invoke the
	appropriate compiler and/or memory barrier for each use case.  Failure
	to do so will result in code that may suddenly break when used with
	different architectures or compiler optimizations, or even changes in
	unrelated code which changes how the compiler optimizes the section
	accessing atomic_t variables.

Properly aligned pointers, longs, ints, and chars (and unsigned
equivalents) may be atomically loaded from and stored to in the same
sense as described for atomic_read() and atomic_set().  The READ_ONCE()
and WRITE_ONCE() macros should be used to prevent the compiler from using
optimizations that might otherwise optimize accesses out of existence on
the one hand, or that might create unsolicited accesses on the other.

For example, consider the following code::

	while (a > 0)
		do_something();

If the compiler can prove that do_something() does not store to the
variable a, then the compiler is within its rights to transform this into
the following::

	if (a > 0)
		for (;;)
			do_something();

If you don't want the compiler to do this (and you probably don't), then
you should use something like the following::

	while (READ_ONCE(a) > 0)
		do_something();

Alternatively, you could place a barrier() call in the loop.

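A compiler barrier in the loop body forces the compiler to assume that memory
may have changed, so the value of a is re-read on every iteration (a minimal
sketch)::

	while (a > 0) {
		do_something();
		barrier();	/* compiler may not cache the value of a */
	}
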
For another example, consider the following code::

	tmp_a = a;
	do_something_with(tmp_a);
	do_something_else_with(tmp_a);

If the compiler can prove that do_something_with() does not store to the
variable a, then the compiler is within its rights to manufacture an
additional load as follows::

	tmp_a = a;
	do_something_with(tmp_a);
	tmp_a = a;
	do_something_else_with(tmp_a);

This could fatally confuse your code if it expected the same value
to be passed to do_something_with() and do_something_else_with().

The compiler would be likely to manufacture this additional load if
do_something_with() was an inline function that made very heavy use
of registers: reloading from variable a could save a flush to the
stack and later reload.  To prevent the compiler from attacking your
code in this manner, write the following::

	tmp_a = READ_ONCE(a);
	do_something_with(tmp_a);
	do_something_else_with(tmp_a);

For a final example, consider the following code, assuming that the
variable a is set at boot time before the second CPU is brought online
and never changed later, so that memory barriers are not needed::

	if (a)
		b = 9;
	else
		b = 42;

The compiler is within its rights to manufacture an additional store
by transforming the above code into the following::

	b = 42;
	if (a)
		b = 9;

This could come as a fatal surprise to other code running concurrently
that expected b to never have the value 42 if a was zero.  To prevent
the compiler from doing this, write something like::

	if (a)
		WRITE_ONCE(b, 9);
	else
		WRITE_ONCE(b, 42);

Don't even -think- about doing this without proper use of memory barriers,
locks, or atomic operations if variable a can change at runtime!

.. warning::

	``READ_ONCE()`` and ``WRITE_ONCE()`` DO NOT IMPLY BARRIERS!

Now, we move on to the atomic operation interfaces typically implemented with
the help of assembly code. ::

	void atomic_add(int i, atomic_t *v);
	void atomic_sub(int i, atomic_t *v);
	void atomic_inc(atomic_t *v);
	void atomic_dec(atomic_t *v);

These four routines add and subtract integral values to/from the given
atomic_t value.  The first two routines pass explicit integers by
which to make the adjustment, whereas the latter two use an implicit
adjustment value of "1".

One very important aspect of these routines is that they DO NOT
require any explicit memory barriers.  They need only perform the
atomic_t counter update in an SMP safe manner.

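For instance, they are a natural fit for simple statistics counters where no
ordering against other memory operations is needed (the names below are
purely illustrative)::

	static atomic_t nr_events = ATOMIC_INIT(0);

	void record_event(void)
	{
		atomic_inc(&nr_events);		/* SMP safe, no barriers implied */
	}
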
Next, we have::

	int atomic_inc_return(atomic_t *v);
	int atomic_dec_return(atomic_t *v);

These routines add 1 and subtract 1, respectively, from the given
atomic_t and return the new counter value after the operation is
performed.

Unlike the above routines, it is required that these primitives
include explicit memory barriers that are performed before and after
the operation.  It must be done such that all memory operations before
and after the atomic operation calls are strongly ordered with respect
to the atomic operation itself.

For example, it should behave as if an smp_mb() call existed both
before and after the atomic operation.

If the atomic instructions used in an implementation provide explicit
memory barrier semantics which satisfy the above requirements, that is
fine as well.

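As an illustration, value-returning increments are often used to hand out
unique sequence numbers (a hypothetical sketch)::

	static atomic_t next_id = ATOMIC_INIT(0);

	int allocate_id(void)
	{
		return atomic_inc_return(&next_id);	/* fully ordered */
	}
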
Let's move on::

	int atomic_add_return(int i, atomic_t *v);
	int atomic_sub_return(int i, atomic_t *v);

These behave just like atomic_{inc,dec}_return() except that an
explicit counter adjustment is given instead of the implicit "1".
This means that like atomic_{inc,dec}_return(), the memory barrier
semantics are required.

Next::

	int atomic_inc_and_test(atomic_t *v);
	int atomic_dec_and_test(atomic_t *v);

These two routines increment and decrement by 1, respectively, the
given atomic counter.  They return a boolean indicating whether the
resulting counter value was zero or not.

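For example, atomic_dec_and_test() is the usual building block for reference
counting (a minimal sketch; the object and field names are illustrative)::

	if (atomic_dec_and_test(&obj->refcnt))
		kfree(obj);		/* we dropped the last reference */
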
Again, these primitives provide explicit memory barrier semantics around
the atomic operation::

	int atomic_sub_and_test(int i, atomic_t *v);

This is identical to atomic_dec_and_test() except that an explicit
decrement is given instead of the implicit "1".  This primitive must
provide explicit memory barrier semantics around the operation::

	int atomic_add_negative(int i, atomic_t *v);

The given increment is added to the given atomic counter value.  A boolean
is returned which indicates whether the resulting counter value is negative.
This primitive must provide explicit memory barrier semantics around
the operation.

Then::

	int atomic_xchg(atomic_t *v, int new);

This performs an atomic exchange operation on the atomic variable v, setting
the given new value.  It returns the old value that the atomic variable v had
just before the operation.

atomic_xchg must provide explicit memory barriers around the operation. ::

	int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value v,
with the given old and new values. Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of \*v are performed through atomic_xxx operations.

atomic_cmpxchg must provide explicit memory barriers around the operation,
although if the comparison fails then no memory ordering guarantees are
required.

The semantics for atomic_cmpxchg are the same as those defined for 'cas'
below.

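As an illustration, atomic_cmpxchg() is typically used in a read-modify-compare
loop; the hypothetical helper below increments a counter only while it stays
under a limit::

	static int add_unless_over_limit(atomic_t *v, int limit)
	{
		int old, new;

		do {
			old = atomic_read(v);
			if (old >= limit)
				return 0;	/* give up, limit reached */
			new = old + 1;
		} while (atomic_cmpxchg(v, old, new) != old);

		return 1;			/* successfully incremented */
	}
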
Finally::

	int atomic_add_unless(atomic_t *v, int a, int u);

If the atomic value v is not equal to u, this function adds a to v, and
returns non-zero. If v is equal to u then it returns zero. This is done as
an atomic operation.

atomic_add_unless must provide explicit memory barriers around the
operation unless it fails (returns 0).

atomic_inc_not_zero() is equivalent to atomic_add_unless(v, 1, 0).

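A common use of atomic_inc_not_zero() is taking a new reference only if the
object has not already started dying (a minimal sketch with illustrative
names)::

	if (!atomic_inc_not_zero(&obj->refcnt))
		return NULL;	/* object is already being torn down */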


If a caller requires memory barrier semantics around an atomic_t
operation which does not return a value, a set of interfaces are
defined which accomplish this::

	void smp_mb__before_atomic(void);
	void smp_mb__after_atomic(void);

Preceding a non-value-returning read-modify-write atomic operation with
smp_mb__before_atomic() and following it with smp_mb__after_atomic()
provides the same full ordering that is provided by value-returning
read-modify-write atomic operations.

For example, smp_mb__before_atomic() can be used like so::

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation.  In the above example, it guarantees that the assignment of
"1" to obj->dead will be globally visible to other cpus before the
atomic counter decrement.

Without the explicit smp_mb__before_atomic() call, the
implementation could legally allow the atomic counter update to become
visible to other cpus before the "obj->dead = 1;" assignment.

A missing memory barrier in the cases where they are required by the
atomic_t implementation above can have disastrous results.  Here is
an example, which follows a pattern occurring frequently in the Linux
kernel.  It is the use of atomic counters to implement reference
counting, and it works such that once the counter falls to zero it can
be guaranteed that no other entity can be accessing the object::

	static void obj_list_add(struct obj *obj, struct list_head *head)
	{
		obj->active = 1;
		list_add(&obj->list, head);
	}

	static void obj_list_del(struct obj *obj)
	{
		list_del(&obj->list);
		obj->active = 0;
	}

	static void obj_destroy(struct obj *obj)
	{
		BUG_ON(obj->active);
		kfree(obj);
	}

	struct obj *obj_list_peek(struct list_head *head)
	{
		if (!list_empty(head)) {
			struct obj *obj;

			obj = list_entry(head->next, struct obj, list);
			atomic_inc(&obj->refcnt);
			return obj;
		}
		return NULL;
	}

	void obj_poke(void)
	{
		struct obj *obj;

		spin_lock(&global_list_lock);
		obj = obj_list_peek(&global_list);
		spin_unlock(&global_list_lock);

		if (obj) {
			obj->ops->poke(obj);
			if (atomic_dec_and_test(&obj->refcnt))
				obj_destroy(obj);
		}
	}

	void obj_timeout(struct obj *obj)
	{
		spin_lock(&global_list_lock);
		obj_list_del(obj);
		spin_unlock(&global_list_lock);

		if (atomic_dec_and_test(&obj->refcnt))
			obj_destroy(obj);
	}

.. note::

	This is a simplification of the ARP queue management in the generic
	neighbour discovery code of the networking layer.  Olaf Kirch found a
	bug with respect to memory barriers in kfree_skb() that exposed the
	atomic_t memory barrier requirements quite clearly.

Given the above scheme, it must be the case that the obj->active
update done by the obj list deletion be visible to other processors
before the atomic counter decrement is performed.

Otherwise, the counter could fall to zero, yet obj->active would still
be set, thus triggering the assertion in obj_destroy().  The error
sequence looks like this::

	cpu 0				cpu 1
	obj_poke()			obj_timeout()
	obj = obj_list_peek();
	... gains ref to obj, refcnt=2
					obj_list_del(obj);
					obj->active = 0 ...
					... visibility delayed ...
					atomic_dec_and_test()
					... refcnt drops to 1 ...
	atomic_dec_and_test()
	... refcnt drops to 0 ...
	obj_destroy()
	BUG() triggers since obj->active
	still seen as one
					obj->active update visibility occurs

With the memory barrier semantics required of the atomic_t operations
which return values, the above sequence of memory visibility can never
happen.  Specifically, in the above case the atomic_dec_and_test()
counter decrement would not become globally visible until the
obj->active update does.

As a historical note, 32-bit Sparc used to only allow usage of
24-bits of its atomic_t type.  This was because it used 8 bits
as a spinlock for SMP safety.  Sparc32 lacked a "compare and swap"
type instruction.  However, 32-bit Sparc has since been moved over
to a "hash table of spinlocks" scheme that allows the full 32-bit
counter to be realized.  Essentially, an array of spinlocks is
indexed into based upon the address of the atomic_t being operated
on, and that lock protects the atomic operation.  Parisc uses the
same scheme.

Another note is that the atomic_t operations returning values are
extremely slow on an old 386.


Atomic Bitmask
==============

We will now cover the atomic bitmask operations.  You will find that
their SMP and memory barrier semantics are similar in shape and scope
to the atomic_t ops above.

Native atomic bit operations are defined to operate on objects aligned
to the size of an "unsigned long" C data type, and are at least of that
size.  The endianness of the bits within each "unsigned long" is the
native endianness of the cpu. ::

	void set_bit(unsigned long nr, volatile unsigned long *addr);
	void clear_bit(unsigned long nr, volatile unsigned long *addr);
	void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number
indicated by "nr" in the bit mask pointed to by "addr".

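For example, flag words are commonly manipulated like this (the structure and
bit number are illustrative)::

	struct obj {
		unsigned long flags;
	};

	#define OBJ_DYING	0	/* bit number within obj->flags */

	set_bit(OBJ_DYING, &obj->flags);	/* atomically set the bit */
	clear_bit(OBJ_DYING, &obj->flags);	/* atomically clear it again */
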
They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces. ::

	int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.


.. warning::

        It is incredibly important that the value be a boolean, i.e. "0" or "1".
        Do not try to be fancy and save a few instructions by declaring the
        above to return "long" and just returning something like "old_val &
        mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up is the thread_info
flag operations.  Routines such as test_and_set_ti_thread_flag() chop
the return value into an int.  There are other places where things
like this occur as well.

These routines, like the atomic_t counter operations returning values,
must provide explicit memory barrier semantics around their execution.
All memory operations before the atomic bit operation call must be
made visible globally before the atomic bit operation is made visible.
Likewise, the atomic bit operation must be visible globally before any
subsequent memory operation is made visible.  For example::

	obj->dead = 1;
	if (test_and_set_bit(0, &obj->flags))
		/* ... */;
	obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible.  Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

Finally, there is the basic operation::

	int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);

This returns a boolean indicating whether bit "nr" is set in the bitmask
pointed to by "addr".

If explicit memory barriers are required around {set,clear}_bit() (which do
not return a value, and thus do not need to provide memory barrier
semantics), two interfaces are provided::

	void smp_mb__before_atomic(void);
	void smp_mb__after_atomic(void);

They are used as follows, and are akin to their atomic_t operation
counterparts::

	/* All memory operations before this call will
	 * be globally visible before the clear_bit().
	 */
	smp_mb__before_atomic();
	clear_bit( ... );

	/* The clear_bit() will be visible before all
	 * subsequent memory operations.
	 */
	smp_mb__after_atomic();

There are two special bitops with lock barrier semantics (acquire/release,
same as spinlocks). These operate in the same way as their non-_lock/_unlock
suffixed variants, except that they provide acquire/release semantics,
respectively. This means they can be used for bit_spin_trylock and
bit_spin_unlock type operations without specifying any more barriers. ::

	int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
	void clear_bit_unlock(unsigned long nr, unsigned long *addr);
	void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

The __clear_bit_unlock version is non-atomic, however it still implements
unlock barrier semantics. This can be useful if the lock itself is protecting
the other bits in the word.

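A minimal bit-lock sketch built on these primitives (the word and bit number
are illustrative) might look like::

	while (test_and_set_bit_lock(0, &word))
		cpu_relax();		/* spin until the bit was observed clear */

	/* ... critical section; the other bits of "word" are protected ... */

	clear_bit_unlock(0, &word);	/* release with unlock (release) semantics */
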
Finally, there are non-atomic versions of the bitmask operations
provided.  They are used in contexts where some other higher-level SMP
locking scheme is being used to protect the bitmask, and thus less
expensive non-atomic operations may be used in the implementation.
They have names similar to the above bitmask operation interfaces,
except that two underscores are prefixed to the interface name. ::

	void __set_bit(unsigned long nr, volatile unsigned long *addr);
	void __clear_bit(unsigned long nr, volatile unsigned long *addr);
	void __change_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

These non-atomic variants also do not require any special memory
barrier semantics.

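For example, if a spinlock already serializes all writers of a flag word, the
cheaper non-atomic variants are sufficient (a sketch with illustrative names)::

	spin_lock(&obj->lock);
	__set_bit(OBJ_DYING, &obj->flags);	/* the lock serializes all updates */
	spin_unlock(&obj->lock);
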
The routines xchg() and cmpxchg() must provide the same exact
memory-barrier semantics as the atomic and bit operations returning
values.

.. note::

	If someone wants to use xchg(), cmpxchg() and their variants,
	linux/atomic.h should be included rather than asm/cmpxchg.h, unless the
	code is in arch/* and can take care of itself.

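For instance, xchg() is often used to atomically take ownership of a pointer
(a minimal sketch; the variable, type, and helper names are illustrative)::

	#include <linux/atomic.h>

	static struct work *pending_work;	/* shared, illustrative */

	struct work *old;

	old = xchg(&pending_work, NULL);	/* atomically claim the pending item */
	if (old)
		process(old);
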
Spinlocks and rwlocks have memory barrier expectations as well.
The rule to follow is simple:

1) When acquiring a lock, the implementation must make it globally
   visible before any subsequent memory operation.

2) When releasing a lock, the implementation must make it such that
   all previous memory operations are globally visible before the
   lock release.

Which finally brings us to _atomic_dec_and_lock().  There is an
architecture-neutral version implemented in lib/dec_and_lock.c,
but most platforms will wish to optimize this in assembler. ::

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);

Atomically decrement the given counter; if the result will be zero,
atomically acquire the given spinlock and then perform the decrement
of the counter to zero.  If it does not drop to zero, do nothing
with the spinlock.

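A typical caller-side pattern (the object, list, and lock names are
illustrative) looks like::

	if (_atomic_dec_and_lock(&obj->refcnt, &obj_list_lock)) {
		/* Last reference dropped and obj_list_lock is now held. */
		list_del(&obj->list);
		spin_unlock(&obj_list_lock);
		kfree(obj);
	}
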
It is actually pretty simple to get the memory barrier correct.
Simply satisfy the spinlock grab requirements, which is to make
sure the spinlock operation is globally visible before any
subsequent memory operation.

We can demonstrate this operation more clearly if we define
an abstract atomic operation::

	long cas(long *mem, long old, long new);

"cas" stands for "compare and swap".  It atomically:

1) Compares "old" with the value currently at "mem".
2) If they are equal, "new" is written to "mem".
3) Regardless, the current value at "mem" is returned.

As an example usage, here is what an atomic counter update
might look like::

	void example_atomic_inc(long *counter)
	{
		long old, new, ret;

		while (1) {
			old = *counter;
			new = old + 1;

			ret = cas(counter, old, new);
			if (ret == old)
				break;
		}
	}

Let's use cas() in order to build a pseudo-C atomic_dec_and_lock()::

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
	{
		long old, new, ret;
		int went_to_zero;

		went_to_zero = 0;
		while (1) {
			old = atomic_read(atomic);
			new = old - 1;
			if (new == 0) {
				went_to_zero = 1;
				spin_lock(lock);
			}
			ret = cas(atomic, old, new);
			if (ret == old)
				break;
			if (went_to_zero) {
				spin_unlock(lock);
				went_to_zero = 0;
			}
		}

		return went_to_zero;
	}

Now, as far as memory barriers go, as long as spin_lock()
strictly orders all subsequent memory operations (including
the cas()) with respect to itself, things will be fine.

Said another way, _atomic_dec_and_lock() must guarantee that
a counter dropping to zero is never made visible before the
spinlock is acquired.

.. note::

	This also means that for the case where the counter is not
	dropping to zero, there are no memory ordering requirements.