=======================================================
Semantics and Behavior of Atomic and Bitmask Operations
=======================================================

:Author: David S. Miller

This document is intended to serve as a guide to Linux port
maintainers on how to implement atomic counter, bitops, and spinlock
interfaces properly.

Atomic Type And Operations
==========================

The atomic_t type should be defined as a signed integer and
the atomic_long_t type as a signed long integer.  Also, they should
be made opaque such that any kind of cast to a normal C integer type
will fail.  Something like the following should suffice::

	typedef struct { int counter; } atomic_t;
	typedef struct { long counter; } atomic_long_t;

Historically, counter has been declared volatile.  This is now discouraged.
See :ref:`Documentation/process/volatile-considered-harmful.rst
<volatile_considered_harmful>` for the complete rationale.

local_t is very similar to atomic_t. If the counter is per CPU and only
updated by one CPU, local_t is probably more appropriate. Please see
:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of
local_t.

The first operations to implement for atomic_t's are the initializers and
plain writes. ::

	#define ATOMIC_INIT(i)		{ (i) }
	#define atomic_set(v, i)	((v)->counter = (i))

The first macro is used in definitions, such as::

	static atomic_t my_counter = ATOMIC_INIT(1);

The initializer is atomic in that the return values of the atomic operations
are guaranteed to be correct reflecting the initialized value if the
initializer is used before runtime.  If the initializer is used at runtime, a
proper implicit or explicit read memory barrier is needed before reading the
value with atomic_read from another thread.

As with all of the ``atomic_`` interfaces, replace the leading ``atomic_``
with ``atomic_long_`` to operate on atomic_long_t.

The second interface can be used at runtime, as in::

	struct foo { atomic_t counter; };
	...

	struct foo *k;

	k = kmalloc(sizeof(*k), GFP_KERNEL);
	if (!k)
		return -ENOMEM;
	atomic_set(&k->counter, 0);

The setting is atomic in that the return values of the atomic operations by
all threads are guaranteed to be correct reflecting either the value that has
been set with this operation or set with another operation.  A proper implicit
or explicit memory barrier is needed before the value set with the operation
is guaranteed to be readable with atomic_read from another thread.
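
For instance, the following minimal sketch publishes a value stored with
atomic_set() to a reader running on another CPU.  The ``published`` flag and
``val`` variable are hypothetical and exist only for illustration::

	/* writer */
	atomic_set(&k->counter, 5);
	smp_wmb();			/* order the counter store before the flag */
	WRITE_ONCE(published, 1);

	/* reader, on another CPU */
	if (READ_ONCE(published)) {
		smp_rmb();		/* pairs with the smp_wmb() above */
		val = atomic_read(&k->counter);	/* observes 5 */
	}

Without such a pairing of barriers, the reader may legally observe the flag
before it observes the new counter value.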

Next, we have::

	#define atomic_read(v)	((v)->counter)

which simply reads the counter value currently visible to the calling thread.
The read is atomic in that the return value is guaranteed to be one of the
values initialized or modified with the interface operations if a proper
implicit or explicit memory barrier is used after possible runtime
initialization by any other thread and the value is modified only with the
interface operations.  atomic_read does not guarantee that the runtime
initialization by any other thread is visible yet, so the user of the
interface must take care of that with a proper implicit or explicit memory
barrier.

.. warning::

	``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS!

	Some architectures may choose to use the volatile keyword, barriers, or
	inline assembly to guarantee some degree of immediacy for atomic_read()
	and atomic_set().  This is not uniformly guaranteed, and may change in
	the future, so all users of atomic_t should treat atomic_read() and
	atomic_set() as simple C statements that may be reordered or optimized
	away entirely by the compiler or processor, and explicitly invoke the
	appropriate compiler and/or memory barrier for each use case.  Failure
	to do so will result in code that may suddenly break when used with
	different architectures or compiler optimizations, or even changes in
	unrelated code which changes how the compiler optimizes the section
	accessing atomic_t variables.

Properly aligned pointers, longs, ints, and chars (and unsigned
equivalents) may be atomically loaded from and stored to in the same
sense as described for atomic_read() and atomic_set().  The READ_ONCE()
and WRITE_ONCE() macros should be used to prevent the compiler from using
optimizations that might otherwise optimize accesses out of existence on
the one hand, or that might create unsolicited accesses on the other.

For example, consider the following code::

	while (a > 0)
		do_something();

If the compiler can prove that do_something() does not store to the
variable a, then the compiler is within its rights to transform this into
the following::

	if (a > 0)
		for (;;)
			do_something();

If you don't want the compiler to do this (and you probably don't), then
you should use something like the following::

	while (READ_ONCE(a) > 0)
		do_something();

Alternatively, you could place a barrier() call in the loop.

For another example, consider the following code::

	tmp_a = a;
	do_something_with(tmp_a);
	do_something_else_with(tmp_a);

If the compiler can prove that do_something_with() does not store to the
variable a, then the compiler is within its rights to manufacture an
additional load as follows::

	tmp_a = a;
	do_something_with(tmp_a);
	tmp_a = a;
	do_something_else_with(tmp_a);

This could fatally confuse your code if it expected the same value
to be passed to do_something_with() and do_something_else_with().

The compiler would be likely to manufacture this additional load if
do_something_with() was an inline function that made very heavy use
of registers: reloading from variable a could save a flush to the
stack and later reload.  To prevent the compiler from attacking your
code in this manner, write the following::

	tmp_a = READ_ONCE(a);
	do_something_with(tmp_a);
	do_something_else_with(tmp_a);

For a final example, consider the following code, assuming that the
variable a is set at boot time before the second CPU is brought online
and never changed later, so that memory barriers are not needed::

	if (a)
		b = 9;
	else
		b = 42;

The compiler is within its rights to manufacture an additional store
by transforming the above code into the following::

	b = 42;
	if (a)
		b = 9;

This could come as a fatal surprise to other code running concurrently
that expected b to never have the value 42 if a was zero.
To prevent the compiler from doing this, write something like::

	if (a)
		WRITE_ONCE(b, 9);
	else
		WRITE_ONCE(b, 42);

Don't even -think- about doing this without proper use of memory barriers,
locks, or atomic operations if variable a can change at runtime!

.. warning::

	``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER!

Now, we move on to the atomic operation interfaces typically implemented with
the help of assembly code. ::

	void atomic_add(int i, atomic_t *v);
	void atomic_sub(int i, atomic_t *v);
	void atomic_inc(atomic_t *v);
	void atomic_dec(atomic_t *v);

These four routines add and subtract integral values to/from the given
atomic_t value.  The first two routines pass explicit integers by
which to make the adjustment, whereas the latter two use an implicit
adjustment value of "1".

One very important aspect of these routines is that they DO NOT
require any explicit memory barriers.  They need only perform the
atomic_t counter update in an SMP safe manner.

Next, we have::

	int atomic_inc_return(atomic_t *v);
	int atomic_dec_return(atomic_t *v);

These routines add 1 and subtract 1, respectively, from the given
atomic_t and return the new counter value after the operation is
performed.

Unlike the above routines, it is required that these primitives
include explicit memory barriers that are performed before and after
the operation.  It must be done such that all memory operations before
and after the atomic operation calls are strongly ordered with respect
to the atomic operation itself.

For example, it should behave as if a smp_mb() call existed both
before and after the atomic operation.

If the atomic instructions used in an implementation provide explicit
memory barrier semantics which satisfy the above requirements, that is
fine as well.

Let's move on::

	int atomic_add_return(int i, atomic_t *v);
	int atomic_sub_return(int i, atomic_t *v);

These behave just like atomic_{inc,dec}_return() except that an
explicit counter adjustment is given instead of the implicit "1".
This means that like atomic_{inc,dec}_return(), the memory barrier
semantics are required.
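
As a rough, hedged sketch only (not the actual kernel implementation): a
port whose add-return instruction is itself unordered could satisfy this
requirement by bracketing a hypothetical relaxed primitive,
arch_add_return_relaxed(), with full barriers::

	static inline int atomic_add_return(int i, atomic_t *v)
	{
		int ret;

		smp_mb();	/* order everything before the update */
		ret = arch_add_return_relaxed(i, v);	/* hypothetical unordered primitive */
		smp_mb();	/* order the update before everything after */

		return ret;
	}

Ports whose atomic instructions are already fully ordered need no such
bracketing, as noted above.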

Next::

	int atomic_inc_and_test(atomic_t *v);
	int atomic_dec_and_test(atomic_t *v);

These two routines increment and decrement by 1, respectively, the
given atomic counter.  They return a boolean indicating whether the
resulting counter value was zero or not.

Again, these primitives provide explicit memory barrier semantics around
the atomic operation::

	int atomic_sub_and_test(int i, atomic_t *v);

This is identical to atomic_dec_and_test() except that an explicit
decrement is given instead of the implicit "1".  This primitive must
provide explicit memory barrier semantics around the operation::

	int atomic_add_negative(int i, atomic_t *v);

The given increment is added to the given atomic counter value.  A boolean
is returned which indicates whether the resulting counter value is negative.
This primitive must provide explicit memory barrier semantics around
the operation.

Then::

	int atomic_xchg(atomic_t *v, int new);

This performs an atomic exchange operation on the atomic variable v, setting
the given new value.  It returns the old value that the atomic variable v had
just before the operation.

atomic_xchg must provide explicit memory barriers around the operation. ::

	int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value v,
with the given old and new values.  Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of \*v are performed through atomic_xxx operations.

atomic_cmpxchg must provide explicit memory barriers around the operation,
although if the comparison fails then no memory ordering guarantees are
required.

The semantics for atomic_cmpxchg are the same as those defined for 'cas'
below.

Finally::

	int atomic_add_unless(atomic_t *v, int a, int u);

If the atomic value v is not equal to u, this function adds a to v, and
returns non-zero.  If v is equal to u then it returns zero.  This is done as
an atomic operation.

atomic_add_unless must provide explicit memory barriers around the
operation unless it fails (returns 0).
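
As a hedged illustration (a sketch, not the actual kernel implementation),
atomic_add_unless() can be built on top of atomic_cmpxchg(), inheriting the
required barrier semantics from the successful cmpxchg::

	static inline int atomic_add_unless(atomic_t *v, int a, int u)
	{
		int c = atomic_read(v);

		while (c != u) {
			int old = atomic_cmpxchg(v, c, c + a);

			if (old == c)
				return 1;	/* added; cmpxchg supplied the barriers */
			c = old;		/* lost a race, retry with the fresh value */
		}
		return 0;			/* v was equal to u, nothing was done */
	}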

atomic_inc_not_zero(v) is equivalent to atomic_add_unless(v, 1, 0).


If a caller requires memory barrier semantics around an atomic_t
operation which does not return a value, a set of interfaces is
defined which accomplish this::

	void smp_mb__before_atomic(void);
	void smp_mb__after_atomic(void);

Preceding a non-value-returning read-modify-write atomic operation with
smp_mb__before_atomic() and following it with smp_mb__after_atomic()
provides the same full ordering that is provided by value-returning
read-modify-write atomic operations.

For example, smp_mb__before_atomic() can be used like so::

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation.  In the above example, it guarantees that the assignment of
"1" to obj->dead will be globally visible to other cpus before the
atomic counter decrement.

Without the explicit smp_mb__before_atomic() call, the
implementation could legally allow the atomic counter update to become
visible to other cpus before the "obj->dead = 1;" assignment.
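
Symmetrically, smp_mb__after_atomic() orders a non-value-returning atomic
operation before everything that follows it.  A minimal sketch, where
``obj->ready`` is a hypothetical flag used only for illustration::

	atomic_inc(&obj->ref_count);
	smp_mb__after_atomic();
	obj->ready = 1;

Here the reference count increment is guaranteed to be globally visible to
other cpus before the "obj->ready = 1;" assignment.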

A missing memory barrier in the cases where they are required by the
atomic_t implementation above can have disastrous results.  Here is
an example, which follows a pattern occurring frequently in the Linux
kernel.  It is the use of atomic counters to implement reference
counting, and it works such that once the counter falls to zero it can
be guaranteed that no other entity can be accessing the object::

	static void obj_list_add(struct obj *obj, struct list_head *head)
	{
		obj->active = 1;
		list_add(&obj->list, head);
	}

	static void obj_list_del(struct obj *obj)
	{
		list_del(&obj->list);
		obj->active = 0;
	}

	static void obj_destroy(struct obj *obj)
	{
		BUG_ON(obj->active);
		kfree(obj);
	}

	struct obj *obj_list_peek(struct list_head *head)
	{
		if (!list_empty(head)) {
			struct obj *obj;

			obj = list_entry(head->next, struct obj, list);
			atomic_inc(&obj->refcnt);
			return obj;
		}
		return NULL;
	}

	void obj_poke(void)
	{
		struct obj *obj;

		spin_lock(&global_list_lock);
		obj = obj_list_peek(&global_list);
		spin_unlock(&global_list_lock);

		if (obj) {
			obj->ops->poke(obj);
			if (atomic_dec_and_test(&obj->refcnt))
				obj_destroy(obj);
		}
	}

	void obj_timeout(struct obj *obj)
	{
		spin_lock(&global_list_lock);
		obj_list_del(obj);
		spin_unlock(&global_list_lock);

		if (atomic_dec_and_test(&obj->refcnt))
			obj_destroy(obj);
	}

.. note::

	This is a simplification of the ARP queue management in the generic
	neighbour discovery code of the networking subsystem.  Olaf Kirch
	found a bug wrt. memory barriers in kfree_skb() that exposed the
	atomic_t memory barrier requirements quite clearly.

Given the above scheme, it must be the case that the obj->active
update done by the obj list deletion be visible to other processors
before the atomic counter decrement is performed.

Otherwise, the counter could fall to zero, yet obj->active would still
be set, thus triggering the assertion in obj_destroy().  The error
sequence looks like this::

	cpu 0				cpu 1
	obj_poke()			obj_timeout()
	obj = obj_list_peek();
	... gains ref to obj, refcnt=2
					obj_list_del(obj);
					obj->active = 0 ...
					... visibility delayed ...
	atomic_dec_and_test()
	... refcnt drops to 1 ...
					atomic_dec_and_test()
					... refcount drops to 0 ...
						obj_destroy()
					BUG() triggers since obj->active
					still seen as one
	obj->active update visibility occurs

With the memory barrier semantics required of the atomic_t operations
which return values, the above sequence of memory visibility can never
happen.  Specifically, in the above case the atomic_dec_and_test()
counter decrement would not become globally visible until the
obj->active update does.

As a historical note, 32-bit Sparc used to only allow usage of
24-bits of its atomic_t type.  This was because it used 8 bits
as a spinlock for SMP safety.  Sparc32 lacked a "compare and swap"
type instruction.  However, 32-bit Sparc has since been moved over
to a "hash table of spinlocks" scheme that allows the full 32-bit
counter to be realized.  Essentially, an array of spinlocks is
indexed into based upon the address of the atomic_t being operated
on, and that lock protects the atomic operation.  Parisc uses the
same scheme.

Another note is that the atomic_t operations returning values are
extremely slow on an old 386.


Atomic Bitmask
==============

We will now cover the atomic bitmask operations.  You will find that
their SMP and memory barrier semantics are similar in shape and scope
to the atomic_t ops above.

Native atomic bit operations are defined to operate on objects aligned
to the size of an "unsigned long" C data type, and are at least of that
size.  The endianness of the bits within each "unsigned long" is the
native endianness of the cpu. ::

	void set_bit(unsigned long nr, volatile unsigned long *addr);
	void clear_bit(unsigned long nr, volatile unsigned long *addr);
	void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number
indicated by "nr" on the bit mask pointed to by "addr".

They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces. ::

	int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.

.. warning::

	It is incredibly important that the value be a boolean, i.e. "0" or "1".
	Do not try to be fancy and save a few instructions by declaring the
	above to return "long" and just returning something like "old_val &
	mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up is the thread_info
flag operations.  Routines such as test_and_set_ti_thread_flag() chop
the return value into an int.  There are other places where things
like this occur as well.

These routines, like the atomic_t counter operations returning values,
must provide explicit memory barrier semantics around their execution.
All memory operations before the atomic bit operation call must be
made visible globally before the atomic bit operation is made visible.
Likewise, the atomic bit operation must be visible globally before any
subsequent memory operation is made visible.  For example::

	obj->dead = 1;
	if (test_and_set_bit(0, &obj->flags))
		/* ... */;
	obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible.  Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

Finally there is the basic operation::

	int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);

Which returns a boolean indicating if bit "nr" is set in the bitmask
pointed to by "addr".

If explicit memory barriers are required around {set,clear}_bit() (which do
not return a value, and thus do not need to provide memory barrier
semantics), two interfaces are provided::

	void smp_mb__before_atomic(void);
	void smp_mb__after_atomic(void);

They are used as follows, and are akin to their atomic_t operation
brothers::

	/* All memory operations before this call will
	 * be globally visible before the clear_bit().
	 */
	smp_mb__before_atomic();
	clear_bit( ... );

	/* The clear_bit() will be visible before all
	 * subsequent memory operations.
	 */
	smp_mb__after_atomic();

There are two special bitops with lock barrier semantics (acquire/release,
same as spinlocks).  These operate in the same way as their non-_lock/unlock
postfixed variants, except that they provide acquire and release semantics,
respectively.  This means they can be used for bit_spin_trylock and
bit_spin_unlock type operations without specifying any more barriers. ::

	int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
	void clear_bit_unlock(unsigned long nr, unsigned long *addr);
	void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

The __clear_bit_unlock version is non-atomic, however it still implements
unlock barrier semantics.  This can be useful if the lock itself is protecting
the other bits in the word.

Finally, there are non-atomic versions of the bitmask operations
provided.  They are used in contexts where some other higher-level SMP
locking scheme is being used to protect the bitmask, and thus less
expensive non-atomic operations may be used in the implementation.
They have names similar to the above bitmask operation interfaces,
except that two underscores are prefixed to the interface name. ::

	void __set_bit(unsigned long nr, volatile unsigned long *addr);
	void __clear_bit(unsigned long nr, volatile unsigned long *addr);
	void __change_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

These non-atomic variants also do not require any special memory
barrier semantics.
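
For example (a hedged sketch; ``map``, ``map_lock``, and ``nr`` are
hypothetical), when every access to a bitmap already happens under a
spinlock, the non-atomic variants are sufficient and cheaper::

	spin_lock(&map_lock);		/* the lock protects every bit in map */
	__set_bit(nr, map);
	__clear_bit(nr + 1, map);
	spin_unlock(&map_lock);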

The routines xchg() and cmpxchg() must provide the same exact
memory-barrier semantics as the atomic and bit operations returning
values.

.. note::

	If someone wants to use xchg(), cmpxchg() and their variants,
	linux/atomic.h should be included rather than asm/cmpxchg.h, unless the
	code is in arch/* and can take care of itself.

Spinlocks and rwlocks have memory barrier expectations as well.
The rule to follow is simple:

1) When acquiring a lock, the implementation must make it globally
   visible before any subsequent memory operation.

2) When releasing a lock, the implementation must make it such that
   all previous memory operations are globally visible before the
   lock release.

Which finally brings us to _atomic_dec_and_lock().  There is an
architecture-neutral version implemented in lib/dec_and_lock.c,
but most platforms will wish to optimize this in assembler. ::

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);

Atomically decrement the given counter, and if it will drop to zero
atomically acquire the given spinlock and perform the decrement
of the counter to zero.  If it does not drop to zero, do nothing
with the spinlock.

It is actually pretty simple to get the memory barrier correct.
Simply satisfy the spinlock grab requirements, which is to make
sure the spinlock operation is globally visible before any
subsequent memory operation.

We can demonstrate this operation more clearly if we define
an abstract atomic operation::

	long cas(long *mem, long old, long new);

"cas" stands for "compare and swap".  It atomically:

1) Compares "old" with the value currently at "mem".
2) If they are equal, "new" is written to "mem".
3) Regardless, the current value at "mem" is returned.

As an example usage, here is what an atomic counter update
might look like::

	void example_atomic_inc(long *counter)
	{
		long old, new, ret;

		while (1) {
			old = *counter;
			new = old + 1;

			ret = cas(counter, old, new);
			if (ret == old)
				break;
		}
	}

Let's use cas() in order to build a pseudo-C atomic_dec_and_lock()::

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
	{
		long old, new, ret;
		int went_to_zero;

		went_to_zero = 0;
		while (1) {
			old = atomic_read(atomic);
			new = old - 1;
			if (new == 0) {
				went_to_zero = 1;
				spin_lock(lock);
			}
			ret = cas(atomic, old, new);
			if (ret == old)
				break;
			if (went_to_zero) {
				spin_unlock(lock);
				went_to_zero = 0;
			}
		}

		return went_to_zero;
	}

Now, as far as memory barriers go, as long as spin_lock()
strictly orders all subsequent memory operations (including
the cas()) with respect to itself, things will be fine.

Said another way, _atomic_dec_and_lock() must guarantee that
a counter dropping to zero is never made visible before the
spinlock is acquired.

.. note::

	Note that this also means that for the case where the counter is not
	dropping to zero, there are no memory ordering requirements.
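
As a closing, hedged usage sketch (obj_put() is a hypothetical helper built
on the reference-counting example earlier in this document), a caller would
typically arrange things so that only the final reference drop takes the
lock::

	void obj_put(struct obj *obj)
	{
		if (_atomic_dec_and_lock(&obj->refcnt, &global_list_lock)) {
			obj_list_del(obj);
			spin_unlock(&global_list_lock);
			obj_destroy(obj);
		}
	}

All other reference drops decrement the counter without ever touching the
spinlock.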