.. _kernel_hacking_lock:

===========================
Unreliable Guide To Locking
===========================

:Author: Rusty Russell

Introduction
============

Welcome to Rusty's Remarkably Unreliable Guide to Kernel Locking
issues. This document describes the locking systems in the Linux Kernel
in 2.6.

With the wide availability of HyperThreading, and preemption in the
Linux Kernel, everyone hacking on the kernel needs to know the
fundamentals of concurrency and locking for SMP.

The Problem With Concurrency
============================

(Skip this if you know what a Race Condition is).

In a normal program, you can increment a counter like so:

::

    very_important_count++;


This is what you would expect to happen:


.. table:: Expected Results

  +------------------------------------+------------------------------------+
  | Instance 1                         | Instance 2                         |
  +====================================+====================================+
  | read very_important_count (5)      |                                    |
  +------------------------------------+------------------------------------+
  | add 1 (6)                          |                                    |
  +------------------------------------+------------------------------------+
  | write very_important_count (6)     |                                    |
  +------------------------------------+------------------------------------+
  |                                    | read very_important_count (6)      |
  +------------------------------------+------------------------------------+
  |                                    | add 1 (7)                          |
  +------------------------------------+------------------------------------+
  |                                    | write very_important_count (7)     |
  +------------------------------------+------------------------------------+

This is what might happen:

.. table:: Possible Results

  +------------------------------------+------------------------------------+
  | Instance 1                         | Instance 2                         |
  +====================================+====================================+
  | read very_important_count (5)      |                                    |
  +------------------------------------+------------------------------------+
  |                                    | read very_important_count (5)      |
  +------------------------------------+------------------------------------+
  | add 1 (6)                          |                                    |
  +------------------------------------+------------------------------------+
  |                                    | add 1 (6)                          |
  +------------------------------------+------------------------------------+
  | write very_important_count (6)     |                                    |
  +------------------------------------+------------------------------------+
  |                                    | write very_important_count (6)     |
  +------------------------------------+------------------------------------+


Race Conditions and Critical Regions
------------------------------------

This overlap, where the result depends on the relative timing of
multiple tasks, is called a race condition. The piece of code containing
the concurrency issue is called a critical region. And especially since
Linux started running on SMP machines, race conditions became one of the
major issues in kernel design and implementation.

Preemption can have the same effect, even if there is only one CPU: by
preempting one task during the critical region, we have exactly the same
race condition. In this case the thread which preempts might run the
critical region itself.

The solution is to recognize when these simultaneous accesses occur, and
use locks to make sure that only one instance can enter the critical
region at any time. There are many friendly primitives in the Linux
kernel to help you do this. And then there are the unfriendly
primitives, but I'll pretend they don't exist.

Locking in the Linux Kernel
===========================

If I could give you one piece of advice: never sleep with anyone crazier
than yourself. But if I had to give you advice on locking: **keep it
simple**.

Be reluctant to introduce new locks.

Strangely enough, this last one is the exact reverse of my advice when
you **have** slept with someone crazier than yourself. And you should
think about getting a big dog.

Two Main Types of Kernel Locks: Spinlocks and Mutexes
-----------------------------------------------------

There are two main types of kernel locks.
The fundamental type is the
spinlock (``include/asm/spinlock.h``), which is a very simple
single-holder lock: if you can't get the spinlock, you keep trying
(spinning) until you can. Spinlocks are very small and fast, and can be
used anywhere.

The second type is a mutex (``include/linux/mutex.h``): it is like a
spinlock, but you may block holding a mutex. If you can't lock a mutex,
your task will suspend itself, and be woken up when the mutex is
released. This means the CPU can do something else while you are
waiting. There are many cases when you simply can't sleep (see
`What Functions Are Safe To Call From Interrupts? <#sleeping-things>`__),
and so have to use a spinlock instead.

Neither type of lock is recursive: see
`Deadlock: Simple and Advanced <#deadlock>`__.

Locks and Uniprocessor Kernels
------------------------------

For kernels compiled without ``CONFIG_SMP``, and without
``CONFIG_PREEMPT``, spinlocks do not exist at all. This is an excellent
design decision: when no-one else can run at the same time, there is no
reason to have a lock.

If the kernel is compiled without ``CONFIG_SMP``, but ``CONFIG_PREEMPT``
is set, then spinlocks simply disable preemption, which is sufficient to
prevent any races. For most purposes, we can think of preemption as
equivalent to SMP, and not worry about it separately.

You should always test your locking code with ``CONFIG_SMP`` and
``CONFIG_PREEMPT`` enabled, even if you don't have an SMP test box,
because it will still catch some kinds of locking bugs.

Mutexes still exist, because they are required for synchronization
between user contexts, as we will see below.

Locking Only In User Context
----------------------------

If you have a data structure which is only ever accessed from user
context, then you can use a simple mutex (``include/linux/mutex.h``) to
protect it. This is the most trivial case: you initialize the mutex.
Then you can call mutex_lock_interruptible() to grab the
mutex, and mutex_unlock() to release it. There is also a
mutex_lock(), which should be avoided, because it will
not return if a signal is received.

Example: ``net/netfilter/nf_sockopt.c`` allows registration of new
setsockopt() and getsockopt() calls, with
nf_register_sockopt().
Registration and de-registration
are only done on module load and unload (and boot time, where there is
no concurrency), and the list of registrations is only consulted for an
unknown setsockopt() or getsockopt() system
call. The ``nf_sockopt_mutex`` is perfect to protect this, especially
since the setsockopt and getsockopt calls may well sleep.

Locking Between User Context and Softirqs
-----------------------------------------

If a softirq shares data with user context, you have two problems.
Firstly, the current user context can be interrupted by a softirq, and
secondly, the critical region could be entered from another CPU. This is
where spin_lock_bh() (``include/linux/spinlock.h``) is
used. It disables softirqs on that CPU, then grabs the lock.
spin_unlock_bh() does the reverse. (The '_bh' suffix is
a historical reference to "Bottom Halves", the old name for software
interrupts. It should really be called spin_lock_softirq() in a
perfect world).

Note that you can also use spin_lock_irq() or
spin_lock_irqsave() here, which stop hardware interrupts
as well: see `Hard IRQ Context <#hard-irq-context>`__.

This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes local_bh_disable()
(``include/linux/interrupt.h``), which protects you from the softirq
being run.

Locking Between User Context and Tasklets
-----------------------------------------

This is exactly the same as above, because tasklets are actually run
from a softirq.

Locking Between User Context and Timers
---------------------------------------

This, too, is exactly the same as above, because timers are actually run
from a softirq. From a locking point of view, tasklets and timers are
identical.

Locking Between Tasklets/Timers
-------------------------------

Sometimes a tasklet or timer might want to share data with another
tasklet or timer.

The Same Tasklet/Timer
~~~~~~~~~~~~~~~~~~~~~~

Since a tasklet is never run on two CPUs at once, you don't need to
worry about your tasklet being reentrant (running twice at once), even
on SMP.

Different Tasklets/Timers
~~~~~~~~~~~~~~~~~~~~~~~~~

If another tasklet/timer wants to share data with your tasklet or
timer, you will both need to use spin_lock() and
spin_unlock() calls.
spin_lock_bh() is
unnecessary here, as you are already in a tasklet, and none will be run
on the same CPU.

Locking Between Softirqs
------------------------

Often a softirq might want to share data with itself or a tasklet/timer.

The Same Softirq
~~~~~~~~~~~~~~~~

The same softirq can run on the other CPUs: you can use a per-CPU array
(see `Per-CPU Data <#per-cpu-data>`__) for better performance. If you're
going so far as to use a softirq, you probably care about scalable
performance enough to justify the extra complexity.

You'll need to use spin_lock() and
spin_unlock() for shared data.

Different Softirqs
~~~~~~~~~~~~~~~~~~

You'll need to use spin_lock() and
spin_unlock() for shared data, whether the other user is
a timer, a tasklet, the same softirq or a different softirq: any of them
could be running on a different CPU.

Hard IRQ Context
================

Hardware interrupts usually communicate with a tasklet or softirq.
Frequently this involves putting work in a queue, which the softirq will
take out.

Locking Between Hard IRQ and Softirqs/Tasklets
----------------------------------------------

If a hardware irq handler shares data with a softirq, you have two
concerns. Firstly, the softirq processing can be interrupted by a
hardware interrupt, and secondly, the critical region could be entered
by a hardware interrupt on another CPU. This is where
spin_lock_irq() is used. It is defined to disable
interrupts on that CPU, then grab the lock.
spin_unlock_irq() does the reverse.

The irq handler does not need to use spin_lock_irq(), because
the softirq cannot run while the irq handler is running: it can use
spin_lock(), which is slightly faster. The only exception
would be if a different hardware irq handler uses the same lock:
spin_lock_irq() will stop that from interrupting us.

This works perfectly for UP as well: the spin lock vanishes, and this
macro simply becomes local_irq_disable()
(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH
being run.

spin_lock_irqsave() (``include/linux/spinlock.h``) is a
variant which saves whether interrupts were on or off in a flags word,
which is passed to spin_unlock_irqrestore().
This means
that the same code can be used inside a hard irq handler (where
interrupts are already off) and in softirqs (where the irq disabling is
required).

Note that softirqs (and hence tasklets and timers) are run on return
from hardware interrupts, so spin_lock_irq() also stops
these. In that sense, spin_lock_irqsave() is the most
general and powerful locking function.

Locking Between Two Hard IRQ Handlers
-------------------------------------

It is rare to have to share data between two IRQ handlers, but if you
do, spin_lock_irqsave() should be used: it is
architecture-specific whether all interrupts are disabled inside irq
handlers themselves.

Cheat Sheet For Locking
=======================

Pete Zaitcev gives the following summary:

- If you are in a process context (any syscall) and want to lock other
  processes out, use a mutex. You can take a mutex and sleep
  (``copy_from_user*()`` or ``kmalloc(x,GFP_KERNEL)``).

- Otherwise (== data can be touched in an interrupt), use
  spin_lock_irqsave() and
  spin_unlock_irqrestore().

- Avoid holding a spinlock for more than 5 lines of code and across any
  function call (except accessors like readb()).

Table of Minimum Requirements
-----------------------------

The following table lists the **minimum** locking requirements between
various contexts. In some cases, the same context can only be running on
one CPU at a time, so no locking is required for that context (eg. a
particular thread can only run on one CPU at a time, but if it needs to
share data with another thread, locking is required).

Remember the advice above: you can always use
spin_lock_irqsave(), which is a superset of all other
spinlock primitives.

============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
.              IRQ Handler A IRQ Handler B Softirq A Softirq B Tasklet A Tasklet B Timer A Timer B User Context A User Context B
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
IRQ Handler A  None
IRQ Handler B  SLIS          None
Softirq A      SLI           SLI           SL
Softirq B      SLI           SLI           SL        SL
Tasklet A      SLI           SLI           SL        SL        None
Tasklet B      SLI           SLI           SL        SL        SL        None
Timer A        SLI           SLI           SL        SL        SL        SL        None
Timer B        SLI           SLI           SL        SL        SL        SL        SL      None
User Context A SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    None
User Context B SLI           SLI           SLBH      SLBH      SLBH      SLBH      SLBH    SLBH    MLI            None
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============

Table: Table of Locking Requirements

+--------+----------------------------+
| SLIS   | spin_lock_irqsave          |
+--------+----------------------------+
| SLI    | spin_lock_irq              |
+--------+----------------------------+
| SL     | spin_lock                  |
+--------+----------------------------+
| SLBH   | spin_lock_bh               |
+--------+----------------------------+
| MLI    | mutex_lock_interruptible   |
+--------+----------------------------+

Table: Legend for Locking Requirements Table

The trylock Functions
=====================

There are functions that try to acquire a lock only once and immediately
return a value telling about success or failure to acquire the lock.
They can be used if you need no access to the data protected with the
lock when some other thread is holding the lock. You should acquire the
lock later if you then need access to the data protected with the lock.

spin_trylock() does not spin but returns non-zero if it
acquires the spinlock on the first try or 0 if not. This function can be
used in all contexts like spin_lock(): you must have
disabled the contexts that might interrupt you and acquire the spin
lock.

mutex_trylock() does not suspend your task but returns
non-zero if it could lock the mutex on the first try or 0 if not. This
function cannot be safely used in hardware or software interrupt
contexts despite not sleeping.

Common Examples
===============

Let's step through a simple example: a cache of number to name mappings.
The cache keeps a count of how often each of the objects is used, and
when it gets full, throws out the least used one.

All In User Context
-------------------

For our first example, we assume that all operations are in user context
(ie. from system calls), so we can sleep. This means we can use a mutex
to protect the cache and all the objects within it. Here's the code::

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/mutex.h>
    #include <asm/errno.h>

    struct object
    {
            struct list_head list;
            int id;
            char name[32];
            int popularity;
    };

    /* Protects the cache, cache_num, and the objects within it */
    static DEFINE_MUTEX(cache_lock);
    static LIST_HEAD(cache);
    static unsigned int cache_num = 0;
    #define MAX_CACHE_SIZE 10

    /* Must be holding cache_lock */
    static struct object *__cache_find(int id)
    {
            struct object *i;

            list_for_each_entry(i, &cache, list)
                    if (i->id == id) {
                            i->popularity++;
                            return i;
                    }
            return NULL;
    }

    /* Must be holding cache_lock */
    static void __cache_delete(struct object *obj)
    {
            BUG_ON(!obj);
            list_del(&obj->list);
            kfree(obj);
            cache_num--;
    }

    /* Must be holding cache_lock */
    static void __cache_add(struct object *obj)
    {
            list_add(&obj->list, &cache);
            if (++cache_num > MAX_CACHE_SIZE) {
                    struct object *i, *outcast = NULL;
                    list_for_each_entry(i, &cache, list) {
                            if (!outcast || i->popularity < outcast->popularity)
                                    outcast = i;
                    }
                    __cache_delete(outcast);
            }
    }

    int cache_add(int id, const char *name)
    {
            struct object *obj;

            if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                    return -ENOMEM;

            strscpy(obj->name, name, sizeof(obj->name));
            obj->id = id;
            obj->popularity = 0;

            mutex_lock(&cache_lock);
            __cache_add(obj);
            mutex_unlock(&cache_lock);
            return 0;
    }

    void cache_delete(int id)
    {
            mutex_lock(&cache_lock);
            __cache_delete(__cache_find(id));
            mutex_unlock(&cache_lock);
    }

    int cache_find(int id, char *name)
    {
            struct object *obj;
            int ret = -ENOENT;

            mutex_lock(&cache_lock);
            obj = __cache_find(id);
            if (obj) {
                    ret = 0;
                    strcpy(name, obj->name);
            }
            mutex_unlock(&cache_lock);
            return ret;
    }

Note that we always make sure we have the cache_lock when we add,
delete, or look up the cache: both the cache infrastructure itself and
the contents of the objects are protected by the lock. In this case it's
easy, since we copy the data for the user, and never let them access the
objects directly.

There is a slight (and common) optimization here: in
cache_add() we set up the fields of the object before
grabbing the lock. This is safe, as no-one else can access it until we
put it in cache.

Accessing From Interrupt Context
--------------------------------

Now consider the case where cache_find() can be called
from interrupt context: either a hardware interrupt or a softirq. An
example would be a timer which deletes objects from the cache.

The change is shown below, in standard patch format: the ``-`` are lines
which are taken away, and the ``+`` are lines which are added.

::

    --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
    +++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100
    @@ -12,7 +12,7 @@
             int popularity;
     };

    -static DEFINE_MUTEX(cache_lock);
    +static DEFINE_SPINLOCK(cache_lock);
     static LIST_HEAD(cache);
     static unsigned int cache_num = 0;
     #define MAX_CACHE_SIZE 10
    @@ -55,6 +55,7 @@
     int cache_add(int id, const char *name)
     {
             struct object *obj;
    +        unsigned long flags;

             if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                     return -ENOMEM;
    @@ -63,30 +64,33 @@
             obj->id = id;
             obj->popularity = 0;

    -        mutex_lock(&cache_lock);
    +        spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);
    -        mutex_unlock(&cache_lock);
    +        spin_unlock_irqrestore(&cache_lock, flags);
             return 0;
     }

     void cache_delete(int id)
     {
    -        mutex_lock(&cache_lock);
    +        unsigned long flags;
    +
    +        spin_lock_irqsave(&cache_lock, flags);
             __cache_delete(__cache_find(id));
    -        mutex_unlock(&cache_lock);
    +        spin_unlock_irqrestore(&cache_lock, flags);
     }

     int cache_find(int id, char *name)
     {
             struct object *obj;
             int ret = -ENOENT;
    +        unsigned long flags;

    -        mutex_lock(&cache_lock);
    +        spin_lock_irqsave(&cache_lock, flags);
             obj = __cache_find(id);
             if (obj) {
                     ret = 0;
                     strcpy(name, obj->name);
             }
    -        mutex_unlock(&cache_lock);
    +        spin_unlock_irqrestore(&cache_lock, flags);
             return ret;
     }

Note that the spin_lock_irqsave() will turn off
interrupts if they are on; otherwise (if we are already in an interrupt
handler) it does nothing. Hence these functions are safe to call from
any context.

Unfortunately, cache_add() calls kmalloc()
with the ``GFP_KERNEL`` flag, which is only legal in user context. I
have assumed that cache_add() is still only called in
user context, otherwise this should become a parameter to
cache_add().
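
If callers really did need to add objects from interrupt context, one
option is to let the caller choose the allocation flags. This is only a
sketch, not part of the original example: the extra ``gfp`` parameter is
hypothetical::

    /* Hypothetical variant: the caller passes the allocation flags. */
    int cache_add(int id, const char *name, gfp_t gfp)
    {
            struct object *obj;
            unsigned long flags;

            /* GFP_KERNEL from user context, GFP_ATOMIC from interrupt context */
            obj = kmalloc(sizeof(*obj), gfp);
            if (!obj)
                    return -ENOMEM;

            strscpy(obj->name, name, sizeof(obj->name));
            obj->id = id;
            obj->popularity = 0;

            spin_lock_irqsave(&cache_lock, flags);
            __cache_add(obj);
            spin_unlock_irqrestore(&cache_lock, flags);
            return 0;
    }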

Exposing Objects Outside This File
----------------------------------

If our objects contained more information, it might not be sufficient to
copy the information in and out: other parts of the code might want to
keep pointers to these objects, for example, rather than looking up the
id every time. This produces two problems.

The first problem is that we use the ``cache_lock`` to protect objects:
we'd need to make this non-static so the rest of the code can use it.
This makes locking trickier, as it is no longer all in one place.

The second problem is the lifetime problem: if another structure keeps a
pointer to an object, it presumably expects that pointer to remain
valid. Unfortunately, this is only guaranteed while you hold the lock,
otherwise someone might call cache_delete() and even
worse, add another object, re-using the same address.

As there is only one lock, you can't hold it forever: no-one else would
get any work done.

The solution to this problem is to use a reference count: everyone who
has a pointer to the object increases it when they first get the object,
and drops the reference count when they're finished with it. Whoever
drops it to zero knows it is unused, and can actually delete it.

Here is the code::

    --- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100
    +++ cache.c.refcnt      2003-12-09 14:33:05.000000000 +1100
    @@ -7,6 +7,7 @@
     struct object
     {
             struct list_head list;
    +        unsigned int refcnt;
             int id;
             char name[32];
             int popularity;
    @@ -17,6 +18,35 @@
     static unsigned int cache_num = 0;
     #define MAX_CACHE_SIZE 10

    +static void __object_put(struct object *obj)
    +{
    +        if (--obj->refcnt == 0)
    +                kfree(obj);
    +}
    +
    +static void __object_get(struct object *obj)
    +{
    +        obj->refcnt++;
    +}
    +
    +void object_put(struct object *obj)
    +{
    +        unsigned long flags;
    +
    +        spin_lock_irqsave(&cache_lock, flags);
    +        __object_put(obj);
    +        spin_unlock_irqrestore(&cache_lock, flags);
    +}
    +
    +void object_get(struct object *obj)
    +{
    +        unsigned long flags;
    +
    +        spin_lock_irqsave(&cache_lock, flags);
    +        __object_get(obj);
    +        spin_unlock_irqrestore(&cache_lock, flags);
    +}
    +
     /* Must be holding cache_lock */
     static struct object *__cache_find(int id)
     {
    @@ -35,6 +65,7 @@
     {
             BUG_ON(!obj);
             list_del(&obj->list);
    +        __object_put(obj);
             cache_num--;
     }

    @@ -63,6 +94,7 @@
             strscpy(obj->name, name, sizeof(obj->name));
             obj->id = id;
             obj->popularity = 0;
    +        obj->refcnt = 1; /* The cache holds a reference */

             spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);
    @@ -79,18 +111,15 @@
             spin_unlock_irqrestore(&cache_lock, flags);
     }

    -int cache_find(int id, char *name)
    +struct object *cache_find(int id)
     {
             struct object *obj;
    -        int ret = -ENOENT;
             unsigned long flags;

             spin_lock_irqsave(&cache_lock, flags);
             obj = __cache_find(id);
    -        if (obj) {
    -                ret = 0;
    -                strcpy(name, obj->name);
    -        }
    +        if (obj)
    +                __object_get(obj);
             spin_unlock_irqrestore(&cache_lock, flags);
    -        return ret;
    +        return obj;
     }

We encapsulate the reference counting in the standard 'get' and 'put'
functions. Now we can return the object itself from
cache_find() which has the advantage that the user can
now sleep holding the object (eg. to copy_to_user() the
name to userspace).

The other point to note is that I said a reference should be held for
every pointer to the object: thus the reference count is 1 when first
inserted into the cache. In some versions the framework does not hold a
reference count, but they are more complicated.

Using Atomic Operations For The Reference Count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In practice, :c:type:`atomic_t` would usually be used for refcnt. There are a
number of atomic operations defined in ``include/asm/atomic.h``: these
are guaranteed to be seen atomically from all CPUs in the system, so no
lock is required. In this case, it is simpler than using spinlocks,
although for anything non-trivial using spinlocks is clearer. The
atomic_inc() and atomic_dec_and_test()
are used instead of the standard increment and decrement operators, and
the lock is no longer used to protect the reference count itself.

::

    --- cache.c.refcnt      2003-12-09 15:00:35.000000000 +1100
    +++ cache.c.refcnt-atomic       2003-12-11 15:49:42.000000000 +1100
    @@ -7,7 +7,7 @@
     struct object
     {
             struct list_head list;
    -        unsigned int refcnt;
    +        atomic_t refcnt;
             int id;
             char name[32];
             int popularity;
    @@ -18,33 +18,15 @@
     static unsigned int cache_num = 0;
     #define MAX_CACHE_SIZE 10

    -static void __object_put(struct object *obj)
    -{
    -        if (--obj->refcnt == 0)
    -                kfree(obj);
    -}
    -
    -static void __object_get(struct object *obj)
    -{
    -        obj->refcnt++;
    -}
    -
     void object_put(struct object *obj)
     {
    -        unsigned long flags;
    -
    -        spin_lock_irqsave(&cache_lock, flags);
    -        __object_put(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        if (atomic_dec_and_test(&obj->refcnt))
    +                kfree(obj);
     }

     void object_get(struct object *obj)
     {
    -        unsigned long flags;
    -
    -        spin_lock_irqsave(&cache_lock, flags);
    -        __object_get(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        atomic_inc(&obj->refcnt);
     }

     /* Must be holding cache_lock */
    @@ -65,7 +47,7 @@
     {
             BUG_ON(!obj);
             list_del(&obj->list);
    -        __object_put(obj);
    +        object_put(obj);
             cache_num--;
     }

    @@ -94,7 +76,7 @@
             strscpy(obj->name, name, sizeof(obj->name));
             obj->id = id;
             obj->popularity = 0;
    -        obj->refcnt = 1; /* The cache holds a reference */
    +        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */

             spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);
    @@ -119,7 +101,7 @@
             spin_lock_irqsave(&cache_lock, flags);
             obj = __cache_find(id);
             if (obj)
    -                __object_get(obj);
    +                object_get(obj);
             spin_unlock_irqrestore(&cache_lock, flags);
             return obj;
     }

Protecting The Objects Themselves
---------------------------------

In these examples, we assumed that the objects (except the reference
counts) never change once they are
created. If we wanted to allow the
name to change, there are three possibilities:

- You can make ``cache_lock`` non-static, and tell people to grab that
  lock before changing the name in any object.

- You can provide a cache_obj_rename() which grabs this
  lock and changes the name for the caller, and tell everyone to use
  that function.

- You can make the ``cache_lock`` protect only the cache itself, and
  use another lock to protect the name.

Theoretically, you can make the locks as fine-grained as one lock for
every field, for every object. In practice, the most common variants
are:

- One lock which protects the infrastructure (the ``cache`` list in
  this example) and all the objects. This is what we have done so far.

- One lock which protects the infrastructure (including the list
  pointers inside the objects), and one lock inside the object which
  protects the rest of that object.

- Multiple locks to protect the infrastructure (eg. one lock per hash
  chain), possibly with a separate per-object lock.

Here is the "lock-per-object" implementation:

::

    --- cache.c.refcnt-atomic       2003-12-11 15:50:54.000000000 +1100
    +++ cache.c.perobjectlock       2003-12-11 17:15:03.000000000 +1100
    @@ -6,11 +6,17 @@

     struct object
     {
    +        /* These two protected by cache_lock. */
             struct list_head list;
    +        int popularity;
    +
             atomic_t refcnt;
    +
    +        /* Doesn't change once created. */
             int id;
    +
    +        spinlock_t lock; /* Protects the name */
             char name[32];
    -        int popularity;
     };

     static DEFINE_SPINLOCK(cache_lock);
    @@ -77,6 +84,7 @@
             obj->id = id;
             obj->popularity = 0;
             atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
    +        spin_lock_init(&obj->lock);

             spin_lock_irqsave(&cache_lock, flags);
             __cache_add(obj);

Note that I decided that the popularity count should be protected by the
``cache_lock`` rather than the per-object lock: this is because it (like
the :c:type:`struct list_head <list_head>` inside the object)
is logically part of the infrastructure. This way, I don't need to grab
the lock of every object in __cache_add() when seeking
the least popular.
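
With the per-object lock in place, the cache_obj_rename() helper
suggested earlier would grab the object's own lock rather than
``cache_lock``. Here is a minimal sketch (this function is not part of
the example code; the interrupt-safe variant is used because the cache
is also accessed from interrupt context)::

    /* Only obj->lock is needed: the name is not protected by cache_lock. */
    void cache_obj_rename(struct object *obj, const char *name)
    {
            unsigned long flags;

            spin_lock_irqsave(&obj->lock, flags);
            strscpy(obj->name, name, sizeof(obj->name));
            spin_unlock_irqrestore(&obj->lock, flags);
    }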

I also decided that the id member is unchangeable, so I don't need to
grab each object lock in __cache_find() to examine the
id: the object lock is only used by a caller who wants to read or write
the name field.

Note also that I added a comment describing what data was protected by
which locks. This is extremely important, as it describes the runtime
behavior of the code, and can be hard to gain from just reading. And as
Alan Cox says, “Lock data, not code”.

Common Problems
===============

Deadlock: Simple and Advanced
-----------------------------

There is a coding bug where a piece of code tries to grab a spinlock
twice: it will spin forever, waiting for the lock to be released
(spinlocks, rwlocks and mutexes are not recursive in Linux). This is
trivial to diagnose: not a
stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.

For a slightly more complex case, imagine you have a region shared by a
softirq and user context. If you use a spin_lock() call
to protect it, it is possible that the user context will be interrupted
by the softirq while it holds the lock, and the softirq will then spin
forever trying to get the same lock.

Both of these are called deadlock, and as shown above, it can occur even
with a single CPU (although not on UP compiles, since spinlocks vanish
on kernel compiles with ``CONFIG_SMP``\ =n. You'll still get data
corruption in the second example).

This complete lockup is easy to diagnose: on SMP boxes the watchdog
timer or compiling with ``DEBUG_SPINLOCK`` set
(``include/linux/spinlock.h``) will show this up immediately when it
happens.

A more complex problem is the so-called 'deadly embrace', involving two
or more locks. Say you have a hash table: each entry in the table is a
spinlock, and a chain of hashed objects. Inside a softirq handler, you
sometimes want to alter an object from one place in the hash to another:
you grab the spinlock of the old hash chain and the spinlock of the new
hash chain, and delete the object from the old one, and insert it in the
new one.

There are two problems here. First, if your code ever tries to move the
object to the same chain, it will deadlock with itself as it tries to
lock it twice.
Secondly, if the same softirq on another CPU is trying to
move another object in the reverse direction, the following could
happen:

+-----------------------+-----------------------+
| CPU 1                 | CPU 2                 |
+=======================+=======================+
| Grab lock A -> OK     | Grab lock B -> OK     |
+-----------------------+-----------------------+
| Grab lock B -> spin   | Grab lock A -> spin   |
+-----------------------+-----------------------+

Table: Consequences

The two CPUs will spin forever, waiting for the other to give up their
lock. It will look, smell, and feel like a crash.

Preventing Deadlock
-------------------

Textbooks will tell you that if you always lock in the same order, you
will never get this kind of deadlock. Practice will tell you that this
approach doesn't scale: when I create a new lock, I don't understand
enough of the kernel to figure out where in the 5000 lock hierarchy it
will fit.

The best locks are encapsulated: they never get exposed in headers, and
are never held around calls to non-trivial functions outside the same
file. You can read through this code and see that it will never
deadlock, because it never tries to grab another lock while it has that
one. People using your code don't even need to know you are using a
lock.

A classic problem here is when you provide callbacks or hooks: if you
call these with the lock held, you risk simple deadlock, or a deadly
embrace (who knows what the callback will do?). Remember, the other
programmers are out to get you, so don't do this.

Overzealous Prevention Of Deadlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deadlocks are problematic, but not as bad as data corruption. Code which
grabs a read lock, searches a list, fails to find what it wants, drops
the read lock, grabs a write lock and inserts the object has a race
condition.

If you don't see why, please stay the fuck away from my code.

Racing Timers: A Kernel Pastime
-------------------------------

Timers can produce their own special problems with races. Consider a
collection of objects (list, hash, etc) where each object has a timer
which is due to destroy it.
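
For concreteness, the snippets below assume a singly linked collection
along these lines; the ``struct foo`` name and its fields are purely
illustrative::

    struct foo {
            struct foo *next;           /* singly linked list of objects */
            struct timer_list timer;    /* fires to destroy this object */
            /* ... payload ... */
    };

    static struct foo *list;                /* head of the collection */
    static DEFINE_SPINLOCK(list_lock);      /* protects the list */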

If you want to destroy the entire collection (say on module removal),
you might do the following::

    /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
       HUNGARIAN NOTATION */
    spin_lock_bh(&list_lock);

    while (list) {
            struct foo *next = list->next;
            del_timer(&list->timer);
            kfree(list);
            list = next;
    }

    spin_unlock_bh(&list_lock);


Sooner or later, this will crash on SMP, because a timer can have just
gone off before the spin_lock_bh(), and it will only get
the lock after we spin_unlock_bh(), and then try to free
the element (which has already been freed!).

This can be avoided by checking the result of
del_timer(): if it returns 1, the timer has been deleted.
If 0, it means (in this case) that it is currently running, so we can
do::

    retry:
            spin_lock_bh(&list_lock);

            while (list) {
                    struct foo *next = list->next;
                    if (!del_timer(&list->timer)) {
                            /* Give timer a chance to delete this */
                            spin_unlock_bh(&list_lock);
                            goto retry;
                    }
                    kfree(list);
                    list = next;
            }

            spin_unlock_bh(&list_lock);


Another common problem is deleting timers which restart themselves (by
calling add_timer() at the end of their timer function).
Because this is a fairly common case which is prone to races, you should
use del_timer_sync() (``include/linux/timer.h``) to
handle this case. It returns the number of times the timer had to be
deleted before we finally stopped it from adding itself back in.

Locking Speed
=============

There are three main things to worry about when considering speed of
some code which does locking. First is concurrency: how many things are
going to be waiting while someone else is holding a lock. Second is the
time taken to actually acquire and release an uncontended lock. Third is
using fewer, or smarter locks. I'm assuming that the lock is used fairly
often: otherwise, you wouldn't be concerned about efficiency.

Concurrency depends on how long the lock is usually held: you should
hold the lock for as long as needed, but no longer.
In the cache
example, we always create the object without the lock held, and then
grab the lock only when we are ready to insert it in the list.

Acquisition times depend on how much damage the lock operations do to
the pipeline (pipeline stalls) and how likely it is that this CPU was
the last one to grab the lock (ie. is the lock cache-hot for this CPU):
on a machine with more CPUs, this likelihood drops fast. Consider a
700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomic
increment takes about 58ns, a lock which is cache-hot on this CPU takes
160ns, and a cacheline transfer from another CPU takes an additional 170
to 360ns. (These figures are from Paul McKenney's `Linux Journal RCU
article <http://www.linuxjournal.com/article.php?sid=6993>`__).

These two aims conflict: holding a lock for a short time might be done
by splitting locks into parts (such as in our final per-object-lock
example), but this increases the number of lock acquisitions, and the
results are often slower than having a single lock. This is another
reason to advocate locking simplicity.

The third concern is addressed below: there are some methods to reduce
the amount of locking which needs to be done.

Read/Write Lock Variants
------------------------

Both spinlocks and mutexes have read/write variants: ``rwlock_t`` and
:c:type:`struct rw_semaphore <rw_semaphore>`. These divide
users into two classes: the readers and the writers. If you are only
reading the data, you can get a read lock, but to write to the data you
need the write lock. Many people can hold a read lock, but a writer must
be the sole holder.

If your code divides neatly along reader/writer lines (as our cache code
does), and the lock is held by readers for significant lengths of time,
using these locks can help. They are slightly slower than the normal
locks though, so in practice ``rwlock_t`` is not usually worthwhile.

Avoiding Locks: Read Copy Update
--------------------------------

There is a special method of read/write locking called Read Copy Update.
Using RCU, the readers can avoid taking a lock altogether: as we expect
our cache to be read more often than updated (otherwise the cache is a
waste of time), it is a candidate for this optimization.

How do we get rid of read locks? Getting rid of read locks means that
writers may be changing the list underneath the readers.
Avoiding Locks: Read Copy Update
--------------------------------

There is a special method of read/write locking called Read Copy Update.
Using RCU, the readers can avoid taking a lock altogether: as we expect
our cache to be read more often than updated (otherwise the cache is a
waste of time), it is a candidate for this optimization.

How do we get rid of read locks? Getting rid of read locks means that
writers may be changing the list underneath the readers. That is
actually quite simple: we can read a linked list while an element is
being added if the writer adds the element very carefully. For example,
adding ``new`` to a singly-linked list called ``list``::

    new->next = list->next;
    wmb();
    list->next = new;


The wmb() is a write memory barrier. It ensures that the first operation
(setting the new element's ``next`` pointer) is complete and will be
seen by all CPUs before the second operation (putting the new element
into the list). This is important, since modern compilers and modern
CPUs can both reorder instructions unless told otherwise: we want a
reader to either not see the new element at all, or see the new element
with the ``next`` pointer correctly pointing at the rest of the list.

Fortunately, there is a function to do this for standard
:c:type:`struct list_head <list_head>` lists:
list_add_rcu() (``include/linux/list.h``).

Removing an element from the list is even simpler: we replace the
pointer to the old element with a pointer to its successor, and readers
will either see it, or skip over it.

::

    list->next = old->next;


There is list_del_rcu() (``include/linux/list.h``) which
does this (the normal version poisons the old object, which we don't
want).

The reader must also be careful: some CPUs can look through the ``next``
pointer to start reading the contents of the next element early, but
don't realize that the pre-fetched contents are wrong when the ``next``
pointer changes underneath them. Once again, there is a
list_for_each_entry_rcu() (``include/linux/list.h``)
to help you. Of course, writers can just use
list_for_each_entry(), since there cannot be two
simultaneous writers.

Our final dilemma is this: when can we actually destroy the removed
element? Remember, a reader might be stepping through this element in
the list right now: if we free this element and the ``next`` pointer
changes, the reader will jump off into garbage and crash. We need to
wait until we know that all the readers who were traversing the list
when we deleted the element are finished. We use
call_rcu() to register a callback which will actually
destroy the object once all pre-existing readers are finished.
Alternatively, synchronize_rcu() may be used to block
until all pre-existing readers are finished.

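Putting those pieces together, a minimal sketch of an RCU-protected list
might look like this; ``struct foo``, ``foo_list``, ``foo_lock``,
``foo_get_data()`` and ``foo_del()`` are made-up names. Note that the
writer still takes a spinlock, but only to exclude other writers::

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct foo {
            struct list_head list;
            int id;
            int data;
    };

    static LIST_HEAD(foo_list);
    static DEFINE_SPINLOCK(foo_lock);

    static int foo_get_data(int id)
    {
            struct foo *f;
            int data = -1;

            rcu_read_lock();                        /* no lock, just no sleeping */
            list_for_each_entry_rcu(f, &foo_list, list) {
                    if (f->id == id) {
                            data = f->data;         /* use it before unlocking */
                            break;
                    }
            }
            rcu_read_unlock();
            return data;
    }

    static void foo_del(struct foo *victim)
    {
            spin_lock(&foo_lock);                   /* writers exclude each other */
            list_del_rcu(&victim->list);            /* readers may still see it */
            spin_unlock(&foo_lock);

            synchronize_rcu();                      /* wait for pre-existing readers */
            kfree(victim);                          /* now nobody can be looking */
    }
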
But how does Read Copy Update know when the readers are finished? The
method is this: firstly, the readers always traverse the list inside
rcu_read_lock()/rcu_read_unlock() pairs:
these simply disable preemption so the reader won't go to sleep while
reading the list.

RCU then waits until every other CPU has slept at least once: since
readers cannot sleep, we know that any readers which were traversing the
list during the deletion are finished, and the callback is triggered.
The real Read Copy Update code is a little more optimized than this, but
this is the fundamental idea.

::

    --- cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100
    +++ cache.c.rcupdate        2003-12-11 17:55:14.000000000 +1100
    @@ -1,15 +1,18 @@
     #include <linux/list.h>
     #include <linux/slab.h>
     #include <linux/string.h>
    +#include <linux/rcupdate.h>
     #include <linux/mutex.h>
     #include <asm/errno.h>

     struct object
     {
    -        /* These two protected by cache_lock. */
    +        /* This is protected by RCU */
             struct list_head list;
             int popularity;

    +        struct rcu_head rcu;
    +
             atomic_t refcnt;

             /* Doesn't change once created. */
    @@ -40,7 +43,7 @@
     {
             struct object *i;

    -        list_for_each_entry(i, &cache, list) {
    +        list_for_each_entry_rcu(i, &cache, list) {
                     if (i->id == id) {
                             i->popularity++;
                             return i;
    @@ -49,19 +52,25 @@
             return NULL;
     }

    +/* Final discard done once we know no readers are looking. */
    +static void cache_delete_rcu(void *arg)
    +{
    +        object_put(arg);
    +}
    +
     /* Must be holding cache_lock */
     static void __cache_delete(struct object *obj)
     {
             BUG_ON(!obj);
    -        list_del(&obj->list);
    -        object_put(obj);
    +        list_del_rcu(&obj->list);
             cache_num--;
    +        call_rcu(&obj->rcu, cache_delete_rcu);
     }

     /* Must be holding cache_lock */
     static void __cache_add(struct object *obj)
     {
    -        list_add(&obj->list, &cache);
    +        list_add_rcu(&obj->list, &cache);
             if (++cache_num > MAX_CACHE_SIZE) {
                     struct object *i, *outcast = NULL;
                     list_for_each_entry(i, &cache, list) {
    @@ -104,12 +114,11 @@
     struct object *cache_find(int id)
     {
             struct object *obj;
    -        unsigned long flags;

    -        spin_lock_irqsave(&cache_lock, flags);
    +        rcu_read_lock();
             obj = __cache_find(id);
             if (obj)
                     object_get(obj);
    -        spin_unlock_irqrestore(&cache_lock, flags);
    +        rcu_read_unlock();
             return obj;
     }

Note that the reader will alter the popularity member in
__cache_find(), and now it doesn't hold a lock. One
solution would be to make it an ``atomic_t``, but for this usage, we
don't really care about races: an approximate result is good enough, so
I didn't change it.

The result is that cache_find() requires no
synchronization with any other functions, so is almost as fast on SMP as
it would be on UP.

There is a further optimization possible here: remember our original
cache code, where there were no reference counts and the caller simply
held the lock whenever using the object? This is still possible: if you
hold the lock, no one can delete the object, so you don't need to get
and put the reference count.

Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
cache_find() and object_put() does not
need to actually get and put the reference count: we could expose
__cache_find() by making it non-static, and such
callers could simply call that.

The benefit here is that the reference count is not written to: the
object is not altered in any way, which is much faster on SMP machines
due to caching.

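As a hedged illustration only: if __cache_find() were made non-static,
such a caller might look like this (``cache_peek_popularity()`` is a
hypothetical name, living in the same file as the cache example)::

    int cache_peek_popularity(int id)
    {
            struct object *obj;
            int popularity = -1;

            rcu_read_lock();                /* the RCU 'read lock': just no preemption */
            obj = __cache_find(id);
            if (obj)
                    popularity = obj->popularity;   /* read only: no get/put */
            rcu_read_unlock();
            return popularity;
    }
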
Per-CPU Data
------------

Another technique for avoiding locking which is used fairly widely is to
duplicate information for each CPU. For example, if you wanted to keep a
count of a common condition, you could use a spin lock and a single
counter. Nice and simple.

If that was too slow (it's usually not, but if you've got a really big
machine to test on and can show that it is), you could instead use a
counter for each CPU, then none of them need an exclusive lock. See
DEFINE_PER_CPU(), get_cpu_var() and
put_cpu_var() (``include/linux/percpu.h``).

Of particular use for simple per-cpu counters is the ``local_t`` type,
and the cpu_local_inc() and related functions, which are
more efficient than simple code on some architectures
(``include/asm/local.h``).

Note that there is no simple, reliable way of getting an exact value of
such a counter without introducing more locks. This is not a problem
for some uses.

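A minimal sketch of such a per-CPU counter, assuming a hypothetical
event counter (``foo_events`` and both helpers are made-up names; the
total is only approximate while other CPUs keep counting)::

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(unsigned long, foo_events);

    static void foo_count_event(void)
    {
            get_cpu_var(foo_events)++;      /* preemption disabled while we touch our copy */
            put_cpu_var(foo_events);
    }

    static unsigned long foo_total_events(void)
    {
            unsigned long sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)      /* approximate: others may still be counting */
                    sum += per_cpu(foo_events, cpu);
            return sum;
    }
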
Data Which Is Mostly Used By An IRQ Handler
--------------------------------------------

If data is always accessed from within the same IRQ handler, you don't
need a lock at all: the kernel already guarantees that the irq handler
will not run simultaneously on multiple CPUs.

Manfred Spraul points out that you can still do this, even if the data
is very occasionally accessed in user context or softirqs/tasklets. The
irq handler doesn't use a lock, and all other accesses are done as
follows::

    spin_lock(&lock);
    disable_irq(irq);
    ...
    enable_irq(irq);
    spin_unlock(&lock);

The disable_irq() prevents the irq handler from running
(and waits for it to finish if it's currently running on other CPUs).
The spinlock prevents any other accesses happening at the same time.
Naturally, this is slower than just a spin_lock_irq()
call, so it only makes sense if this type of access happens extremely
rarely.

What Functions Are Safe To Call From Interrupts?
==================================================

Many functions in the kernel sleep (ie. call schedule()) directly or
indirectly: you can never call them while holding a spinlock, or with
preemption disabled. This also means you need to be in user context:
calling them from an interrupt is illegal.

Some Functions Which Sleep
--------------------------

The most common ones are listed below, but you usually have to read the
code to find out if other calls are safe. If everyone else who calls it
can sleep, you probably need to be able to sleep, too. In particular,
registration and deregistration functions usually expect to be called
from user context, and can sleep. A sketch of a typical user-context
caller follows the list.

- Accesses to userspace:

  - copy_from_user()

  - copy_to_user()

  - get_user()

  - put_user()

- kmalloc(GFP_KERNEL)

- mutex_lock_interruptible() and
  mutex_lock()

  There is a mutex_trylock() which does not sleep.
  Still, it must not be used inside interrupt context since its
  implementation is not safe for that. mutex_unlock()
  will also never sleep. It cannot be used in interrupt context either
  since a mutex must be released by the same task that acquired it.

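As a minimal sketch of the kind of caller meant here, assume a
hypothetical character-device write handler (``foo_write``,
``foo_mutex`` and ``foo_buf`` are made-up names): it runs in user
context, so it is allowed to sleep both in the mutex and in the
userspace copy::

    #include <linux/fs.h>
    #include <linux/mutex.h>
    #include <linux/uaccess.h>
    #include <linux/errno.h>

    static DEFINE_MUTEX(foo_mutex);
    static char foo_buf[64];

    static ssize_t foo_write(struct file *file, const char __user *ubuf,
                             size_t len, loff_t *off)
    {
            if (len > sizeof(foo_buf))
                    len = sizeof(foo_buf);

            if (mutex_lock_interruptible(&foo_mutex))       /* may sleep */
                    return -ERESTARTSYS;
            if (copy_from_user(foo_buf, ubuf, len)) {       /* may sleep (page fault) */
                    mutex_unlock(&foo_mutex);
                    return -EFAULT;
            }
            mutex_unlock(&foo_mutex);
            return len;
    }
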
Some Functions Which Don't Sleep
--------------------------------

Some functions are safe to call from any context, or holding almost any
lock.

- printk()

- kfree()

- add_timer() and del_timer()

Mutex API reference
===================

.. kernel-doc:: include/linux/mutex.h
   :internal:

.. kernel-doc:: kernel/locking/mutex.c
   :export:

Futex API reference
===================

.. kernel-doc:: kernel/futex.c
   :internal:

Further reading
===============

- ``Documentation/locking/spinlocks.rst``: Linus Torvalds' spinlocking
  tutorial in the kernel sources.

- Unix Systems for Modern Architectures: Symmetric Multiprocessing and
  Caching for Kernel Programmers:

  Curt Schimmel's very good introduction to kernel level locking (not
  written for Linux, but nearly everything applies). The book is
  expensive, but really worth every penny to understand SMP locking.
  [ISBN: 0201633388]

Thanks
======

Thanks to Telsa Gwynne for DocBooking, neatening and adding style.

Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras,
Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev,
James Morris, Robert Love, Paul McKenney and John Ashby for
proofreading, correcting, flaming and commenting.

Thanks to the cabal for having no influence on this document.

Glossary
========

preemption
  Prior to 2.5, or when ``CONFIG_PREEMPT`` is unset, processes in user
  context inside the kernel would not preempt each other (ie. you had that
  CPU until you gave it up, except for interrupts). With the addition of
  ``CONFIG_PREEMPT`` in 2.5.4, this changed: when in user context, higher
  priority tasks can "cut in": spinlocks were changed to disable
  preemption, even on UP.

bh
  Bottom Half: for historical reasons, functions with '_bh' in them often
  now refer to any software interrupt, e.g. spin_lock_bh()
  blocks any software interrupt on the current CPU. Bottom halves are
  deprecated, and will eventually be replaced by tasklets. Only one bottom
  half will be running at any time.

Hardware Interrupt / Hardware IRQ
  Hardware interrupt request. in_irq() returns true in a
  hardware interrupt handler.

Interrupt Context
  Not user context: processing a hardware irq or software irq. Indicated
  by the in_interrupt() macro returning true.

SMP
  Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.
  (``CONFIG_SMP=y``).

Software Interrupt / softirq
  Software interrupt handler. in_irq() returns false;
  in_softirq() returns true. Tasklets and softirqs both
  fall into the category of 'software interrupts'.

  Strictly speaking a softirq is one of up to 32 enumerated software
  interrupts which can run on multiple CPUs at once. Sometimes used to
  refer to tasklets as well (ie. all software interrupts).

tasklet
  A dynamically-registrable software interrupt, which is guaranteed to
  only run on one CPU at a time.

timer
  A dynamically-registrable software interrupt, which is run at (or close
  to) a given time. When running, it is just like a tasklet (in fact, they
  are called from the ``TIMER_SOFTIRQ``).

UP
  Uni-Processor: Non-SMP. (``CONFIG_SMP=n``).

User Context
  The kernel executing on behalf of a particular process (ie. a system
  call or trap) or kernel thread. (You can tell which process with the
  ``current`` macro.) Not to be confused with userspace. Can be
  interrupted by software or hardware interrupts.

Userspace
  A process executing its own code outside the kernel.