==============================
RT-mutex implementation design
==============================

Copyright (c) 2006 Steven Rostedt

Licensed under the GNU Free Documentation License, Version 1.2


This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/locking/rt-mutex.rst. This document does explain problems
that can happen without this code, but only as background for understanding
what the code actually does.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as the reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run. This happens for several reasons, and
most of the time it can't be helped. Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource. This is a priority inversion. What we want to prevent
is something called unbounded priority inversion. That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between. A tries to grab a lock
that C owns and must wait, letting C run to release the lock. But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C,
but by doing so, it is in fact preempting A, which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock. This is called unbounded priority
inversion.

Here's a little ASCII art to show the problem::

     grab lock L1 (owned by C)
       |
  A ---+
          C preempted by B
            |
  C    +----+

  B         +-------->
                  B now keeps A from running.
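
To make the scenario concrete, here is a minimal userspace sketch (not
kernel code) of the A/B/C situation, using POSIX threads, SCHED_FIFO
priorities, and a plain non-PI pthread mutex. The thread names, priority
values, and the sleep-based ordering are invented for this illustration;
it assumes the privileges needed for SCHED_FIFO and that all threads share
a single CPU::

  #include <pthread.h>
  #include <sched.h>
  #include <unistd.h>

  static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;

  static void *task_c(void *arg)          /* lowest priority */
  {
          (void)arg;
          pthread_mutex_lock(&L1);
          sleep(1);                       /* preempted here by B while holding L1 */
          pthread_mutex_unlock(&L1);
          return NULL;
  }

  static void *task_b(void *arg)          /* medium priority */
  {
          (void)arg;
          for (;;)                        /* CPU hog: starves C, and thus A */
                  ;
  }

  static void *task_a(void *arg)          /* highest priority */
  {
          (void)arg;
          pthread_mutex_lock(&L1);        /* blocks for an unbounded time */
          pthread_mutex_unlock(&L1);
          return NULL;
  }

  static pthread_t spawn(void *(*fn)(void *), int prio)
  {
          pthread_attr_t attr;
          struct sched_param sp = { .sched_priority = prio };
          pthread_t t;

          pthread_attr_init(&attr);
          pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
          pthread_attr_setschedparam(&attr, &sp);
          pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
          pthread_create(&t, &attr, fn, NULL);
          return t;
  }

  int main(void)
  {
          pthread_t a, c;

          c = spawn(task_c, 10);          /* C grabs L1 first */
          sleep(1);
          a = spawn(task_a, 30);          /* A blocks on L1 */
          spawn(task_b, 20);              /* B now keeps C from releasing L1 */

          pthread_join(a, NULL);          /* may never return */
          pthread_join(c, NULL);
          return 0;
  }

Requesting a PI-aware mutex instead (pthread_mutexattr_setprotocol() with
PTHREAD_PRIO_INHERIT, which is implemented with PI futexes on top of the
rt-mutex code described here) is what lets C inherit A's priority and
preempt B.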


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but other ways are out of scope
for this document. Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process. To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A. So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A. As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.

Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain
 - The PI chain is an ordered series of locks and processes that cause
   processes to inherit priorities from a previous process that is
   blocked on one of its locks. This is described in more detail
   later in this document.

mutex
 - In this document, to differentiate between the locks that implement
   PI and the spin locks that are used in the PI code, the PI locks
   will from now on be called mutexes.

lock
 - In this document from now on, I will use the term lock when
   referring to spin locks that are used to protect parts of the PI
   algorithm. These locks disable preemption for UP (when
   CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
   entering critical sections simultaneously.

spin lock
 - Same as lock above.

waiter
 - A waiter is a struct that is stored on the stack of a blocked
   process. Since the scope of the waiter is within the code for
   a process being blocked on the mutex, it is fine to allocate
   the waiter on the process's stack (local variable). This
   structure holds a pointer to the task, as well as the mutex that
   the task is blocked on. It also has rbtree node structures to
   place the task in the waiters rbtree of a mutex as well as the
   pi_waiters rbtree of a mutex owner task (described below).

   waiter is sometimes used in reference to the task that is waiting
   on a mutex. This is the same as waiter->task.

waiters
 - A list of processes that are blocked on a mutex.

top waiter
 - The highest priority process waiting on a specific mutex.

top pi waiter
 - The highest priority process waiting on one of the mutexes
   that a specific process owns.

Note:
	task and process are used interchangeably in this document, mostly to
	differentiate between two processes that are being described together.


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place. Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example::

   Processes:  A, B, C, D, E
   Mutexes:  L1, L2, L3, L4

   A owns: L1
           B blocked on L1
           B owns L2
                  C blocked on L2
                  C owns L3
                         D blocked on L3
                         D owns L4
                                E blocked on L4

The chain would be::

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be::

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains::

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes. If we add another process G that is
blocked on mutex L2::

   G->L2->B->L1->A

And once again, to show how this can grow I will show the merging chains
again::

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
   G---------------+     +->B->L1->A
                         |
                F->L5----+

If process G has the highest priority in the chain, then all the tasks up
the chain (A and B in this example) must have their priorities increased
to that of G.
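
To see the chain structure in code, here is a small standalone C model.
It is purely illustrative (the struct and function names are invented, and
priorities are plain integers where a higher number means higher priority,
unlike the kernel's prio field); it walks a chain such as G->L2->B->L1->A
and boosts each owner to the highest priority seen so far::

  #include <stdio.h>

  struct task;

  struct mutex_m {
          struct task *owner;
  };

  struct task {
          const char     *name;
          int             prio;        /* higher number == higher priority here */
          struct mutex_m *blocked_on;  /* NULL if runnable */
  };

  /*
   * Walk from a blocked task toward the owners, boosting each owner to
   * the highest priority seen so far.  A task can block on at most one
   * mutex, so the walk never diverges.
   */
  static void boost_chain(struct task *t)
  {
          int top_prio = t->prio;

          while (t->blocked_on) {
                  struct task *owner = t->blocked_on->owner;

                  if (owner->prio < top_prio) {
                          printf("boosting %s: %d -> %d\n",
                                 owner->name, owner->prio, top_prio);
                          owner->prio = top_prio;
                  } else {
                          top_prio = owner->prio;
                  }
                  t = owner;
          }
  }

  int main(void)
  {
          struct mutex_m L1, L2;
          struct task A = { "A", 10, NULL };  /* owns L1 */
          struct task B = { "B", 30, &L1  };  /* owns L2, blocked on L1 */
          struct task G = { "G", 50, &L2  };  /* blocked on L2 */

          L1.owner = &A;
          L2.owner = &B;

          boost_chain(&G);    /* boosts B and A up to G's priority (50) */
          return 0;
  }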


Mutex Waiters Tree
------------------

Every mutex keeps track of all the waiters that are blocked on itself. The
mutex has an rbtree to store these waiters by priority. This tree is protected
by a spin lock that is located in the struct of the mutex. This lock is called
wait_lock.


Task PI Tree
------------

To keep track of the PI chains, each process has its own PI rbtree. This is
a tree of all top waiters of the mutexes that are owned by the process.
Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this tree.

This tree is stored in the task structure of a process as an rbtree called
pi_waiters. It is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined. But it is very complex to figure out, since it depends on all
the nesting of mutexes. Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way::

  void func1(void)
  {
          mutex_lock(L1);

          /* do anything */

          mutex_unlock(L1);
  }

  void func2(void)
  {
          mutex_lock(L1);
          mutex_lock(L2);

          /* do something */

          mutex_unlock(L2);
          mutex_unlock(L1);
  }

  void func3(void)
  {
          mutex_lock(L2);
          mutex_lock(L3);

          /* do something else */

          mutex_unlock(L3);
          mutex_unlock(L2);
  }

  void func4(void)
  {
          mutex_lock(L3);

          /* do something again */

          mutex_unlock(L3);
  }

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D which run functions func1, func2, func3 and func4
respectively, and such that D runs first and A last. With D being preempted
in func4 in the "do something again" area, we have a locking that follows::

  D owns L3
         C blocked on L3
         C owns L2
                B blocked on L2
                B owns L1
                       A blocked on L1

  And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two. So, although the locking depth is defined at compile time,
it still is very difficult to find all the possibilities of that depth.

Now since mutexes can be defined by user-land applications, we don't want a DoS
type of application that nests large amounts of mutexes to create a large
PI chain, and have the code holding spin locks while looking at a large
amount of data. So to prevent this, the implementation not only implements
a maximum lock depth, but also only holds at most two different locks at a
time, as it walks the PI chain. More about this below.


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL.
Since all architectures
have the task structure on at least a two byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the least
significant bit to be used as a flag. Bit 0 is used as the "Has Waiters"
flag. It's set whenever there are waiters on a mutex.

See Documentation/locking/rt-mutex.rst for further details.

cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange). This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically::

  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  {
          unsigned long T = *A;

          if (*A == *B)
                  *A = *C;

          return T;
  }
  #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)

This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be. You know if it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time. But if CMPXCHG is supported, then this will
help out extremely to keep the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it. This will also be explained
later in this document.


Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority. With the help of the pi_waiters of a
process, it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio
and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.

rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task.
Since the pi_waiters of a task holds, ordered by priority, the top waiters
of all the mutexes that the task owns, we simply need to compare the top
pi waiter to the task's own normal/deadline priority and take the higher one.
Then rt_mutex_setprio is called to adjust the priority of the task to the
new priority. rt_mutex_setprio itself is defined in kernel/sched/core.c
and implements the actual change in priority.

Note:
	For the "prio" field in task_struct, the lower the number, the
	higher the priority. A "prio" of 5 is of higher priority than a
	"prio" of 10.
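
In pseudocode, the decision rt_mutex_adjust_prio makes boils down to the
following standalone sketch. The types and fields here are invented for
illustration (the real code works on task_struct, handles deadline tasks,
and holds the proper locks); it only shows the compare-and-set logic::

  #include <stdio.h>

  struct task {
          int normal_prio;   /* priority without any inheritance */
          int top_pi_prio;   /* prio of the top pi_waiters entry, -1 if none */
          int prio;          /* effective (possibly inherited) priority */
  };

  static void adjust_prio_sketch(struct task *t)
  {
          int prio = t->normal_prio;

          /*
           * Inherit the top pi waiter's priority when it is higher
           * (i.e. numerically lower).
           */
          if (t->top_pi_prio >= 0 && t->top_pi_prio < prio)
                  prio = t->top_pi_prio;

          /* The same assignment covers both boosting and deboosting. */
          t->prio = prio;
  }

  int main(void)
  {
          struct task owner = { .normal_prio = 20, .top_pi_prio = 5, .prio = 20 };

          adjust_prio_sketch(&owner);
          printf("boosted prio: %d\n", owner.prio);    /* 5 */

          owner.top_pi_prio = -1;                      /* last waiter left */
          adjust_prio_sketch(&owner);
          printf("deboosted prio: %d\n", owner.prio);  /* 20 */
          return 0;
  }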

It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task. That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.


High level overview of the PI chain walk
----------------------------------------

The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.

The implementation has gone through several iterations, and has ended up
with what we believe is the best. It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.

The rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.

rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting), a pointer to the mutex on which the task
is blocked, and a top_task as the top waiter of the mutex.

For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.

When this function is called, there are no locks held. That also means
that the state of the owner and lock can change when this function is entered.

Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
should be at, but the rbtree nodes of the task's waiter have not been updated
with the new priorities, and this task may not be in the proper locations
in the pi_waiters and waiters trees that the task is blocked on. This function
solves all that.
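
The locking pattern of that walk can be outlined as follows. This is a
heavily simplified, compilable sketch with invented "toy" names; only the
"at most two locks held at any time" structure is meant to mirror
rt_mutex_adjust_prio_chain, which additionally handles deadlock detection,
requeueing, and the many races that can occur while no locks are held::

  #include <stddef.h>

  struct toy_lock { int locked; };
  struct toy_task;

  struct toy_mutex {
          struct toy_lock  wait_lock;
          struct toy_task *owner;
  };

  struct toy_task {
          struct toy_lock   pi_lock;
          struct toy_mutex *blocked_on;   /* NULL when runnable */
  };

  static void toy_acquire(struct toy_lock *l) { l->locked = 1; }
  static void toy_release(struct toy_lock *l) { l->locked = 0; }
  static void requeue_and_adjust(struct toy_task *t) { (void)t; }

  static void chain_walk_sketch(struct toy_task *task)
  {
          while (task) {
                  struct toy_mutex *lock;
                  struct toy_task *owner;

                  toy_acquire(&task->pi_lock);          /* one lock held */
                  lock = task->blocked_on;
                  if (!lock) {
                          toy_release(&task->pi_lock);
                          break;                        /* end of chain */
                  }
                  toy_acquire(&lock->wait_lock);        /* two locks held */
                  toy_release(&task->pi_lock);          /* back to one */

                  owner = lock->owner;
                  toy_acquire(&owner->pi_lock);         /* two locks again */
                  requeue_and_adjust(owner);            /* boost/deboost owner */
                  toy_release(&owner->pi_lock);
                  toy_release(&lock->wait_lock);        /* no locks held */

                  task = owner;                         /* advance up the chain */
          }
  }

  int main(void)
  {
          struct toy_task t = { { 0 }, NULL };

          chain_walk_sketch(&t);
          return 0;
  }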


The main operation of this function is summarized by Thomas Gleixner in
rtmutex.c. See the 'Chain walk basics and protection scope' comment for further
details.

Taking of a mutex (The walk through)
------------------------------------

OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.

The first thing that is tried is the fast taking of the mutex. This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails). Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.

If there is contention on the lock, we go about the slow path
(rt_mutex_slowlock).

The slow path function is where the task's waiter structure is created on
the stack. This is because the waiter structure is only needed for the
scope of this function. The waiter structure holds the nodes to store
the task on the waiters tree of the mutex, and if need be, the pi_waiters
tree of the owner.

The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.

We then call try_to_take_rt_mutex. This is where an architecture that
does not implement CMPXCHG would always grab the lock (if there's no
contention).

try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path. The first thing that is done here is an atomic setting of
the "Has Waiters" flag of the mutex's owner field. By setting this flag
now, the current owner of the mutex being contended for can't release the mutex
without going into the slow unlock path, and it would then need to grab the
wait_lock, which this code currently holds. So setting the "Has Waiters" flag
forces the current owner to synchronize with this code.

The lock is taken if the following are true:

  1) The lock has no owner
  2) The current task is of the highest priority among all waiters
     of the lock

If the task succeeds in acquiring the lock, then the task is set as the
owner of the lock, and if the lock still has waiters, the top_waiter
(highest priority task waiting on the lock) is added to this task's
pi_waiters tree.

If the lock is not taken by try_to_take_rt_mutex(), then the
task_blocks_on_rt_mutex() function is called. This will add the task to
the lock's waiter tree and propagate the pi chain of the lock as well
as the lock's owner's pi_waiters tree. This is described in the next
section.
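
Before moving on, the fast path mentioned at the start of this walk through
can be sketched with a single compare-and-exchange on the owner field. The
toy types below and the use of C11 atomics are assumptions made for the
sketch; the kernel uses its own cmpxchg primitives on struct task_struct
pointers::

  #include <stdatomic.h>
  #include <stdint.h>

  struct toy_task { int dummy; };

  #define HAS_WAITERS 1UL   /* bit 0 of the owner field, as described above */

  struct toy_rtmutex {
          _Atomic uintptr_t owner;   /* task pointer | flag bit, 0 == unowned */
  };

  /* Fast-path lock: succeeds only if there is no owner (and no flag bits). */
  static int fast_lock(struct toy_rtmutex *m, struct toy_task *me)
  {
          uintptr_t expected = 0;

          return atomic_compare_exchange_strong(&m->owner, &expected,
                                                (uintptr_t)me);
  }

  /* Fast-path unlock: succeeds only if we own it and HAS_WAITERS is clear. */
  static int fast_unlock(struct toy_rtmutex *m, struct toy_task *me)
  {
          uintptr_t expected = (uintptr_t)me;   /* no HAS_WAITERS bit set */

          return atomic_compare_exchange_strong(&m->owner, &expected, 0);
  }

  int main(void)
  {
          struct toy_rtmutex m = { 0 };
          struct toy_task me;

          if (fast_lock(&m, &me))
                  fast_unlock(&m, &me);   /* no contention: both succeed */
          return 0;
  }

Any contention (a non-NULL owner, or the HAS_WAITERS bit being set) makes
the cmpxchg fail, which forces the slow path described in this and the
following sections.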


Task blocks on mutex
--------------------

The accounting of a mutex and process is done with the waiter structure of
the process. The "task" field is set to the process, and the "lock" field
to the mutex. The rbtree nodes of the waiter are initialized to the
process's current priority.

Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the waiters tree of the mutex. If the current process is the
highest priority process currently waiting on this mutex, then we remove the
previous top waiter process (if it exists) from the pi_waiters of the owner,
and add the current process to that tree. Since the pi_waiter of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.

If the owner is also blocked on a lock, and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.

Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).

Waking up in the loop
---------------------

The task can then wake up for a couple of reasons:

  1) The previous lock owner released the lock, and the task now is top_waiter
  2) We received a signal or timeout

In both cases, the task will try again to acquire the lock. If it
does, then it will take itself off the waiters tree and set itself back
to the TASK_RUNNING state.

In the first case, if the lock was acquired by another task before this task
could get the lock, then it will go back to sleep and wait to be woken again.

The second case is only applicable for tasks that are grabbing a mutex
that can wake up before getting the lock, either due to a signal or
a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to
take the lock again; if it succeeds, then the task will return with the
lock held, otherwise it will return with -EINTR if the task was woken
by a signal, or -ETIMEDOUT if it timed out.
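
The retry loop just described has roughly the following shape. This sketch
is self-contained with invented stand-in helpers; the real rt_mutex_slowlock
additionally performs the waiter setup, tree manipulation and cleanup
described above::

  #include <errno.h>

  struct toy_rtmutex;

  static int try_to_take_sketch(struct toy_rtmutex *m) { (void)m; return 1; }
  static int signal_pending_sketch(void)               { return 0; }
  static int timed_out_sketch(void)                    { return 0; }
  static void schedule_sketch(void)                    { }

  /* Sleep until we either acquire the lock or must give up. */
  static int slowlock_loop_sketch(struct toy_rtmutex *m)
  {
          for (;;) {
                  if (try_to_take_sketch(m))
                          return 0;             /* return with the lock held */

                  if (signal_pending_sketch())
                          return -EINTR;        /* woken by a signal */

                  if (timed_out_sketch())
                          return -ETIMEDOUT;    /* timeout expired */

                  schedule_sketch();            /* sleep until woken again */
          }
  }

  int main(void)
  {
          return slowlock_loop_sketch((struct toy_rtmutex *)0) ? 1 : 0;
  }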


Unlocking the Mutex
-------------------

The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG. Since the taking of a mutex on contention always sets the
"Has Waiters" flag of the mutex's owner, we use this to know if we need to
take the slow path when unlocking the mutex. If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.

If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.

The first thing done in the slow unlock path is to take the wait_lock of the
mutex. This synchronizes the locking and unlocking of the mutex.

A check is made to see if the mutex has waiters or not. On architectures that
do not have CMPXCHG, this is the location where the owner of the mutex will
determine if a waiter needs to be woken up or not. On architectures that
do have CMPXCHG, that check is done in the fast path, but it is still needed
in the slow path too. If a waiter of a mutex woke up because of a signal
or timeout between the time the owner failed the fast path CMPXCHG check and
the grabbing of the wait_lock, the mutex may not have any waiters, thus the
owner still needs to make this check. If there are no waiters then the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.

If there are waiters, then we need to wake one up.

In the wake up code, the pi_lock of the current owner is taken. The top
waiter of the lock is found and removed from the waiters tree of the mutex
as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
marked to prevent lower priority tasks from stealing the lock.

Finally we unlock the pi_lock of the pending owner and wake it up.


Contact
-------

For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>


Credits
-------

Author: Steven Rostedt <rostedt@goodmis.org>

Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017

Original Reviewers:
	Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
	Randy Dunlap

Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior

Updates
-------

This document was originally written for 2.6.17-rc3-mm1 and was
updated on 4.12.