==============================
RT-mutex implementation design
==============================

Copyright (c) 2006 Steven Rostedt

Licensed under the GNU Free Documentation License, Version 1.2


This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists.  For that please see
Documentation/locking/rt-mutex.rst.  This document does explain problems
that can happen without this code, but only as background for understanding
what the code actually does.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run.  This happens for several reasons, and
most of the time it can't be helped.  Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource.  This is a priority inversion.  What we want to prevent
is something called unbounded priority inversion.  That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between.  A tries to grab a lock
that C owns and must wait, letting C run so that it can release the lock.  But
in the meantime, B executes, and since B is of a higher priority than C, it
preempts C.  But by doing so, it is in fact preempting A, which is a higher
priority process.  Now there's no way of knowing how long A will be sleeping
waiting for C to release the lock, because for all we know, B is a CPU hog and
will never give C a chance to release the lock.  This is called unbounded
priority inversion.

Here's a little ASCII art to show the problem::

     grab lock L1 (owned by C)
       |
  A ---+
          C preempted by B
            |
  C    +----+

  B         +-------->
                  B now keeps A from running.


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but the others are out of scope
for this document.  Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process.  To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A.  So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A.  As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.

Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain
         - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks.  This is described in more detail
           later in this document.

mutex
         - In this document, to differentiate between the locks that implement
           PI and the spin locks that are used in the PI code, from now on
           the PI locks will be called mutexes.

lock
         - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm.  These locks disable preemption for UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock
         - Same as lock above.

waiter
         - A waiter is a struct that is stored on the stack of a blocked
           process.  Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable).  This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on.  It also has rbtree node structures to
           place the task in the waiters rbtree of a mutex as well as the
           pi_waiters rbtree of a mutex owner task (described below).

           waiter is sometimes used in reference to the task that is waiting
           on a mutex.  This is the same as waiter->task.

waiters
         - A list of processes that are blocked on a mutex.

top waiter
         - The highest priority process waiting on a specific mutex.

top pi waiter
         - The highest priority process waiting on one of the mutexes
           that a specific process owns.

Note:
       task and process are used interchangeably in this document, mostly to
       differentiate between two processes that are being described together.


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place.  Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example::

   Process:  A, B, C, D, E
   Mutexes:  L1, L2, L3, L4

   A owns: L1
           B blocked on L1
           B owns L2
                  C blocked on L2
                  C owns L3
                         D blocked on L3
                         D owns L4
                                E blocked on L4

The chain would be::

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be::

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains::

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also, since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes.  If we add another process G that is
blocked on mutex L2::

  G->L2->B->L1->A

And once again, to show how this can grow, I will show the merging chains
again::

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
                 G-+     +->B->L1->A
                         |
                   F->L5-+

If process G has the highest priority in the chain, then all the tasks up
the chain (A and B in this example) must have their priorities increased
to that of G.

Mutex Waiters Tree
------------------

Every mutex keeps track of all the waiters that are blocked on itself.  The
mutex has an rbtree to store these waiters by priority.  This tree is protected
by a spin lock that is located in the struct of the mutex.  This lock is called
wait_lock.


Task PI Tree
------------

To keep track of the PI chains, each process has its own PI rbtree.  This is
a tree of all top waiters of the mutexes that are owned by the process.
Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task.  So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this tree.

This tree is stored in the task structure of a process as an rbtree called
pi_waiters.  It is protected by a spin lock also in the task structure,
called pi_lock.  This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
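
To make these relationships concrete, here is a simplified sketch of the
data structures involved.  The field names follow the kernel's rtmutex code,
but this is an illustrative outline, not the exact current definitions::

  /*
   * Simplified sketch; see kernel/locking/rtmutex_common.h for the
   * real definitions.
   */

  struct rt_mutex {
        raw_spinlock_t          wait_lock;     /* protects the waiters tree */
        struct rb_root_cached   waiters;       /* waiters, sorted by priority */
        struct task_struct      *owner;        /* owner; bit 0 = "Has Waiters" */
  };

  struct rt_mutex_waiter {
        struct rb_node          tree_entry;    /* node in lock->waiters */
        struct rb_node          pi_tree_entry; /* node in owner's pi_waiters */
        struct task_struct      *task;         /* the blocked task */
        struct rt_mutex         *lock;         /* the mutex blocked on */
  };

  /* The PI related fields in struct task_struct: */
  struct task_struct {
        /* ... */
        raw_spinlock_t          pi_lock;       /* protects pi_waiters */
        struct rb_root_cached   pi_waiters;    /* top waiter of each owned mutex */
        /* ... */
  };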


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined.  But it is very complex to figure out, since it depends on all
the nesting of mutexes.  Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way::

  void func1(void)
  {
        mutex_lock(L1);

        /* do anything */

        mutex_unlock(L1);
  }

  void func2(void)
  {
        mutex_lock(L1);
        mutex_lock(L2);

        /* do something */

        mutex_unlock(L2);
        mutex_unlock(L1);
  }

  void func3(void)
  {
        mutex_lock(L2);
        mutex_lock(L3);

        /* do something else */

        mutex_unlock(L3);
        mutex_unlock(L2);
  }

  void func4(void)
  {
        mutex_lock(L3);

        /* do something again */

        mutex_unlock(L3);
  }

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D run functions func1, func2, func3 and func4
respectively, such that D runs first and A last.  With D being preempted
in func4 in the "do something again" area, we have a locking order that
follows::

  D owns L3
         C blocked on L3
         C owns L2
                B blocked on L2
                B owns L1
                       A blocked on L1

  And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though each has at most a locking
depth of two.  So, although the locking depth is defined at compile time,
it still is very difficult to find all the possibilities of that depth.

Now since mutexes can be defined by user-land applications, we don't want a
DoS type of application that nests a large number of mutexes to create a large
PI chain, and have the code holding spin locks while looking at a large
amount of data.  So, to prevent this, the implementation not only implements
a maximum lock depth, but also only holds at most two different locks at a
time, as it walks the PI chain.  More about this below.


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex.  If the
mutex is not owned, this owner is set to NULL.  Since all architectures
have the task structure aligned to at least two bytes (and if this is
not true, the rtmutex.c code will be broken!), this allows for the least
significant bit to be used as a flag.  Bit 0 is used as the "Has Waiters"
flag.  It's set whenever there are waiters on a mutex.

See Documentation/locking/rt-mutex.rst for further details.
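
As an illustration, the encoding can be pictured with helpers like the
following.  This is a sketch of the idea, not the literal kernel code::

  #define RT_MUTEX_HAS_WAITERS  1UL

  /* The owner task pointer, with the flag bit masked off. */
  static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
  {
        unsigned long val = (unsigned long)lock->owner;

        return (struct task_struct *)(val & ~RT_MUTEX_HAS_WAITERS);
  }

  /* True if bit 0 ("Has Waiters") is set in the owner field. */
  static inline bool rt_mutex_lock_has_waiters(struct rt_mutex *lock)
  {
        return (unsigned long)lock->owner & RT_MUTEX_HAS_WAITERS;
  }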

cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange).  This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically::

  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  {
        unsigned long T = *A;

        /* Update A only if it holds the expected value (*B). */
        if (*A == *B)
                *A = *C;

        /* The old value of A is returned either way. */
        return T;
  }
  #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)

This is really nice to have, since it allows you to update a variable only
if the variable is what you expect it to be.  You know it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes.  If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time.  But if CMPXCHG is supported, then this helps
greatly in keeping the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it.  This will also be explained
later in this document.
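
A plausible shape for this macro is sketched below: on architectures with a
usable cmpxchg it compares-and-swaps the owner field, and otherwise it simply
reports failure so that the slow path is always taken.  This is a sketch of
the concept (the config symbol is hypothetical), not the literal kernel
macro::

  #ifdef ARCH_HAS_USABLE_CMPXCHG        /* hypothetical config symbol */
  # define rt_mutex_cmpxchg(l, c, n)    (cmpxchg(&(l)->owner, c, n) == c)
  #else
  # define rt_mutex_cmpxchg(l, c, n)    (0)   /* always use the slow path */
  #endif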


Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority.  With the help of the pi_waiters tree of
a process, it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio
and rt_mutex_setprio.  rt_mutex_setprio is only used in rt_mutex_adjust_prio.

rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task.
Since the pi_waiters of a task holds, ordered by priority, all the top waiters
of all the mutexes that the task owns, we simply need to compare the top
pi waiter to the task's own normal/deadline priority and take the higher one.
Then rt_mutex_setprio is called to adjust the priority of the task to the
new priority.  Note that rt_mutex_setprio is defined in kernel/sched/core.c
to implement the actual change in priority.

Note:
	For the "prio" field in task_struct, the lower the number, the
	higher the priority.  A "prio" of 5 is of higher priority than a
	"prio" of 10.

It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task.  In the case that a higher priority
process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority.  But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task.  That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
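
In code form, the comparison amounts to something like the sketch below.
The helper names (task_has_pi_waiters, task_top_pi_waiter) mirror those in
rtmutex.c, but this is a simplified sketch, not the exact implementation
(remember: a lower "prio" number means a higher priority)::

  /* Sketch of the (de)boost decision made for a task. */
  int new_prio = task->normal_prio;

  if (task_has_pi_waiters(task))
        new_prio = min(task_top_pi_waiter(task)->prio, new_prio);

  if (task->prio != new_prio)
        rt_mutex_setprio(task, new_prio);       /* boost or unboost */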


High level overview of the PI chain walk
----------------------------------------

The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.

The implementation has gone through several iterations, and has ended up
with what we believe is the best.  It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.

rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.

rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting), a pointer to the mutex on which the task
is blocked, and a top_task as the top waiter of the mutex.
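
For reference, that parameter list corresponds to a signature along these
lines.  This is taken as a sketch only; the exact types and names vary
between kernel versions::

  static int rt_mutex_adjust_prio_chain(struct task_struct *task,
                                        enum rtmutex_chainwalk chwalk,
                                        struct rt_mutex *orig_lock,
                                        struct rt_mutex *next_lock,
                                        struct rt_mutex_waiter *orig_waiter,
                                        struct task_struct *top_task);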

For this explanation, I will not mention deadlock detection.  This explanation
will try to stay at a high level.

When this function is called, there are no locks held.  That also means
that the state of the owner and lock can change while this function executes.

Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it.  This means that the task is set to the priority that it
should be at, but the rbtree nodes of the task's waiter have not been updated
with the new priorities, and this task may not be in the proper locations
in the pi_waiters and waiters trees that the task is blocked on.  This function
solves all that.

The main operation of this function is summarized by Thomas Gleixner in
rtmutex.c.  See the 'Chain walk basics and protection scope' comment for
further details.

Taking of a mutex (The walk through)
------------------------------------

OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.

The first thing that is tried is the fast taking of the mutex.  This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails).  Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.
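
The fast path therefore boils down to a single atomic operation, roughly as
sketched here (argument lists are trimmed for brevity)::

  /*
   * Sketch: uncontended acquisition.  If the owner field was NULL,
   * atomically replace it with current and we now own the mutex.
   */
  if (rt_mutex_cmpxchg(lock, NULL, current))
        return;                 /* fast path: lock acquired */

  rt_mutex_slowlock(lock);      /* contended (or no CMPXCHG): slow path */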

If there is contention on the lock, we take the slow path
(rt_mutex_slowlock).

The slow path function is where the task's waiter structure is created on
the stack.  This is because the waiter structure is only needed for the
scope of this function.  The waiter structure holds the nodes to store
the task on the waiters tree of the mutex, and if need be, the pi_waiters
tree of the owner.

The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.

We then call try_to_take_rt_mutex.  This is where an architecture that
does not implement CMPXCHG would always grab the lock (if there's no
contention).

try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path.  The first thing that is done here is an atomic setting of
the "Has Waiters" flag of the mutex's owner field.  By setting this flag
now, the current owner of the mutex being contended for can't release the mutex
without going into the slow unlock path, and it would then need to grab the
wait_lock, which this code currently holds.  So setting the "Has Waiters" flag
forces the current owner to synchronize with this code.

The lock is taken if the following are true:

   1) The lock has no owner
   2) The current task is the highest priority against all other
      waiters of the lock

If the task succeeds in acquiring the lock, then the task is set as the
owner of the lock, and if the lock still has waiters, the top_waiter
(highest priority task waiting on the lock) is added to this task's
pi_waiters tree.
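
Putting those rules into pseudo-C, the core of the decision looks roughly
like the sketch below.  The helper names are illustrative, and details such
as removing our own waiter from the trees are omitted::

  /* Simplified sketch of try_to_take_rt_mutex(); not the real code. */
  mark_rt_mutex_waiters(lock);          /* atomically set "Has Waiters" */

  if (rt_mutex_owner(lock))
        return 0;                       /* 1) the lock already has an owner */

  if (rt_mutex_has_waiters(lock) &&
      current != rt_mutex_top_waiter(lock)->task)
        return 0;                       /* 2) a higher priority waiter exists */

  rt_mutex_set_owner(lock, current);    /* success: become the owner */
  if (rt_mutex_has_waiters(lock))       /* track the remaining top waiter */
        rt_mutex_enqueue_pi(current, rt_mutex_top_waiter(lock));
  return 1;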

If the lock is not taken by try_to_take_rt_mutex(), then the
task_blocks_on_rt_mutex() function is called.  This will add the task to
the lock's waiter tree and propagate the pi chain of the lock as well
as the lock's owner's pi_waiters tree.  This is described in the next
section.

Task blocks on mutex
--------------------

The accounting of a mutex and process is done with the waiter structure of
the process.  The "task" field is set to the process, and the "lock" field
to the mutex.  The rbtree nodes of the waiter are initialized to the
process's current priority.

Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the lock's waiter tree.  If the current process is the
highest priority process currently waiting on this mutex, then we remove the
previous top waiter process (if it exists) from the pi_waiters of the owner,
and add the current process to that tree.  Since the pi_waiters of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.

If the owner is also blocked on a lock, and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.

Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).

Waking up in the loop
---------------------

The task can then wake up for a couple of reasons:

  1) The previous lock owner released the lock, and the task now is top_waiter
  2) We received a signal or timeout

In both cases, the task will try again to acquire the lock.  If it
does, then it will take itself off the waiters tree and set itself back
to the TASK_RUNNING state.

In the first case, if the lock was acquired by another task before this task
could get the lock, then it will go back to sleep and wait to be woken again.

The second case is only applicable for tasks that are grabbing a mutex
that can wake up before getting the lock, either due to a signal or
a timeout (i.e. rt_mutex_timed_futex_lock()).  When woken, it will try to
take the lock again; if it succeeds, the task will return with the
lock held, otherwise it will return with -EINTR if the task was woken
by a signal, or -ETIMEDOUT if it timed out.
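
The wait loop in the slow path can be pictured with the simplified sketch
below.  Error handling and the actual kernel helpers are elided, and the
timeout_expired flag is hypothetical::

  /* Sketch of the blocking loop in rt_mutex_slowlock(); not the real code. */
  set_current_state(TASK_INTERRUPTIBLE);
  for (;;) {
        if (try_to_take_rt_mutex(lock))
                break;                          /* got the lock */

        if (signal_pending(current)) {
                ret = -EINTR;                   /* woken by a signal */
                break;
        }
        if (timeout_expired) {                  /* hypothetical flag */
                ret = -ETIMEDOUT;               /* the timeout fired */
                break;
        }

        raw_spin_unlock_irq(&lock->wait_lock);
        schedule();                             /* sleep until woken */
        raw_spin_lock_irq(&lock->wait_lock);
        set_current_state(TASK_INTERRUPTIBLE);
  }
  __set_current_state(TASK_RUNNING);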


Unlocking the Mutex
-------------------

The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG.  Since the taking of a mutex on contention always sets the
"Has Waiters" flag of the mutex's owner, we use this to know if we need to
take the slow path when unlocking the mutex.  If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.

If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.
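
In sketch form, the unlock side mirrors the lock side (again with argument
lists trimmed)::

  /*
   * Sketch: if there are no waiters (bit 0 clear), the owner field holds
   * just the task pointer and can be replaced with NULL atomically.
   */
  if (rt_mutex_cmpxchg(lock, current, NULL))
        return;                 /* fast path: mutex released */

  rt_mutex_slowunlock(lock);    /* "Has Waiters" set, or no CMPXCHG */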

The first thing done in the slow unlock path is to take the wait_lock of the
mutex.  This synchronizes the locking and unlocking of the mutex.

A check is made to see if the mutex has waiters or not.  On architectures that
do not have CMPXCHG, this is where the owner of the mutex determines whether a
waiter needs to be woken up.  On architectures that do have CMPXCHG, that
check is done in the fast path, but it is still needed in the slow path too.
If a waiter of a mutex woke up because of a signal or timeout between the
time the owner failed the fast path CMPXCHG check and the grabbing of the
wait_lock, the mutex may not have any waiters, thus the owner still needs to
make this check.  If there are no waiters, then the mutex owner field is set
to NULL, the wait_lock is released and nothing more is needed.

If there are waiters, then we need to wake one up.

In the wake up code, the pi_lock of the current owner is taken.  The top
waiter of the lock is found and removed from the waiters tree of the mutex
as well as the pi_waiters tree of the current owner.  The "Has Waiters" bit is
set to prevent lower priority tasks from stealing the lock.

Finally we release the pi_lock of the current owner and wake up the top waiter.


Contact
-------

For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>


Credits
-------

Author:  Steven Rostedt <rostedt@goodmis.org>

Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017

Original Reviewers:
                     Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
                     Randy Dunlap

Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior

Updates
-------

This document was originally written for 2.6.17-rc3-mm1 and was updated
on 4.12.