==============================
RT-mutex implementation design
==============================

Copyright (c) 2006 Steven Rostedt

Licensed under the GNU Free Documentation License, Version 1.2


This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists.  For that please see
Documentation/locking/rt-mutex.rst.  This document does explain problems
that can happen without this code, but only as background for understanding
what the code actually does.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run.  This happens for several reasons, and
most of the time it can't be helped.  Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource.  This is a priority inversion.  What we want to prevent
is something called unbounded priority inversion.  That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between.  A tries to grab a lock
that C owns and must wait, letting C run so that it can release the lock.  But
in the meantime, B executes, and since B is of a higher priority than C, it
preempts C.  But by doing so, it is in fact preempting A, which is a higher
priority process.  Now there's no way of knowing how long A will be sleeping
waiting for C to release the lock, because for all we know, B is a CPU hog and
will never give C a chance to release the lock.  This is called unbounded
priority inversion.

Here's a little ASCII art to show the problem::

     grab lock L1 (owned by C)
       |
  A ---+
          C preempted by B
            |
  C    +----+

  B         +-------->
                  B now keeps A from running.


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but the others are out of scope
for this document.  Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process.  To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A.  So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A.  As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.

Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain
         - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks.  This is described in more detail
           later in this document.

mutex
         - In this document, to differentiate between the locks that implement
           PI and the spin locks that are used in the PI code, from now on
           the PI locks will be called mutexes.

lock
         - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm.  These locks disable preemption for UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock
         - Same as lock above.

waiter
         - A waiter is a struct that is stored on the stack of a blocked
           process.  Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable).  This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on.  It also has rbtree node structures to
           place the task in the waiters rbtree of a mutex as well as the
           pi_waiters rbtree of a mutex owner task (described below).

           waiter is sometimes used in reference to the task that is waiting
           on a mutex.  This is the same as waiter->task.

waiters
         - A list of processes that are blocked on a mutex.

top waiter
         - The highest priority process waiting on a specific mutex.

top pi waiter
         - The highest priority process waiting on one of the mutexes
           that a specific process owns.

Note:
       task and process are used interchangeably in this document, mostly to
       differentiate between two processes that are being described together.


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place.  Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example::

   Process:  A, B, C, D, E
   Mutexes:  L1, L2, L3, L4

   A owns: L1
           B blocked on L1
           B owns L2
                  C blocked on L2
                  C owns L3
                         D blocked on L3
                         D owns L4
                                E blocked on L4

The chain would be::

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be::

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains::

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also, since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes.  If we add another process G that is
blocked on mutex L2::

  G->L2->B->L1->A

And once again, to show how this can grow, I will show the merging chains
again::

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
                 G-+     +->B->L1->A
                         |
                   F->L5-+

If process G has the highest priority in the chain, then all the tasks up
the chain (A and B in this example) must have their priorities increased
to that of G.

Mutex Waiters Tree
------------------

Every mutex keeps track of all the waiters that are blocked on itself.  The
mutex has an rbtree to store these waiters by priority.  This tree is protected
by a spin lock that is located in the struct of the mutex.  This lock is called
wait_lock.


Task PI Tree
------------

To keep track of the PI chains, each process has its own PI rbtree.  This is
a tree of all top waiters of the mutexes that are owned by the process.
Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task.  So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this tree.

This tree is stored in the task structure of a process as an rbtree called
pi_waiters.  It is protected by a spin lock also in the task structure,
called pi_lock.  This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
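
To make these relationships concrete, here is a simplified sketch of the
data structures involved.  The field names follow the kernel's rtmutex code,
but this is an illustrative outline, not the exact current definitions::

  /*
   * Simplified sketch; see kernel/locking/rtmutex_common.h for the
   * real definitions.
   */

  struct rt_mutex {
        raw_spinlock_t          wait_lock;     /* protects the waiters tree */
        struct rb_root_cached   waiters;       /* waiters, sorted by priority */
        struct task_struct      *owner;        /* owner; bit 0 = "Has Waiters" */
  };

  struct rt_mutex_waiter {
        struct rb_node          tree_entry;    /* node in lock->waiters */
        struct rb_node          pi_tree_entry; /* node in owner's pi_waiters */
        struct task_struct      *task;         /* the blocked task */
        struct rt_mutex         *lock;         /* the mutex blocked on */
  };

  /* The PI related fields in struct task_struct: */
  struct task_struct {
        /* ... */
        raw_spinlock_t          pi_lock;       /* protects pi_waiters */
        struct rb_root_cached   pi_waiters;    /* top waiter of each owned mutex */
        /* ... */
  };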


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined.  But it is very complex to figure out, since it depends on all
the nesting of mutexes.  Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way::

  void func1(void)
  {
        mutex_lock(L1);

        /* do anything */

        mutex_unlock(L1);
  }

  void func2(void)
  {
        mutex_lock(L1);
        mutex_lock(L2);

        /* do something */

        mutex_unlock(L2);
        mutex_unlock(L1);
  }

  void func3(void)
  {
        mutex_lock(L2);
        mutex_lock(L3);

        /* do something else */

        mutex_unlock(L3);
        mutex_unlock(L2);
  }

  void func4(void)
  {
        mutex_lock(L3);

        /* do something again */

        mutex_unlock(L3);
  }

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D run functions func1, func2, func3 and func4
respectively, such that D runs first and A last.  With D being preempted
in func4 in the "do something again" area, we have a locking order that
follows::

  D owns L3
         C blocked on L3
         C owns L2
                B blocked on L2
                B owns L1
                       A blocked on L1

  And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though each has at most a locking
depth of two.  So, although the locking depth is defined at compile time,
it still is very difficult to find all the possibilities of that depth.

Now since mutexes can be defined by user-land applications, we don't want a
DoS type of application that nests a large number of mutexes to create a large
PI chain, and have the code holding spin locks while looking at a large
amount of data.  So, to prevent this, the implementation not only implements
a maximum lock depth, but also only holds at most two different locks at a
time, as it walks the PI chain.  More about this below.


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex.  If the
mutex is not owned, this owner is set to NULL.  Since all architectures
have the task structure aligned to at least two bytes (and if this is
not true, the rtmutex.c code will be broken!), this allows for the least
significant bit to be used as a flag.  Bit 0 is used as the "Has Waiters"
flag.  It's set whenever there are waiters on a mutex.

See Documentation/locking/rt-mutex.rst for further details.
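
As an illustration, the encoding can be pictured with helpers like the
following.  This is a sketch of the idea, not the literal kernel code::

  #define RT_MUTEX_HAS_WAITERS  1UL

  /* The owner task pointer, with the flag bit masked off. */
  static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
  {
        unsigned long val = (unsigned long)lock->owner;

        return (struct task_struct *)(val & ~RT_MUTEX_HAS_WAITERS);
  }

  /* True if bit 0 ("Has Waiters") is set in the owner field. */
  static inline bool rt_mutex_lock_has_waiters(struct rt_mutex *lock)
  {
        return (unsigned long)lock->owner & RT_MUTEX_HAS_WAITERS;
  }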

cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange).  This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically::

  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  {
        unsigned long T = *A;

        /* Update A only if it holds the expected value (*B). */
        if (*A == *B)
                *A = *C;

        /* The old value of A is returned either way. */
        return T;
  }
  #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)

This is really nice to have, since it allows you to update a variable only
if the variable is what you expect it to be.  You know it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes.  If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time.  But if CMPXCHG is supported, then this helps
greatly in keeping the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it.  This will also be explained
later in this document.
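
A plausible shape for this macro is sketched below: on architectures with a
usable cmpxchg it compares-and-swaps the owner field, and otherwise it simply
reports failure so that the slow path is always taken.  This is a sketch of
the concept (the config symbol is hypothetical), not the literal kernel
macro::

  #ifdef ARCH_HAS_USABLE_CMPXCHG        /* hypothetical config symbol */
  # define rt_mutex_cmpxchg(l, c, n)    (cmpxchg(&(l)->owner, c, n) == c)
  #else
  # define rt_mutex_cmpxchg(l, c, n)    (0)   /* always use the slow path */
  #endif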


Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority.  With the help of the pi_waiters tree of
a process, it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio
and rt_mutex_setprio.  rt_mutex_setprio is only used in rt_mutex_adjust_prio.

rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task.
Since the pi_waiters of a task holds, ordered by priority, all the top waiters
of all the mutexes that the task owns, we simply need to compare the top
pi waiter to the task's own normal/deadline priority and take the higher one.
Then rt_mutex_setprio is called to adjust the priority of the task to the
new priority.  Note that rt_mutex_setprio is defined in kernel/sched/core.c
to implement the actual change in priority.

Note:
	For the "prio" field in task_struct, the lower the number, the
	higher the priority.  A "prio" of 5 is of higher priority than a
	"prio" of 10.

It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task.  In the case that a higher priority
process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority.  But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task.  That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
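
In code form, the comparison amounts to something like the sketch below.
The helper names (task_has_pi_waiters, task_top_pi_waiter) mirror those in
rtmutex.c, but this is a simplified sketch, not the exact implementation
(remember: a lower "prio" number means a higher priority)::

  /* Sketch of the (de)boost decision made for a task. */
  int new_prio = task->normal_prio;

  if (task_has_pi_waiters(task))
        new_prio = min(task_top_pi_waiter(task)->prio, new_prio);

  if (task->prio != new_prio)
        rt_mutex_setprio(task, new_prio);       /* boost or unboost */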


High level overview of the PI chain walk
----------------------------------------

The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.

The implementation has gone through several iterations, and has ended up
with what we believe is the best.  It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.

rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.

rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting), a pointer to the mutex on which the task
is blocked, and a top_task as the top waiter of the mutex.
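
For reference, that parameter list corresponds to a signature along these
lines.  This is taken as a sketch only; the exact types and names vary
between kernel versions::

  static int rt_mutex_adjust_prio_chain(struct task_struct *task,
                                        enum rtmutex_chainwalk chwalk,
                                        struct rt_mutex *orig_lock,
                                        struct rt_mutex *next_lock,
                                        struct rt_mutex_waiter *orig_waiter,
                                        struct task_struct *top_task);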

For this explanation, I will not mention deadlock detection.  This explanation
will try to stay at a high level.

When this function is called, there are no locks held.  That also means
that the state of the owner and lock can change while this function executes.

Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it.  This means that the task is set to the priority that it
should be at, but the rbtree nodes of the task's waiter have not been updated
with the new priorities, and this task may not be in the proper locations
in the pi_waiters and waiters trees that the task is blocked on.  This function
solves all that.

The main operation of this function is summarized by Thomas Gleixner in
rtmutex.c.  See the 'Chain walk basics and protection scope' comment for
further details.

Taking of a mutex (The walk through)
------------------------------------

OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.

The first thing that is tried is the fast taking of the mutex.  This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails).  Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.
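
The fast path therefore boils down to a single atomic operation, roughly as
sketched here (argument lists are trimmed for brevity)::

  /*
   * Sketch: uncontended acquisition.  If the owner field was NULL,
   * atomically replace it with current and we now own the mutex.
   */
  if (rt_mutex_cmpxchg(lock, NULL, current))
        return;                 /* fast path: lock acquired */

  rt_mutex_slowlock(lock);      /* contended (or no CMPXCHG): slow path */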

If there is contention on the lock, we take the slow path
(rt_mutex_slowlock).

The slow path function is where the task's waiter structure is created on
the stack.  This is because the waiter structure is only needed for the
scope of this function.  The waiter structure holds the nodes to store
the task on the waiters tree of the mutex, and if need be, the pi_waiters
tree of the owner.

The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.

We then call try_to_take_rt_mutex.  This is where an architecture that
does not implement CMPXCHG would always grab the lock (if there's no
contention).

try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path.  The first thing that is done here is an atomic setting of
the "Has Waiters" flag of the mutex's owner field.  By setting this flag
now, the current owner of the mutex being contended for can't release the mutex
without going into the slow unlock path, and it would then need to grab the
wait_lock, which this code currently holds.  So setting the "Has Waiters" flag
forces the current owner to synchronize with this code.

The lock is taken if the following are true:

   1) The lock has no owner
   2) The current task is the highest priority against all other
      waiters of the lock

If the task succeeds in acquiring the lock, then the task is set as the
owner of the lock, and if the lock still has waiters, the top_waiter
(highest priority task waiting on the lock) is added to this task's
pi_waiters tree.
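
Putting those rules into pseudo-C, the core of the decision looks roughly
like the sketch below.  The helper names are illustrative, and details such
as removing our own waiter from the trees are omitted::

  /* Simplified sketch of try_to_take_rt_mutex(); not the real code. */
  mark_rt_mutex_waiters(lock);          /* atomically set "Has Waiters" */

  if (rt_mutex_owner(lock))
        return 0;                       /* 1) the lock already has an owner */

  if (rt_mutex_has_waiters(lock) &&
      current != rt_mutex_top_waiter(lock)->task)
        return 0;                       /* 2) a higher priority waiter exists */

  rt_mutex_set_owner(lock, current);    /* success: become the owner */
  if (rt_mutex_has_waiters(lock))       /* track the remaining top waiter */
        rt_mutex_enqueue_pi(current, rt_mutex_top_waiter(lock));
  return 1;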

If the lock is not taken by try_to_take_rt_mutex(), then the
task_blocks_on_rt_mutex() function is called.  This will add the task to
the lock's waiter tree and propagate the pi chain of the lock as well
as the lock's owner's pi_waiters tree.  This is described in the next
section.

Task blocks on mutex
--------------------

The accounting of a mutex and process is done with the waiter structure of
the process.  The "task" field is set to the process, and the "lock" field
to the mutex.  The rbtree nodes of the waiter are initialized to the
process's current priority.

Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the lock's waiter tree.  If the current process is the
highest priority process currently waiting on this mutex, then we remove the
previous top waiter process (if it exists) from the pi_waiters of the owner,
and add the current process to that tree.  Since the pi_waiters of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.

If the owner is also blocked on a lock, and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.

Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).

Waking up in the loop
---------------------

The task can then wake up for a couple of reasons:

  1) The previous lock owner released the lock, and the task now is top_waiter
  2) We received a signal or timeout

In both cases, the task will try again to acquire the lock.  If it
does, then it will take itself off the waiters tree and set itself back
to the TASK_RUNNING state.

In the first case, if the lock was acquired by another task before this task
could get the lock, then it will go back to sleep and wait to be woken again.

The second case is only applicable for tasks that are grabbing a mutex
that can wake up before getting the lock, either due to a signal or
a timeout (i.e. rt_mutex_timed_futex_lock()).  When woken, it will try to
take the lock again; if it succeeds, the task will return with the
lock held, otherwise it will return with -EINTR if the task was woken
by a signal, or -ETIMEDOUT if it timed out.
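
The wait loop in the slow path can be pictured with the simplified sketch
below.  Error handling and the actual kernel helpers are elided, and the
timeout_expired flag is hypothetical::

  /* Sketch of the blocking loop in rt_mutex_slowlock(); not the real code. */
  set_current_state(TASK_INTERRUPTIBLE);
  for (;;) {
        if (try_to_take_rt_mutex(lock))
                break;                          /* got the lock */

        if (signal_pending(current)) {
                ret = -EINTR;                   /* woken by a signal */
                break;
        }
        if (timeout_expired) {                  /* hypothetical flag */
                ret = -ETIMEDOUT;               /* the timeout fired */
                break;
        }

        raw_spin_unlock_irq(&lock->wait_lock);
        schedule();                             /* sleep until woken */
        raw_spin_lock_irq(&lock->wait_lock);
        set_current_state(TASK_INTERRUPTIBLE);
  }
  __set_current_state(TASK_RUNNING);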


Unlocking the Mutex
-------------------

The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG.  Since the taking of a mutex on contention always sets the
"Has Waiters" flag of the mutex's owner, we use this to know if we need to
take the slow path when unlocking the mutex.  If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.

If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.
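
In sketch form, the unlock side mirrors the lock side (again with argument
lists trimmed)::

  /*
   * Sketch: if there are no waiters (bit 0 clear), the owner field holds
   * just the task pointer and can be replaced with NULL atomically.
   */
  if (rt_mutex_cmpxchg(lock, current, NULL))
        return;                 /* fast path: mutex released */

  rt_mutex_slowunlock(lock);    /* "Has Waiters" set, or no CMPXCHG */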

The first thing done in the slow unlock path is to take the wait_lock of the
mutex.  This synchronizes the locking and unlocking of the mutex.

A check is made to see if the mutex has waiters or not.  On architectures that
do not have CMPXCHG, this is where the owner of the mutex determines whether a
waiter needs to be woken up.  On architectures that do have CMPXCHG, that
check is done in the fast path, but it is still needed in the slow path too.
If a waiter of a mutex woke up because of a signal or timeout between the
time the owner failed the fast path CMPXCHG check and the grabbing of the
wait_lock, the mutex may not have any waiters, thus the owner still needs to
make this check.  If there are no waiters, then the mutex owner field is set
to NULL, the wait_lock is released and nothing more is needed.

If there are waiters, then we need to wake one up.

In the wake up code, the pi_lock of the current owner is taken.  The top
waiter of the lock is found and removed from the waiters tree of the mutex
as well as the pi_waiters tree of the current owner.  The "Has Waiters" bit is
set to prevent lower priority tasks from stealing the lock.

Finally we release the pi_lock of the current owner and wake up the top waiter.


Contact
-------

For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>


Credits
-------

Author:  Steven Rostedt <rostedt@goodmis.org>

Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017

Original Reviewers:
                     Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
                     Randy Dunlap

Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior

Updates
-------

This document was originally written for 2.6.17-rc3-mm1 and was updated
on 4.12.