================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; leaving it ownerless would
break the PI boosting logic [see rt-mutex-design.rst].  For brevity,
this action is referred to as "requeue_pi" throughout this document.
Priority inheritance is abbreviated throughout as "PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation.  An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

	/* caller must lock mutex */
	pthread_cond_wait(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while (...);
		unlock(cond->__data.__lock);
		/* the mutex is reacquired only after returning to user space */
		lock(mutex);
	}

	pthread_cond_broadcast(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue(cond->__data.__futex, cond->mutex);
	}

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space.  This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvars, the kernel needs to
be able to requeue tasks to PI futexes.  This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex.  The glibc implementation
would be modified as follows::

	/* caller must lock mutex */
	pthread_cond_wait_pi(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait_requeue_pi(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while (...);
		unlock(cond->__data.__lock);
		/* the kernel acquired the mutex for us */
	}

	pthread_cond_broadcast_pi(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue_pi(cond->__data.__futex, cond->mutex);
	}

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases.  Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run.  This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex operations provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex.  The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.
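
As a rough user-space illustration of the waiter side, the sketch below
wraps the raw futex system call with the FUTEX_WAIT_REQUEUE_PI operation.
The helper names futex_wait_requeue_pi() and probe_wait_requeue_pi() are
hypothetical and are not glibc or kernel interfaces; the argument order
follows the futex(2) man page, which also describes the timeout for this
operation as absolute.  This is a minimal sketch under those assumptions,
not the glibc implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a glibc or kernel API): block on the non-PI
 * futex word at cond_futex while it still holds `expected`, and wait to
 * be requeued onto the PI futex word at pi_mutex.
 *
 * Returns 0 once the caller has been requeued and woken, at which point
 * it is expected to hold the PI futex already.  Returns -1 with errno
 * set otherwise.
 */
static long futex_wait_requeue_pi(uint32_t *cond_futex, uint32_t expected,
				  const struct timespec *abs_timeout,
				  uint32_t *pi_mutex)
{
	/* Source and target must be distinct futex words. */
	assert(cond_futex != pi_mutex);

	return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
		       expected, abs_timeout, pi_mutex, 0);
}

/*
 * Non-blocking probe: the futex word holds 0 but the caller claims to
 * expect 1, so the kernel refuses to sleep and the call fails at once.
 * Returns the errno that was observed, or 0 if the call succeeded.
 */
static int probe_wait_requeue_pi(void)
{
	uint32_t cond_futex = 0;
	uint32_t pi_mutex = 0;
	long rc;

	errno = 0;
	rc = futex_wait_requeue_pi(&cond_futex, 1, NULL, &pi_mutex);
	return rc == -1 ? errno : 0;
}
```

The probe never blocks because the expected value deliberately mismatches
the futex word; on kernels with PI futex support the reported error is
typically EAGAIN.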

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks.  Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1).  Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter.  If it can, this waiter is
woken.  futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex.  It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.
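
The requeue logic described above can be pictured with a small user-space
toy model.  Everything in it (struct toy_pi_futex, toy_requeue_pi(),
toy_invariant_holds(), MAX_WAITERS) is hypothetical and only imitates the
decision "acquire on the waiter's behalf if the lock is free, otherwise
enqueue the waiter"; it is not kernel code and ignores locking, priorities
and the rt_mutex internals.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Toy stand-in for a PI futex backed by an rt_mutex (illustrative only). */
struct toy_pi_futex {
	int owner;			/* task id of the owner, 0 if unowned */
	int waiters[MAX_WAITERS];	/* task ids queued on the lock */
	size_t nr_waiters;
};

/*
 * Model of the requeue step.  cond_waiters lists the task ids blocked on
 * the condvar futex, top waiter first; at most nr of them are handled.
 * Returns the number of tasks woken, i.e. those the lock was acquired for.
 */
static int toy_requeue_pi(const int *cond_waiters, size_t nr,
			  struct toy_pi_futex *pi)
{
	int woken = 0;

	assert(pi != NULL);

	for (size_t i = 0; i < nr; i++) {
		if (pi->owner == 0) {
			/* Uncontended: take the lock for the waiter, then wake it. */
			pi->owner = cond_waiters[i];
			woken++;
		} else if (pi->nr_waiters < MAX_WAITERS) {
			/* Contended: queue the task on the lock; it stays asleep. */
			pi->waiters[pi->nr_waiters++] = cond_waiters[i];
		}
	}
	return woken;
}

/* The invariant this document is about: waiters imply an owner. */
static int toy_invariant_holds(const struct toy_pi_futex *pi)
{
	return pi->nr_waiters == 0 || pi->owner != 0;
}
```

In this model, requeueing three tasks onto an unowned lock wakes only the
first one, which becomes the owner, and queues the other two; requeueing
onto a lock that is already held wakes nobody.  In both cases the lock
never ends up with waiters but no owner.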

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters.  futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
signal.
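
The argument rules above can be sketched from the waker's side as
follows.  The struct requeue_pi_counts type and the helpers
pick_requeue_pi_counts() and futex_cmp_requeue_pi() are hypothetical names
introduced for illustration; the placement of nr_wake, nr_requeue and the
comparison value in the raw futex call follows the futex(2) man page.
This is a sketch under those assumptions, not the glibc code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical container for the two counts given to FUTEX_CMP_REQUEUE_PI. */
struct requeue_pi_counts {
	int nr_wake;
	int nr_requeue;
};

/* nr_wake is always 1; nr_requeue is INT_MAX for broadcast, 0 for signal. */
static struct requeue_pi_counts pick_requeue_pi_counts(int broadcast)
{
	struct requeue_pi_counts c = {
		.nr_wake = 1,
		.nr_requeue = broadcast ? INT_MAX : 0,
	};
	return c;
}

/*
 * Hypothetical waker-side wrapper: requeue waiters from cond_futex to
 * pi_mutex, provided the word at cond_futex still holds `expected`.
 * Returns the raw system call result (-1 with errno set on failure).
 */
static long futex_cmp_requeue_pi(uint32_t *cond_futex, uint32_t expected,
				 uint32_t *pi_mutex, int broadcast)
{
	struct requeue_pi_counts c = pick_requeue_pi_counts(broadcast);

	/* The operation requires nr_wake to be exactly 1. */
	assert(c.nr_wake == 1);

	return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
		       c.nr_wake, (long)c.nr_requeue, pi_mutex, expected);
}
```

A signal-style call would pass broadcast=0 and a broadcast-style call
broadcast=1; keeping the count selection in one helper makes the
nr_wake=1 requirement hard to violate by accident.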