================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; an ownerless rt_mutex with
waiters would break the PI boosting logic [see rt-mutex-design.txt].
For the purposes of brevity, this action will be referred to as
"requeue_pi" throughout this document. Priority inheritance is
abbreviated throughout as "PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

	/* caller must lock mutex */
	pthread_cond_wait(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while(...)
		unlock(cond->__data.__lock);
		lock(mutex);
	}

	pthread_cond_broadcast(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue(cond->__data.__futex, cond->mutex);
	}

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvars, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex. The glibc implementation
would be modified as follows::

	/* caller must lock mutex */
	pthread_cond_wait_pi(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait_requeue_pi(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while(...)
		unlock(cond->__data.__lock);
		/* the kernel acquired the mutex for us */
	}

	pthread_cond_broadcast_pi(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue_pi(cond->__data.__futex, cond->mutex);
	}

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code and the waiting code to be
able to acquire the rt_mutex before returning to user space. The
requeue code cannot simply wake the waiter and leave it to acquire
the rt_mutex, as that would open a race window between the requeue
call returning to user space and the waiter waking and starting to
run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex operations provide the kernel/user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks.
It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0, as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
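
The argument conventions above can be illustrated with thin user-space
wrappers around the raw futex system call. This is only a sketch, not the
glibc implementation: the wrapper names and the requeue_pi_nr_requeue()
helper are hypothetical, and the argument order follows the futex(2)
manual page (nr_requeue travels in the slot normally used for the
timeout, and the expected value of the condvar futex is passed last).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: nr_requeue is INT_MAX for broadcast, 0 for signal. */
int requeue_pi_nr_requeue(int broadcast)
{
	return broadcast ? INT_MAX : 0;
}

/*
 * Waiter side (hypothetical wrapper): block on cond_futex while it still
 * holds 'expected', and wait to be requeued to pi_mutex. On success the
 * caller returns to user space already holding pi_mutex.
 */
long futex_wait_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			   const struct timespec *abs_timeout,
			   uint32_t *pi_mutex)
{
	/* The source and target futex words must be different. */
	assert(cond_futex != pi_mutex);
	return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
		       (long)expected, abs_timeout, pi_mutex, 0L);
}

/*
 * Waker side (hypothetical wrapper): requeue waiters from cond_futex to
 * pi_mutex. nr_wake is fixed at 1, as FUTEX_CMP_REQUEUE_PI requires.
 */
long futex_cmp_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			  uint32_t *pi_mutex, int broadcast)
{
	long nr_wake = 1;
	long nr_requeue = requeue_pi_nr_requeue(broadcast);

	assert(cond_futex != pi_mutex);
	return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
		       nr_wake, nr_requeue, pi_mutex, (long)expected);
}
```

A signal-style caller would pass broadcast=0 and a broadcast-style caller
broadcast=1; in both cases the kernel wakes at most one task, and only if
it can acquire the PI futex on that task's behalf.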