======================
Lightweight PI-futexes
======================

We are calling them lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all. No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - even in the slowpath, the system call and scheduling pattern are very
   similar to those of normal futexes.

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation
   relatively simple: only a single owner may own a lock (i.e. no
   read-write lock support), only the owner may unlock a lock, no
   recursive locking, etc.

Priority Inheritance - why?
---------------------------

The short reply: user-space PI helps achieve or improve determinism for
user-space applications. In the best case, it can help achieve
determinism and well-bounded latencies. Even in the worst case, PI will
improve the statistical distribution of locking-related application
delays.

The longer reply
----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms. As we
can see in the kernel [which is quite a complex program in itself],
lockless structures are the exception rather than the norm - the current
ratio of lockless vs. locky code for shared data structures is somewhere
between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
algorithms often endangers the ability to do robust reviews of said code.
I.e. critical RT apps often choose lock structures to protect critical
data structures, instead of lockless algorithms. Furthermore, there are
cases (like shared hardware, or other resource limits) where lockless
access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application
design with multiple tasks (with multiple priority levels) sharing
short-held locks: for example, a high-prio audio playback thread is
combined with medium-prio construct-audio-data threads and low-prio
display-colory-stuff threads. Add video and decoding to the mix and
we've got even more priority levels.

So once we accept that synchronization objects (locks) are an
unavoidable fact of life, and once we accept that multi-task userspace
apps have a very fair expectation of being able to use locks, we've got
to think about how to offer the option of a deterministic locking
implementation to user-space.

Most of the technical counter-arguments against doing priority
inheritance only apply to kernel-space locks. But user-space locks are
different: there we cannot disable interrupts or make the task
non-preemptible in a critical section, so the 'use spinlocks' argument
does not apply (user-space spinlocks have the same priority inversion
problems as other user-space locking constructs). Fact is, pretty much
the only technique that currently enables good determinism for userspace
locks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task share a lock
[this is a quite common scenario for most non-trivial RT applications],
even if all critical sections are coded carefully to be deterministic
(i.e. all critical sections are short in duration and only execute a
limited number of instructions), the kernel cannot guarantee any
deterministic execution of the high-prio task: any medium-priority task
could preempt the low-prio task while it holds the shared lock and
executes the critical section, and could thereby delay the high-prio
task indefinitely.

Implementation
--------------

As mentioned before, the userspace fastpath of PI-enabled pthread
mutexes involves no kernel work at all - they behave quite similarly to
normal futex-based locks: a 0 value means unlocked, and a value==TID
means locked. (This is the same method as used by list-based robust
futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
entering the kernel.
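The fastpath described above can be sketched with C11 atomics. This is a
minimal illustration and not the code of any particular C library: the
helper names ``pi_trylock_fast`` and ``pi_unlock_fast`` are hypothetical,
and the caller is assumed to pass its own kernel thread ID as ``tid``.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: attempt the 0 -> TID transition.
 * Returns true if the lock was taken without entering the kernel.
 */
static bool pi_trylock_fast(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;          /* 0 means "unlocked" */

    assert(tid != 0);               /* 0 cannot identify an owner */
    return atomic_compare_exchange_strong_explicit(
        word, &expected, tid,
        memory_order_acquire, memory_order_relaxed);
}

/*
 * Hypothetical helper: attempt the TID -> 0 transition.
 * Returns false if the word holds anything other than exactly our TID,
 * in which case the caller would have to fall back to the slowpath.
 */
static bool pi_unlock_fast(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;

    return atomic_compare_exchange_strong_explicit(
        word, &expected, 0,
        memory_order_release, memory_order_relaxed);
}
```

Both helpers are a single compare-and-exchange, which is why the
uncontended case needs no system call.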

To handle the slowpath, we have added two new futex ops:

  - FUTEX_LOCK_PI
  - FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails [i.e. an atomic transition from 0 to
TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
remaining work: if there is no futex-queue attached to the futex address
yet then the code looks up the task that owns the futex [it has put its
own TID into the futex value], and attaches a 'PI state' structure to
the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
kernel-based synchronization object. The current owner (the 'other'
task) is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is
atomically set in the futex value. Then the calling task tries to lock
the rt-mutex, on which it blocks. Once it returns, it has acquired the
mutex, and it sets the futex value to its own TID and returns. Userspace
has no other work to perform - it now owns the lock, and the futex value
contains FUTEX_WAITERS|TID.
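To make the FUTEX_WAITERS|TID encoding concrete, the following sketch
decodes a futex word. The numeric values mirror FUTEX_WAITERS and
FUTEX_TID_MASK from the kernel's ``<linux/futex.h>`` uapi header, but
they are redefined here under local, hypothetical names (``PI_WAITERS``,
``PI_TID_MASK``) so that the example builds on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the uapi values (assumed to match <linux/futex.h>). */
#define PI_WAITERS  0x80000000u   /* FUTEX_WAITERS: kernel has queued waiters */
#define PI_TID_MASK 0x3fffffffu   /* FUTEX_TID_MASK: owner TID bits */

static_assert((PI_WAITERS & PI_TID_MASK) == 0,
              "the waiters flag must not overlap the TID bits");

/* Owner TID stored in the word, or 0 if the lock is free. */
static uint32_t pi_owner_tid(uint32_t word)
{
    return word & PI_TID_MASK;
}

/* True if the kernel has marked the word as having waiters. */
static bool pi_has_waiters(uint32_t word)
{
    return (word & PI_WAITERS) != 0;
}
```

A word with the waiters bit set can never equal a bare TID, so the
owner's TID -> 0 compare-and-exchange fails and the unlock is routed
into the kernel.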

If the unlock side fastpath succeeds [i.e. userspace manages to do a
TID -> 0 atomic transition of the futex value], then no kernel work is
triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on
behalf of userspace - and it also unlocks the attached
pi_state->rt_mutex and thus wakes up any potential waiters.
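Putting the fastpath and the two futex ops together, a lock/unlock pair
might look roughly like the sketch below. It is Linux-only and
illustrative: ``pi_lock``, ``pi_unlock`` and ``self_tid`` are
hypothetical names, the raw ``syscall()`` interface is used because the
C library provides no futex wrapper, and the error handling (retrying on
EAGAIN and EINTR) is a simplifying assumption rather than a statement of
how any real pthread implementation behaves.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>        /* FUTEX_LOCK_PI, FUTEX_UNLOCK_PI */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>        /* SYS_futex, SYS_gettid */
#include <unistd.h>

/* Kernel thread ID of the caller; this is what goes into the futex word. */
static uint32_t self_tid(void)
{
    return (uint32_t)syscall(SYS_gettid);
}

/* Returns 0 on success, otherwise an errno value. */
static int pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;

    assert(word != NULL);
    /* Fastpath: 0 -> TID, no kernel involvement. */
    if (atomic_compare_exchange_strong(word, &expected, self_tid()))
        return 0;

    /* Slowpath: the kernel blocks us until we own the lock. */
    while (syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0) == -1) {
        if (errno != EAGAIN && errno != EINTR)
            return errno;
    }
    return 0;
}

/* Returns 0 on success, otherwise an errno value. */
static int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = self_tid();

    assert(word != NULL);
    /* Fastpath: TID -> 0 succeeds only if no extra bits are set. */
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;

    /* Slowpath: let the kernel release the lock and wake a waiter. */
    if (syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0) == -1)
        return errno;
    return 0;
}
```

In the uncontended case both functions return from the fastpath, so a
lock/unlock pair issues no futex system call at all; the only kernel
entry in this sketch is the ``gettid`` lookup, which a real
implementation would normally cache.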

Note that under this approach, contrary to previous PI-futex approaches,
there is no prior 'registration' of a PI-futex (which is not quite
possible anyway, due to existing ABI properties of pthread mutexes).

Also, under this scheme, 'robustness' and 'PI' are two orthogonal
properties of futexes, and all four combinations are possible: futex,
robust-futex, PI-futex, robust+PI-futex.
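From the application side, these two properties are typically selected
through POSIX mutex attributes rather than through raw futex calls. The
sketch below assumes a C library that maps PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST onto PI and robust futexes (as glibc does on
Linux); ``make_mutexattr`` is a hypothetical helper name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for the protocol and robust attributes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical helper: initialise *attr with any of the four
 * combinations of PI and robustness. Returns 0 or an error number.
 */
static int make_mutexattr(pthread_mutexattr_t *attr, int want_pi,
                          int want_robust)
{
    int err;

    assert(attr != NULL);
    err = pthread_mutexattr_init(attr);
    if (err)
        return err;

    if (want_pi)
        err = pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_INHERIT);
    if (!err && want_robust)
        err = pthread_mutexattr_setrobust(attr, PTHREAD_MUTEX_ROBUST);

    if (err)
        pthread_mutexattr_destroy(attr);
    return err;
}
```

The resulting attribute object would then be passed to
``pthread_mutex_init()``; because the two settings are independent, the
same helper covers all four combinations.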

More details about priority inheritance can be found in
Documentation/locking/rt-mutex.rst.