xref: /OK3568_Linux_fs/kernel/Documentation/timers/hrtimers.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun======================================================
2*4882a593Smuzhiyunhrtimers - subsystem for high-resolution kernel timers
3*4882a593Smuzhiyun======================================================
4*4882a593Smuzhiyun
5*4882a593SmuzhiyunThis patch introduces a new subsystem for high-resolution kernel timers.
6*4882a593Smuzhiyun
7*4882a593SmuzhiyunOne might ask the question: we already have a timer subsystem
8*4882a593Smuzhiyun(kernel/timers.c), why do we need two timer subsystems? After a lot of
9*4882a593Smuzhiyunback and forth trying to integrate high-resolution and high-precision
10*4882a593Smuzhiyunfeatures into the existing timer framework, and after testing various
11*4882a593Smuzhiyunsuch high-resolution timer implementations in practice, we came to the
12*4882a593Smuzhiyunconclusion that the timer wheel code is fundamentally not suitable for
13*4882a593Smuzhiyunsuch an approach. We initially didn't believe this ('there must be a way
14*4882a593Smuzhiyunto solve this'), and spent a considerable effort trying to integrate
15*4882a593Smuzhiyunthings into the timer wheel, but we failed. In hindsight, there are
16*4882a593Smuzhiyunseveral reasons why such integration is hard/impossible:
17*4882a593Smuzhiyun
18*4882a593Smuzhiyun- the forced handling of low-resolution and high-resolution timers in
19*4882a593Smuzhiyun  the same way leads to a lot of compromises, macro magic and #ifdef
20*4882a593Smuzhiyun  mess. The timers.c code is very "tightly coded" around jiffies and
21*4882a593Smuzhiyun  32-bitness assumptions, and has been honed and micro-optimized for a
22*4882a593Smuzhiyun  relatively narrow use case (jiffies in a relatively narrow HZ range)
23*4882a593Smuzhiyun  for many years - and thus even small extensions to it easily break
24*4882a593Smuzhiyun  the wheel concept, leading to even worse compromises. The timer wheel
25*4882a593Smuzhiyun  code is very good and tight code, there's zero problems with it in its
26*4882a593Smuzhiyun  current usage - but it is simply not suitable to be extended for
27*4882a593Smuzhiyun  high-res timers.
28*4882a593Smuzhiyun
29*4882a593Smuzhiyun- the unpredictable [O(N)] overhead of cascading leads to delays which
30*4882a593Smuzhiyun  necessitate a more complex handling of high resolution timers, which
31*4882a593Smuzhiyun  in turn decreases robustness. Such a design still leads to rather large
32*4882a593Smuzhiyun  timing inaccuracies. Cascading is a fundamental property of the timer
33*4882a593Smuzhiyun  wheel concept, it cannot be 'designed out' without inevitably
34*4882a593Smuzhiyun  degrading other portions of the timers.c code in an unacceptable way.
35*4882a593Smuzhiyun
36*4882a593Smuzhiyun- the implementation of the current posix-timer subsystem on top of
37*4882a593Smuzhiyun  the timer wheel has already introduced a quite complex handling of
38*4882a593Smuzhiyun  the required readjusting of absolute CLOCK_REALTIME timers at
39*4882a593Smuzhiyun  settimeofday or NTP time - further underlying our experience by
40*4882a593Smuzhiyun  example: that the timer wheel data structure is too rigid for high-res
41*4882a593Smuzhiyun  timers.
42*4882a593Smuzhiyun
43*4882a593Smuzhiyun- the timer wheel code is most optimal for use cases which can be
44*4882a593Smuzhiyun  identified as "timeouts". Such timeouts are usually set up to cover
45*4882a593Smuzhiyun  error conditions in various I/O paths, such as networking and block
46*4882a593Smuzhiyun  I/O. The vast majority of those timers never expire and are rarely
47*4882a593Smuzhiyun  recascaded because the expected correct event arrives in time so they
48*4882a593Smuzhiyun  can be removed from the timer wheel before any further processing of
49*4882a593Smuzhiyun  them becomes necessary. Thus the users of these timeouts can accept
50*4882a593Smuzhiyun  the granularity and precision tradeoffs of the timer wheel, and
51*4882a593Smuzhiyun  largely expect the timer subsystem to have near-zero overhead.
52*4882a593Smuzhiyun  Accurate timing for them is not a core purpose - in fact most of the
53*4882a593Smuzhiyun  timeout values used are ad-hoc. For them it is at most a necessary
54*4882a593Smuzhiyun  evil to guarantee the processing of actual timeout completions
55*4882a593Smuzhiyun  (because most of the timeouts are deleted before completion), which
56*4882a593Smuzhiyun  should thus be as cheap and unintrusive as possible.
57*4882a593Smuzhiyun
58*4882a593SmuzhiyunThe primary users of precision timers are user-space applications that
59*4882a593Smuzhiyunutilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
60*4882a593Smuzhiyunusers like drivers and subsystems which require precise timed events
61*4882a593Smuzhiyun(e.g. multimedia) can benefit from the availability of a separate
62*4882a593Smuzhiyunhigh-resolution timer subsystem as well.
63*4882a593Smuzhiyun
64*4882a593SmuzhiyunWhile this subsystem does not offer high-resolution clock sources just
65*4882a593Smuzhiyunyet, the hrtimer subsystem can be easily extended with high-resolution
66*4882a593Smuzhiyunclock capabilities, and patches for that exist and are maturing quickly.
67*4882a593SmuzhiyunThe increasing demand for realtime and multimedia applications along
68*4882a593Smuzhiyunwith other potential users for precise timers gives another reason to
69*4882a593Smuzhiyunseparate the "timeout" and "precise timer" subsystems.
70*4882a593Smuzhiyun
71*4882a593SmuzhiyunAnother potential benefit is that such a separation allows even more
72*4882a593Smuzhiyunspecial-purpose optimization of the existing timer wheel for the low
73*4882a593Smuzhiyunresolution and low precision use cases - once the precision-sensitive
74*4882a593SmuzhiyunAPIs are separated from the timer wheel and are migrated over to
75*4882a593Smuzhiyunhrtimers. E.g. we could decrease the frequency of the timeout subsystem
76*4882a593Smuzhiyunfrom 250 Hz to 100 HZ (or even smaller).
77*4882a593Smuzhiyun
78*4882a593Smuzhiyunhrtimer subsystem implementation details
79*4882a593Smuzhiyun----------------------------------------
80*4882a593Smuzhiyun
81*4882a593Smuzhiyunthe basic design considerations were:
82*4882a593Smuzhiyun
83*4882a593Smuzhiyun- simplicity
84*4882a593Smuzhiyun
85*4882a593Smuzhiyun- data structure not bound to jiffies or any other granularity. All the
86*4882a593Smuzhiyun  kernel logic works at 64-bit nanoseconds resolution - no compromises.
87*4882a593Smuzhiyun
88*4882a593Smuzhiyun- simplification of existing, timing related kernel code
89*4882a593Smuzhiyun
90*4882a593Smuzhiyunanother basic requirement was the immediate enqueueing and ordering of
91*4882a593Smuzhiyuntimers at activation time. After looking at several possible solutions
92*4882a593Smuzhiyunsuch as radix trees and hashes, we chose the red black tree as the basic
93*4882a593Smuzhiyundata structure. Rbtrees are available as a library in the kernel and are
94*4882a593Smuzhiyunused in various performance-critical areas of e.g. memory management and
95*4882a593Smuzhiyunfile systems. The rbtree is solely used for time sorted ordering, while
96*4882a593Smuzhiyuna separate list is used to give the expiry code fast access to the
97*4882a593Smuzhiyunqueued timers, without having to walk the rbtree.
98*4882a593Smuzhiyun
99*4882a593Smuzhiyun(This separate list is also useful for later when we'll introduce
100*4882a593Smuzhiyunhigh-resolution clocks, where we need separate pending and expired
101*4882a593Smuzhiyunqueues while keeping the time-order intact.)
102*4882a593Smuzhiyun
103*4882a593SmuzhiyunTime-ordered enqueueing is not purely for the purposes of
104*4882a593Smuzhiyunhigh-resolution clocks though, it also simplifies the handling of
105*4882a593Smuzhiyunabsolute timers based on a low-resolution CLOCK_REALTIME. The existing
106*4882a593Smuzhiyunimplementation needed to keep an extra list of all armed absolute
107*4882a593SmuzhiyunCLOCK_REALTIME timers along with complex locking. In case of
108*4882a593Smuzhiyunsettimeofday and NTP, all the timers (!) had to be dequeued, the
109*4882a593Smuzhiyuntime-changing code had to fix them up one by one, and all of them had to
110*4882a593Smuzhiyunbe enqueued again. The time-ordered enqueueing and the storage of the
111*4882a593Smuzhiyunexpiry time in absolute time units removes all this complex and poorly
112*4882a593Smuzhiyunscaling code from the posix-timer implementation - the clock can simply
113*4882a593Smuzhiyunbe set without having to touch the rbtree. This also makes the handling
114*4882a593Smuzhiyunof posix-timers simpler in general.
115*4882a593Smuzhiyun
116*4882a593SmuzhiyunThe locking and per-CPU behavior of hrtimers was mostly taken from the
117*4882a593Smuzhiyunexisting timer wheel code, as it is mature and well suited. Sharing code
118*4882a593Smuzhiyunwas not really a win, due to the different data structures. Also, the
119*4882a593Smuzhiyunhrtimer functions now have clearer behavior and clearer names - such as
120*4882a593Smuzhiyunhrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
121*4882a593Smuzhiyunequivalent to del_timer() and del_timer_sync()] - so there's no direct
122*4882a593Smuzhiyun1:1 mapping between them on the algorithmic level, and thus no real
123*4882a593Smuzhiyunpotential for code sharing either.
124*4882a593Smuzhiyun
125*4882a593SmuzhiyunBasic data types: every time value, absolute or relative, is in a
126*4882a593Smuzhiyunspecial nanosecond-resolution type: ktime_t. The kernel-internal
127*4882a593Smuzhiyunrepresentation of ktime_t values and operations is implemented via
128*4882a593Smuzhiyunmacros and inline functions, and can be switched between a "hybrid
129*4882a593Smuzhiyununion" type and a plain "scalar" 64bit nanoseconds representation (at
130*4882a593Smuzhiyuncompile time). The hybrid union type optimizes time conversions on 32bit
131*4882a593SmuzhiyunCPUs. This build-time-selectable ktime_t storage format was implemented
132*4882a593Smuzhiyunto avoid the performance impact of 64-bit multiplications and divisions
133*4882a593Smuzhiyunon 32bit CPUs. Such operations are frequently necessary to convert
134*4882a593Smuzhiyunbetween the storage formats provided by kernel and userspace interfaces
135*4882a593Smuzhiyunand the internal time format. (See include/linux/ktime.h for further
136*4882a593Smuzhiyundetails.)
137*4882a593Smuzhiyun
138*4882a593Smuzhiyunhrtimers - rounding of timer values
139*4882a593Smuzhiyun-----------------------------------
140*4882a593Smuzhiyun
141*4882a593Smuzhiyunthe hrtimer code will round timer events to lower-resolution clocks
142*4882a593Smuzhiyunbecause it has to. Otherwise it will do no artificial rounding at all.
143*4882a593Smuzhiyun
144*4882a593Smuzhiyunone question is, what resolution value should be returned to the user by
145*4882a593Smuzhiyunthe clock_getres() interface. This will return whatever real resolution
146*4882a593Smuzhiyuna given clock has - be it low-res, high-res, or artificially-low-res.
147*4882a593Smuzhiyun
148*4882a593Smuzhiyunhrtimers - testing and verification
149*4882a593Smuzhiyun-----------------------------------
150*4882a593Smuzhiyun
151*4882a593SmuzhiyunWe used the high-resolution clock subsystem ontop of hrtimers to verify
152*4882a593Smuzhiyunthe hrtimer implementation details in praxis, and we also ran the posix
153*4882a593Smuzhiyuntimer tests in order to ensure specification compliance. We also ran
154*4882a593Smuzhiyuntests on low-resolution clocks.
155*4882a593Smuzhiyun
156*4882a593SmuzhiyunThe hrtimer patch converts the following kernel functionality to use
157*4882a593Smuzhiyunhrtimers:
158*4882a593Smuzhiyun
159*4882a593Smuzhiyun - nanosleep
160*4882a593Smuzhiyun - itimers
161*4882a593Smuzhiyun - posix-timers
162*4882a593Smuzhiyun
163*4882a593SmuzhiyunThe conversion of nanosleep and posix-timers enabled the unification of
164*4882a593Smuzhiyunnanosleep and clock_nanosleep.
165*4882a593Smuzhiyun
166*4882a593SmuzhiyunThe code was successfully compiled for the following platforms:
167*4882a593Smuzhiyun
168*4882a593Smuzhiyun i386, x86_64, ARM, PPC, PPC64, IA64
169*4882a593Smuzhiyun
170*4882a593SmuzhiyunThe code was run-tested on the following platforms:
171*4882a593Smuzhiyun
172*4882a593Smuzhiyun i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
173*4882a593Smuzhiyun
174*4882a593Smuzhiyunhrtimers were also integrated into the -rt tree, along with a
175*4882a593Smuzhiyunhrtimers-based high-resolution clock implementation, so the hrtimers
176*4882a593Smuzhiyuncode got a healthy amount of testing and use in practice.
177*4882a593Smuzhiyun
178*4882a593Smuzhiyun	Thomas Gleixner, Ingo Molnar
179