=====================================================
High resolution timers and dynamic ticks design notes
=====================================================

Further information can be found in the paper of the OLS 2006 talk "hrtimers
and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which is
available at:
https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf

The slides for this talk are available from:
http://www.cs.columbia.edu/~nahum/w6998/papers/ols2006-hrtimers-slides.pdf

The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
design of the Linux time(r) system before hrtimers and other building blocks
were merged into mainline.

Note: the paper and the slides refer to "clock event sources", while the name
has since been changed to "clock event devices".

The design contains the following basic building blocks:

- hrtimer base infrastructure
- timeofday and clock source management
- clock event management
- high resolution timer functionality
- dynamic ticks


hrtimer base infrastructure
---------------------------

The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
the base implementation are covered in Documentation/timers/hrtimers.rst. See
also figure #2 (OLS slides p. 15).

The main differences from the timer wheel, which holds the armed timer_list
type timers, are:

       - time ordered enqueueing into an rb-tree
       - independent of ticks (the processing is based on nanoseconds)
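
The ordering rule can be sketched in user-space C as follows. This is only an
illustration: a sorted singly linked list stands in for the rb-tree, and
struct sketch_timer and sketch_enqueue() are hypothetical names, not kernel
interfaces. Expiry times are absolute nanosecond values, so no tick arithmetic
is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified timer node; not the kernel's struct hrtimer. */
struct sketch_timer {
	int64_t expires_ns;		/* absolute expiry time in nanoseconds */
	struct sketch_timer *next;
};

/*
 * Insert 'timer' so that the queue stays sorted by expiry time.
 * Returns 1 if the new timer became the earliest entry, 0 otherwise.
 * A sorted list is used here instead of an rb-tree to keep the sketch
 * short; the ordering rule is the same.
 */
int sketch_enqueue(struct sketch_timer **head, struct sketch_timer *timer)
{
	struct sketch_timer **link = head;

	/* Skip all entries that expire before or at the same time. */
	while (*link && (*link)->expires_ns <= timer->expires_ns)
		link = &(*link)->next;

	timer->next = *link;
	*link = timer;
	return link == head;
}
```

The return value tells the caller whether the new timer is now the first one
to expire.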


timeofday and clock source management
-------------------------------------

John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
code out of the architecture-specific areas into a generic management
framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
specific portion is reduced to the low level hardware details of the clock
sources, which are registered in the framework and selected based on their
quality. The low level code provides hardware setup and readout routines and
initializes data structures, which are used by the generic timekeeping code to
convert the clock ticks to nanosecond based time values. All other timekeeping
related functionality is moved into the generic code. The GTOD base patch was
merged into the 2.6.18 kernel.
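
As a rough illustration of the two generic tasks named above, the sketch below
selects the clock source with the highest quality rating and converts a cycle
count to nanoseconds with a multiply/shift pair. struct sketch_clocksource,
sketch_select() and sketch_cyc2ns() are hypothetical, simplified stand-ins
written for this note; they are not the kernel's data structures or functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified clock source description. */
struct sketch_clocksource {
	const char *name;
	int rating;		/* higher value means better quality */
	uint32_t mult;		/* nanoseconds per cycle, scaled by 2^shift */
	uint32_t shift;
};

/*
 * Convert a cycle count to nanoseconds: ns = (cycles * mult) >> shift.
 * The product can overflow for very large cycle counts; this sketch
 * ignores that case.
 */
uint64_t sketch_cyc2ns(const struct sketch_clocksource *cs, uint64_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}

/* Pick the registered clock source with the highest rating. */
const struct sketch_clocksource *
sketch_select(const struct sketch_clocksource *list, size_t n)
{
	const struct sketch_clocksource *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!best || list[i].rating > best->rating)
			best = &list[i];
	}
	return best;
}
```

For a 1 MHz counter (1000 ns per cycle) and a shift of 10, mult would be
1000 << 10, so five cycles convert to 5000 ns.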

Further information about the Generic Time Of Day framework is available in the
OLS 2005 Proceedings Volume 1:

	http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf

The paper "We Are Not Getting Any Younger: A New Approach to Time and
Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.

Figure #3 (OLS slides p. 18) illustrates the transformation.


clock event management
----------------------

While clock sources provide read access to the monotonically increasing time
value, clock event devices are used to schedule the next event
interrupt(s). The next event is currently defined to be periodic, with its
period defined at compile time. The setup and selection of the event device
for various event driven functionalities are hardwired into the architecture
dependent code. This results in duplicated code across all architectures and
makes it extremely difficult to change the configuration of the system to use
event interrupt devices other than those already built into the
architecture. Another implication of the current design is that it is necessary
to touch all the architecture-specific implementations in order to provide new
functionality like high resolution timers or dynamic ticks.

The clock events subsystem tries to address this problem by providing a generic
solution to manage clock event devices and their usage for the various clock
event driven kernel functionalities. The goal of the clock event subsystem is
to reduce the clock event related architecture dependent code to the pure
hardware related handling and to allow easy addition and utilization of new
clock event devices. It also minimizes the duplicated code across the
architectures as it provides generic functionality down to the interrupt
service handler, which is almost inherently hardware dependent.

Clock event devices are registered either by the architecture dependent boot
code or at module insertion time. Each clock event device fills a data
structure with clock-specific property parameters and callback functions. Based
on these property parameters, the clock event management layer decides which
set of system functions a clock event device will be used to support. This
includes the distinction between per-CPU and system-wide global event devices.

System-level global event devices are used for the Linux periodic tick. Per-CPU
event devices are used to provide local CPU functionality such as process
accounting, profiling, and high resolution timers.

The management layer assigns one or more of the following functions to a clock
event device:

      - system global periodic tick (jiffies update)
      - CPU local update_process_times
      - CPU local profiling
      - CPU local next event interrupt (non periodic mode)

The clock event device delegates the selection of those timer interrupt related
functions completely to the management layer. The clock event management layer
stores a function pointer in the device description structure, which has to be
called from the hardware level handler. This removes a lot of duplicated code
from the architecture specific timer interrupt handlers and hands the control
over the clock event devices and the assignment of timer interrupt related
functionality to the core code.
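
The delegation described above might be pictured as in the following
user-space sketch. All names (struct sketch_event_device,
sketch_assign_handler(), sketch_timer_interrupt(), sketch_periodic_tick())
are hypothetical and chosen for illustration; the real device structure
carries many more fields, and hardware acknowledgement is omitted.

```c
#include <stddef.h>

struct sketch_event_device;

typedef void (*sketch_handler_t)(struct sketch_event_device *dev);

/* Hypothetical, simplified clock event device description. */
struct sketch_event_device {
	const char *name;
	sketch_handler_t event_handler;	/* set by the management layer */
	unsigned long events_handled;
};

/* One possible function the management layer can assign to a device. */
void sketch_periodic_tick(struct sketch_event_device *dev)
{
	dev->events_handled++;	/* stand-in for the jiffies update */
}

/* Management layer: decides which function the device serves. */
void sketch_assign_handler(struct sketch_event_device *dev,
			   sketch_handler_t handler)
{
	dev->event_handler = handler;
}

/*
 * Hardware level interrupt handler: it does not know what the event is
 * used for and only calls the pointer stored in the device structure.
 */
void sketch_timer_interrupt(struct sketch_event_device *dev)
{
	if (dev->event_handler)
		dev->event_handler(dev);
}
```

Switching a device from periodic to one-shot operation would then only
require the management layer to store a different handler; the interrupt
handler itself stays unchanged.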

The clock event layer API is rather small. Aside from the clock event device
registration interface, it provides functions to schedule the next event
interrupt, a clock event device notification service, and support for suspend
and resume.

The framework adds about 700 lines of code, which results in a 2KB increase of
the kernel binary size. The conversion of i386 removes about 100 lines of
code. The binary size decrease is in the range of 400 bytes. We believe that
the increase in flexibility and the avoidance of duplicated code across
architectures justify the slight increase of the binary size.

The conversion of an architecture has no functional impact, but makes it
possible to use the high resolution and dynamic tick functionalities without
any change to the clock event device and timer interrupt code. After the
conversion, high resolution timers and dynamic ticks are enabled simply by
adding the kernel/time/Kconfig file to the architecture specific Kconfig and
adding the dynamic tick specific calls to the idle routine (a total of 3 lines
added to the idle function and the Kconfig file).

Figure #4 (OLS slides p. 20) illustrates the transformation.


high resolution timer functionality
-----------------------------------

During system boot it is not possible to use the high resolution timer
functionality; making it possible would be difficult and would serve no
useful function. The clock event device framework, the clock source framework
(GTOD) and hrtimers themselves have to be initialized, and appropriate clock
sources and clock event devices have to be registered, before the high
resolution functionality can work. Up to the point where hrtimers are
initialized, the system works in the usual low resolution periodic mode. The
clock source and the clock event device layers provide notification functions
which inform hrtimers about availability of new hardware. hrtimers validates
the usability of the registered clock sources and clock event devices before
switching to high resolution mode. This also ensures that a kernel which is
configured for high resolution timers can run on a system which lacks the
necessary hardware support.

The high resolution timer code does not support SMP machines which have only
global clock event devices. Supporting such hardware would involve IPI calls
when an interrupt happens, and the overhead would be much larger than the
benefit. This is the reason why we currently disable high resolution and
dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
state. A workaround is available as an idea, but the problem has not been
tackled yet.

The time ordered insertion of timers provides all the infrastructure to decide
whether the event device has to be reprogrammed when a timer is added. The
decision is made per timer base and synchronized across per-CPU timer bases in
a support function. The design allows the system to utilize separate per-CPU
clock event devices for the per-CPU timer bases, but currently only one
reprogrammable clock event device per CPU is utilized.
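
The reprogramming decision for a single timer base might look like the sketch
below. struct sketch_base, SKETCH_NO_EVENT, sketch_base_init() and
sketch_timer_added() are hypothetical names used for illustration; the counter
stands in for the actual write to the clock event hardware, and the
synchronization across per-CPU bases is left out.

```c
#include <stdint.h>

#define SKETCH_NO_EVENT INT64_MAX	/* nothing programmed yet */

/* Hypothetical per-CPU timer base, reduced to what the decision needs. */
struct sketch_base {
	int64_t next_event_ns;		/* expiry programmed into the device */
	unsigned int reprogram_count;	/* number of simulated reprograms */
};

void sketch_base_init(struct sketch_base *base)
{
	base->next_event_ns = SKETCH_NO_EVENT;
	base->reprogram_count = 0;
}

/*
 * Called after a timer with expiry 'expires_ns' has been enqueued.
 * The device only needs to be touched if the new timer expires before
 * the event that is currently programmed. Returns 1 if it was
 * reprogrammed, 0 otherwise.
 */
int sketch_timer_added(struct sketch_base *base, int64_t expires_ns)
{
	if (expires_ns >= base->next_event_ns)
		return 0;	/* an earlier event is already programmed */

	base->next_event_ns = expires_ns;
	base->reprogram_count++;	/* stand-in for the hardware write */
	return 1;
}
```

Adding timers that expire later than the programmed event therefore costs no
hardware access at all.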

When the timer interrupt happens, the next event interrupt handler is called
from the clock event distribution code. It moves expired timers from the
red-black tree to a separate doubly linked list and invokes the softirq
handler. An additional mode field in the hrtimer structure allows the system to
execute callback functions directly from the next event interrupt handler. This
is restricted to code which can safely be executed in the hard interrupt
context. This applies, for example, to the common case of a wakeup function as
used by nanosleep. The advantage of executing the handler in the interrupt
context is the avoidance of up to two context switches: from the interrupted
context to the softirq, and from there to the task which is woken up by the
expired timer.
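
The split between callbacks run directly in the interrupt handler and those
deferred to the softirq could be sketched as follows. The enum, the structure
and the function names are hypothetical and exist only for this illustration;
an array replaces the red-black tree and a flag replaces the pending list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback modes, named for this sketch only. */
enum sketch_cb_mode {
	SKETCH_CB_SOFTIRQ,		/* callback must run from the softirq */
	SKETCH_CB_HARDIRQ_SAFE,		/* callback may run in hard irq context */
};

struct sketch_hrtimer {
	int64_t expires_ns;
	enum sketch_cb_mode mode;
	void (*function)(struct sketch_hrtimer *timer);
	int fired;			/* set once the callback has run */
	int deferred;			/* set once queued for the softirq */
};

/* Example callback: stands in for waking up a sleeping task. */
void sketch_wakeup(struct sketch_hrtimer *timer)
{
	timer->fired = 1;
}

/*
 * Next event interrupt handler: run hard-irq-safe callbacks of expired
 * timers immediately and mark all other expired timers for the softirq.
 * Returns the number of timers deferred by this call.
 */
size_t sketch_handle_expired(struct sketch_hrtimer *timers, size_t n,
			     int64_t now_ns)
{
	size_t deferred = 0;

	for (size_t i = 0; i < n; i++) {
		struct sketch_hrtimer *t = &timers[i];

		if (t->expires_ns > now_ns || t->fired || t->deferred)
			continue;	/* not expired or already handled */

		if (t->mode == SKETCH_CB_HARDIRQ_SAFE) {
			t->function(t);
		} else {
			t->deferred = 1;
			deferred++;
		}
	}
	return deferred;
}
```

A non-zero return value corresponds to the case where the softirq has to be
raised to process the remaining callbacks.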

Once a system has switched to high resolution mode, the periodic tick is
switched off. This disables the per system global periodic clock event device,
e.g. the PIT on i386 SMP systems.

The periodic tick functionality is provided by a per-CPU hrtimer. The callback
function is executed in the next event interrupt context; it updates jiffies
and calls update_process_times and profiling. The implementation of the hrtimer
based periodic tick is designed to be extended with dynamic tick functionality.
This makes it possible to use a single clock event device to schedule high
resolution timer and periodic events (jiffies tick, profiling, process
accounting) on UP systems. This has been proven to work with the PIT on i386
and the Incrementer on PPC.

The softirq for running the hrtimer queues and executing the callbacks has been
separated from the tick bound timer softirq to allow accurate delivery of high
resolution timer signals which are used by itimer and POSIX interval
timers. The execution of this softirq can still be delayed by other softirqs,
but the overall latencies have been significantly improved by this separation.

Figure #5 (OLS slides p. 22) illustrates the transformation.


dynamic ticks
-------------

Dynamic ticks are the logical consequence of the hrtimer based periodic tick
replacement (sched_tick). The functionality of the sched_tick hrtimer is
extended by three functions:

- hrtimer_stop_sched_tick
- hrtimer_restart_sched_tick
- hrtimer_update_jiffies

hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
evaluates the next scheduled timer event (from both hrtimers and the timer
wheel). If the next event is further away than the next tick, it reprograms
the sched_tick to this future event, to allow longer idle sleeps without
needless interruption by the periodic tick. The function is also called when
an interrupt which does not cause a reschedule happens during the idle
period. The call is necessary as the interrupt handler might have armed a new
timer whose expiry time is before the time which was identified as the nearest
event in the previous call to hrtimer_stop_sched_tick().
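
The decision made on idle entry can be illustrated with a small user-space
sketch. sketch_next_wakeup() and its parameters are hypothetical; the function
only models the comparison between the next tick and the nearest pending
timer, not the actual reprogramming of the tick.

```c
#include <stdint.h>

/*
 * Return the time (in nanoseconds) for which the tick should be
 * programmed when the CPU goes idle. 'next_hrtimer_ns' and
 * 'next_wheel_ns' are the nearest expiry times found in the hrtimer
 * queue and in the timer wheel.
 */
int64_t sketch_next_wakeup(int64_t now_ns, int64_t tick_period_ns,
			   int64_t next_hrtimer_ns, int64_t next_wheel_ns)
{
	int64_t next_tick = now_ns + tick_period_ns;
	int64_t next_event = next_hrtimer_ns < next_wheel_ns ?
			     next_hrtimer_ns : next_wheel_ns;

	/* Only skip ticks if nothing is due before the next tick. */
	return next_event > next_tick ? next_event : next_tick;
}
```

With a 4 ms tick period and the nearest timer 20 ms away, the CPU can sleep
for the full 20 ms instead of being woken by four idle ticks in between.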

hrtimer_restart_sched_tick() is called when the CPU leaves the idle state,
before it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic
tick, which is kept active until the next call to hrtimer_stop_sched_tick().

hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
in the idle period to make sure that jiffies are up to date and the interrupt
handler does not have to deal with a possibly stale jiffies value.
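
Catching up on ticks that were skipped during an idle sleep might be sketched
as below. struct sketch_tick_state and sketch_update_jiffies() are
hypothetical names; the sketch assumes that the time of the last jiffies
update is tracked in nanoseconds and ignores locking.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for the jiffies update. */
struct sketch_tick_state {
	int64_t last_update_ns;	/* time accounted for by 'jiffies' */
	uint64_t jiffies;
};

/*
 * Advance jiffies by the number of whole tick periods that have
 * elapsed since the last update. Does nothing if less than one
 * period has passed.
 */
void sketch_update_jiffies(struct sketch_tick_state *ts, int64_t now_ns,
			   int64_t tick_period_ns)
{
	int64_t delta = now_ns - ts->last_update_ns;

	if (delta < tick_period_ns)
		return;

	int64_t ticks = delta / tick_period_ns;

	ts->jiffies += (uint64_t)ticks;
	ts->last_update_ns += ticks * tick_period_ns;
}
```

Only whole periods are accounted, so the remainder of a partially elapsed
period is carried over to the next update.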

The dynamic tick feature provides statistical values which are exported to
userspace via /proc/stat and can be made available for enhanced power
management control.

The implementation leaves room for further development like full tickless
systems, where the time slice is controlled by the scheduler, variable
frequency profiling, and a complete removal of jiffies in the future.


Aside from the current initial submission of i386 support, the patchset has
already been extended to x86_64 and ARM. Initial (work in progress) support is
also available for MIPS and PowerPC.

	  Thomas, Ingo