=====================================================
High resolution timers and dynamic ticks design notes
=====================================================

Further information can be found in the paper of the OLS 2006 talk "hrtimers
and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
be found on the OLS website:
https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf

The slides to this talk are available from:
http://www.cs.columbia.edu/~nahum/w6998/papers/ols2006-hrtimers-slides.pdf

The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
design of the Linux time(r) system before hrtimers and other building blocks
got merged into mainline.

Note: the paper and the slides talk about "clock event source", while we
switched to the name "clock event devices" in the meantime.

The design contains the following basic building blocks:

- hrtimer base infrastructure
- timeofday and clock source management
- clock event management
- high resolution timer functionality
- dynamic ticks


hrtimer base infrastructure
---------------------------

The hrtimer base infrastructure was merged into the 2.6.16 kernel.
Details of the base implementation are covered in
Documentation/timers/hrtimers.rst. See also figure #2 (OLS slides p. 15).

The main differences to the timer wheel, which holds the armed timer_list type
timers, are:

  - time ordered enqueueing into an rb-tree
  - independent of ticks (the processing is based on nanoseconds)


timeofday and clock source management
-------------------------------------

John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
code out of the architecture-specific areas into a generic management
framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
specific portion is reduced to the low level hardware details of the clock
sources, which are registered in the framework and selected on a quality based
decision. The low level code provides hardware setup and readout routines and
initializes data structures, which are used by the generic time keeping code to
convert the clock ticks to nanosecond based time values. All other time keeping
related functionality is moved into the generic code. The GTOD base patch got
merged into the 2.6.18 kernel.
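
The conversion of clock ticks to nanoseconds and the quality based selection
described above can be sketched in a few lines of plain C. This is only an
illustration: the structure and the names ``demo_clocksource``,
``demo_cyc2ns`` and ``demo_cs_select`` are hypothetical and are not the
kernel's actual definitions. The sketch assumes a fixed point multiply and
shift conversion and a simple integer rating.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified clock source description (not the kernel's). */
struct demo_clocksource {
	const char *name;
	int rating;              /* higher means better quality */
	uint64_t (*read)(void);  /* hardware readout routine */
	uint32_t mult;           /* fixed point multiplier */
	uint32_t shift;          /* fixed point shift */
};

/*
 * Convert a cycle delta to nanoseconds: ns = (cycles * mult) >> shift.
 * The multiplication can overflow for very large deltas; a real
 * implementation has to bound the delta accordingly.
 */
static uint64_t demo_cyc2ns(const struct demo_clocksource *cs, uint64_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}

/* Pick the clock source with the highest rating, or NULL if none exist. */
static const struct demo_clocksource *
demo_cs_select(const struct demo_clocksource *list, size_t n)
{
	const struct demo_clocksource *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (!best || list[i].rating > best->rating)
			best = &list[i];
	return best;
}
```

For a 1 MHz counter, one cycle corresponds to 1000 ns, so with ``shift = 10``
a matching multiplier would be ``1000 << 10``.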

Further information about the Generic Time Of Day framework is available in the
OLS 2005 Proceedings Volume 1:

  http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf

The paper "We Are Not Getting Any Younger: A New Approach to Time and
Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.

Figure #3 (OLS slides p. 18) illustrates the transformation.


clock event management
----------------------

While clock sources provide read access to the monotonically increasing time
value, clock event devices are used to schedule the next event
interrupt(s). The next event is currently defined to be periodic, with its
period defined at compile time. The setup and selection of the event device
for various event driven functionalities is hardwired into the architecture
dependent code. This results in duplicated code across all architectures and
makes it extremely difficult to change the configuration of the system to use
event interrupt devices other than those already built into the
architecture. Another implication of the current design is that it is necessary
to touch all the architecture-specific implementations in order to provide new
functionality like high resolution timers or dynamic ticks.

The clock events subsystem tries to address this problem by providing a generic
solution to manage clock event devices and their usage for the various clock
event driven kernel functionalities. The goal of the clock event subsystem is
to minimize the clock event related architecture dependent code to the pure
hardware related handling and to allow easy addition and utilization of new
clock event devices. It also minimizes the duplicated code across the
architectures as it provides generic functionality down to the interrupt
service handler, which is almost inherently hardware dependent.

Clock event devices are registered either by the architecture dependent boot
code or at module insertion time. Each clock event device fills a data
structure with clock-specific property parameters and callback functions. The
clock event management decides, by using the specified property parameters, the
set of system functions a clock event device will be used to support. This
includes the distinction of per-CPU and per-system global event devices.

System-level global event devices are used for the Linux periodic tick. Per-CPU
event devices are used to provide local CPU functionality such as process
accounting, profiling, and high resolution timers.
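
The registration and property based selection described above can be
illustrated with a small, self-contained C sketch. Everything in it is an
assumption made for illustration: the names ``demo_event_device``,
``demo_register`` and ``demo_ced_select``, the feature flags and the fixed
size registry are hypothetical and do not reflect the actual kernel
interface.

```c
#include <stddef.h>

#define DEMO_FEAT_PERIODIC 0x1u  /* can fire at a fixed period */
#define DEMO_FEAT_ONESHOT  0x2u  /* can be programmed for a single event */
#define DEMO_CPU_GLOBAL    (-1)  /* device is not bound to one CPU */
#define DEMO_MAX_DEVICES   8

/* Hypothetical device description filled in by the low level code. */
struct demo_event_device {
	const char *name;
	unsigned int features;  /* DEMO_FEAT_* property flags */
	int rating;             /* higher is preferred */
	int cpu;                /* CPU number or DEMO_CPU_GLOBAL */
	int (*set_next_event)(unsigned long delta);
};

static struct demo_event_device *demo_devices[DEMO_MAX_DEVICES];
static size_t demo_count;

/* Registration entry point, called from boot code or a module. */
static int demo_register(struct demo_event_device *dev)
{
	if (demo_count == DEMO_MAX_DEVICES)
		return -1;
	demo_devices[demo_count++] = dev;
	return 0;
}

/* Best rated device for a CPU (or global) that offers the wanted feature. */
static struct demo_event_device *demo_ced_select(int cpu, unsigned int feature)
{
	struct demo_event_device *best = NULL;
	size_t i;

	for (i = 0; i < demo_count; i++) {
		struct demo_event_device *dev = demo_devices[i];

		if (dev->cpu != cpu || !(dev->features & feature))
			continue;
		if (!best || dev->rating > best->rating)
			best = dev;
	}
	return best;
}
```

In this sketch a global device with the periodic feature would be chosen for
the system tick, while a per-CPU device with the one-shot feature would be
chosen for local next event interrupts.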

The management layer assigns one or more of the following functions to a clock
event device:

  - system global periodic tick (jiffies update)
  - CPU local update_process_times
  - CPU local profiling
  - CPU local next event interrupt (non periodic mode)

The clock event device delegates the selection of those timer interrupt related
functions completely to the management layer. The clock management layer stores
a function pointer in the device description structure, which has to be called
from the hardware level handler. This removes a lot of duplicated code from the
architecture specific timer interrupt handlers and hands the control over the
clock event devices and the assignment of timer interrupt related functionality
to the core code.

The clock event layer API is rather small. Aside from the clock event device
registration interface it provides functions to schedule the next event
interrupt, a clock event device notification service and support for suspend
and resume.

The framework adds about 700 lines of code which results in a 2KB increase of
the kernel binary size. The conversion of i386 removes about 100 lines of
code. The binary size decrease is in the range of 400 bytes. We believe that the
increase of flexibility and the avoidance of duplicated code across
architectures justify the slight increase of the binary size.
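
The delegation described earlier in this section, where the management layer
stores a function pointer in the device description structure and the hardware
level handler merely calls it, might look roughly like the following sketch.
The names ``demo_tick_device``, ``demo_set_mode`` and
``demo_arch_timer_interrupt`` are hypothetical, and the counters only stand in
for the real work (jiffies update, process accounting, expiring timers).

```c
#include <stddef.h>

struct demo_tick_device;
typedef void (*demo_event_handler_t)(struct demo_tick_device *dev);

/* Hypothetical device description shared by core and low level code. */
struct demo_tick_device {
	const char *name;
	demo_event_handler_t event_handler;  /* set by the core, not the driver */
	unsigned long periodic_ticks;        /* stands in for the jiffies update */
	unsigned long oneshot_events;        /* stands in for next event handling */
};

/* Core code: handler for the periodic tick. */
static void demo_handle_periodic(struct demo_tick_device *dev)
{
	dev->periodic_ticks++;
}

/* Core code: handler for the non periodic next event interrupt. */
static void demo_handle_oneshot(struct demo_tick_device *dev)
{
	dev->oneshot_events++;
}

/* Core code decides which function the device serves. */
static void demo_set_mode(struct demo_tick_device *dev, int oneshot)
{
	dev->event_handler = oneshot ? demo_handle_oneshot
				     : demo_handle_periodic;
}

/*
 * Architecture specific interrupt handler: it only deals with the
 * hardware and then calls whatever the core has installed.
 */
static void demo_arch_timer_interrupt(struct demo_tick_device *dev)
{
	/* the hardware acknowledge would go here */
	if (dev->event_handler)
		dev->event_handler(dev);
}
```

The point of the design is visible in ``demo_arch_timer_interrupt()``: it
contains no policy, so switching a device between periodic and one-shot use
needs no change in the architecture code.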

The conversion of an architecture has no functional impact, but allows the
high resolution and dynamic tick functionalities to be utilized without any
change to the clock event device and timer interrupt code. After the conversion
the enabling of high resolution timers and dynamic ticks is simply provided by
adding the kernel/time/Kconfig file to the architecture specific Kconfig and
adding the dynamic tick specific calls to the idle routine (a total of 3 lines
added to the idle function and the Kconfig file).

Figure #4 (OLS slides p. 20) illustrates the transformation.


high resolution timer functionality
-----------------------------------

During system boot it is not possible to use the high resolution timer
functionality, while making it possible would be difficult and would serve no
useful function. The initialization of the clock event device framework, the
clock source framework (GTOD) and hrtimers itself has to be done and
appropriate clock sources and clock event devices have to be registered before
the high resolution functionality can work. Up to the point where hrtimers are
initialized, the system works in the usual low resolution periodic mode. The
clock source and the clock event device layers provide notification functions
which inform hrtimers about availability of new hardware. hrtimers validates
the usability of the registered clock sources and clock event devices before
switching to high resolution mode.
This also ensures that a kernel which is configured for high resolution timers
can run on a system which lacks the necessary hardware support.

The high resolution timer code does not support SMP machines which have only
global clock event devices. The support of such hardware would involve IPI
calls when an interrupt happens. The overhead would be much larger than the
benefit. This is the reason why we currently disable high resolution and
dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
state. A workaround is available as an idea, but the problem has not been
tackled yet.

The time ordered insertion of timers provides all the infrastructure to decide
whether the event device has to be reprogrammed when a timer is added. The
decision is made per timer base and synchronized across per-CPU timer bases in
a support function. The design allows the system to utilize separate per-CPU
clock event devices for the per-CPU timer bases, but currently only one
reprogrammable clock event device per CPU is utilized.

When the timer interrupt happens, the next event interrupt handler is called
from the clock event distribution code and moves expired timers from the
red-black tree to a separate doubly linked list and invokes the softirq
handler. An additional mode field in the hrtimer structure allows the system to
execute callback functions directly from the next event interrupt handler.
This is restricted to code which can safely be executed in the hard interrupt
context. This applies, for example, to the common case of a wakeup function as
used by nanosleep. The advantage of executing the handler in the interrupt
context is the avoidance of up to two context switches - from the interrupted
context to the softirq and to the task which is woken up by the expired
timer.

Once a system has switched to high resolution mode, the periodic tick is
switched off. This disables the per system global periodic clock event device -
e.g. the PIT on i386 SMP systems.

The periodic tick functionality is provided by a per-CPU hrtimer. The callback
function is executed in the next event interrupt context and updates jiffies
and calls update_process_times and profiling. The implementation of the hrtimer
based periodic tick is designed to be extended with dynamic tick functionality.
This allows a single clock event device to be used to schedule high resolution
timer and periodic events (jiffies tick, profiling, process accounting) on UP
systems. This has been proven to work with the PIT on i386 and the Incrementer
on PPC.

The softirq for running the hrtimer queues and executing the callbacks has been
separated from the tick bound timer softirq to allow accurate delivery of high
resolution timer signals which are used by itimer and POSIX interval
timers.
The execution of this softirq can still be delayed by other softirqs,
but the overall latencies have been significantly improved by this separation.

Figure #5 (OLS slides p. 22) illustrates the transformation.


dynamic ticks
-------------

Dynamic ticks are the logical consequence of the hrtimer based periodic tick
replacement (sched_tick). The functionality of the sched_tick hrtimer is
extended by three functions:

- hrtimer_stop_sched_tick
- hrtimer_restart_sched_tick
- hrtimer_update_jiffies

hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
evaluates the next scheduled timer event (from both hrtimers and the timer
wheel) and, if the next event is further away than the next tick, it
reprograms the sched_tick to this future event, to allow longer idle sleeps
without needless interruption by the periodic tick. The function is also
called when an interrupt happens during the idle period, which does not cause a
reschedule. The call is necessary as the interrupt handler might have armed a
new timer whose expiry time is before the time which was identified as the
nearest event in the previous call to hrtimer_stop_sched_tick.

hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
it calls schedule().
hrtimer_restart_sched_tick() resumes the periodic tick,
which is kept active until the next call to hrtimer_stop_sched_tick().

hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
in the idle period to make sure that jiffies are up to date and the interrupt
handler does not have to deal with a possibly stale jiffy value.

The dynamic tick feature provides statistical values which are exported to
userspace via /proc/stat and can be made available for enhanced power
management control.

The implementation leaves room for further development like full tickless
systems, where the time slice is controlled by the scheduler, variable
frequency profiling, and a complete removal of jiffies in the future.


Aside from the current initial submission of i386 support, the patchset has
been extended to x86_64 and ARM already. Initial (work in progress) support is
also available for MIPS and PowerPC.

	  Thomas, Ingo