1*4882a593Smuzhiyun====================================================== 2*4882a593Smuzhiyunhrtimers - subsystem for high-resolution kernel timers 3*4882a593Smuzhiyun====================================================== 4*4882a593Smuzhiyun 5*4882a593SmuzhiyunThis patch introduces a new subsystem for high-resolution kernel timers. 6*4882a593Smuzhiyun 7*4882a593SmuzhiyunOne might ask the question: we already have a timer subsystem 8*4882a593Smuzhiyun(kernel/timers.c), why do we need two timer subsystems? After a lot of 9*4882a593Smuzhiyunback and forth trying to integrate high-resolution and high-precision 10*4882a593Smuzhiyunfeatures into the existing timer framework, and after testing various 11*4882a593Smuzhiyunsuch high-resolution timer implementations in practice, we came to the 12*4882a593Smuzhiyunconclusion that the timer wheel code is fundamentally not suitable for 13*4882a593Smuzhiyunsuch an approach. We initially didn't believe this ('there must be a way 14*4882a593Smuzhiyunto solve this'), and spent a considerable effort trying to integrate 15*4882a593Smuzhiyunthings into the timer wheel, but we failed. In hindsight, there are 16*4882a593Smuzhiyunseveral reasons why such integration is hard/impossible: 17*4882a593Smuzhiyun 18*4882a593Smuzhiyun- the forced handling of low-resolution and high-resolution timers in 19*4882a593Smuzhiyun the same way leads to a lot of compromises, macro magic and #ifdef 20*4882a593Smuzhiyun mess. The timers.c code is very "tightly coded" around jiffies and 21*4882a593Smuzhiyun 32-bitness assumptions, and has been honed and micro-optimized for a 22*4882a593Smuzhiyun relatively narrow use case (jiffies in a relatively narrow HZ range) 23*4882a593Smuzhiyun for many years - and thus even small extensions to it easily break 24*4882a593Smuzhiyun the wheel concept, leading to even worse compromises. The timer wheel 25*4882a593Smuzhiyun code is very good and tight code, there's zero problems with it in its 26*4882a593Smuzhiyun current usage - but it is simply not suitable to be extended for 27*4882a593Smuzhiyun high-res timers. 28*4882a593Smuzhiyun 29*4882a593Smuzhiyun- the unpredictable [O(N)] overhead of cascading leads to delays which 30*4882a593Smuzhiyun necessitate a more complex handling of high resolution timers, which 31*4882a593Smuzhiyun in turn decreases robustness. Such a design still leads to rather large 32*4882a593Smuzhiyun timing inaccuracies. Cascading is a fundamental property of the timer 33*4882a593Smuzhiyun wheel concept, it cannot be 'designed out' without inevitably 34*4882a593Smuzhiyun degrading other portions of the timers.c code in an unacceptable way. 35*4882a593Smuzhiyun 36*4882a593Smuzhiyun- the implementation of the current posix-timer subsystem on top of 37*4882a593Smuzhiyun the timer wheel has already introduced a quite complex handling of 38*4882a593Smuzhiyun the required readjusting of absolute CLOCK_REALTIME timers at 39*4882a593Smuzhiyun settimeofday or NTP time - further underlying our experience by 40*4882a593Smuzhiyun example: that the timer wheel data structure is too rigid for high-res 41*4882a593Smuzhiyun timers. 42*4882a593Smuzhiyun 43*4882a593Smuzhiyun- the timer wheel code is most optimal for use cases which can be 44*4882a593Smuzhiyun identified as "timeouts". Such timeouts are usually set up to cover 45*4882a593Smuzhiyun error conditions in various I/O paths, such as networking and block 46*4882a593Smuzhiyun I/O. The vast majority of those timers never expire and are rarely 47*4882a593Smuzhiyun recascaded because the expected correct event arrives in time so they 48*4882a593Smuzhiyun can be removed from the timer wheel before any further processing of 49*4882a593Smuzhiyun them becomes necessary. Thus the users of these timeouts can accept 50*4882a593Smuzhiyun the granularity and precision tradeoffs of the timer wheel, and 51*4882a593Smuzhiyun largely expect the timer subsystem to have near-zero overhead. 52*4882a593Smuzhiyun Accurate timing for them is not a core purpose - in fact most of the 53*4882a593Smuzhiyun timeout values used are ad-hoc. For them it is at most a necessary 54*4882a593Smuzhiyun evil to guarantee the processing of actual timeout completions 55*4882a593Smuzhiyun (because most of the timeouts are deleted before completion), which 56*4882a593Smuzhiyun should thus be as cheap and unintrusive as possible. 57*4882a593Smuzhiyun 58*4882a593SmuzhiyunThe primary users of precision timers are user-space applications that 59*4882a593Smuzhiyunutilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel 60*4882a593Smuzhiyunusers like drivers and subsystems which require precise timed events 61*4882a593Smuzhiyun(e.g. multimedia) can benefit from the availability of a separate 62*4882a593Smuzhiyunhigh-resolution timer subsystem as well. 63*4882a593Smuzhiyun 64*4882a593SmuzhiyunWhile this subsystem does not offer high-resolution clock sources just 65*4882a593Smuzhiyunyet, the hrtimer subsystem can be easily extended with high-resolution 66*4882a593Smuzhiyunclock capabilities, and patches for that exist and are maturing quickly. 67*4882a593SmuzhiyunThe increasing demand for realtime and multimedia applications along 68*4882a593Smuzhiyunwith other potential users for precise timers gives another reason to 69*4882a593Smuzhiyunseparate the "timeout" and "precise timer" subsystems. 70*4882a593Smuzhiyun 71*4882a593SmuzhiyunAnother potential benefit is that such a separation allows even more 72*4882a593Smuzhiyunspecial-purpose optimization of the existing timer wheel for the low 73*4882a593Smuzhiyunresolution and low precision use cases - once the precision-sensitive 74*4882a593SmuzhiyunAPIs are separated from the timer wheel and are migrated over to 75*4882a593Smuzhiyunhrtimers. E.g. we could decrease the frequency of the timeout subsystem 76*4882a593Smuzhiyunfrom 250 Hz to 100 HZ (or even smaller). 77*4882a593Smuzhiyun 78*4882a593Smuzhiyunhrtimer subsystem implementation details 79*4882a593Smuzhiyun---------------------------------------- 80*4882a593Smuzhiyun 81*4882a593Smuzhiyunthe basic design considerations were: 82*4882a593Smuzhiyun 83*4882a593Smuzhiyun- simplicity 84*4882a593Smuzhiyun 85*4882a593Smuzhiyun- data structure not bound to jiffies or any other granularity. All the 86*4882a593Smuzhiyun kernel logic works at 64-bit nanoseconds resolution - no compromises. 87*4882a593Smuzhiyun 88*4882a593Smuzhiyun- simplification of existing, timing related kernel code 89*4882a593Smuzhiyun 90*4882a593Smuzhiyunanother basic requirement was the immediate enqueueing and ordering of 91*4882a593Smuzhiyuntimers at activation time. After looking at several possible solutions 92*4882a593Smuzhiyunsuch as radix trees and hashes, we chose the red black tree as the basic 93*4882a593Smuzhiyundata structure. Rbtrees are available as a library in the kernel and are 94*4882a593Smuzhiyunused in various performance-critical areas of e.g. memory management and 95*4882a593Smuzhiyunfile systems. The rbtree is solely used for time sorted ordering, while 96*4882a593Smuzhiyuna separate list is used to give the expiry code fast access to the 97*4882a593Smuzhiyunqueued timers, without having to walk the rbtree. 98*4882a593Smuzhiyun 99*4882a593Smuzhiyun(This separate list is also useful for later when we'll introduce 100*4882a593Smuzhiyunhigh-resolution clocks, where we need separate pending and expired 101*4882a593Smuzhiyunqueues while keeping the time-order intact.) 102*4882a593Smuzhiyun 103*4882a593SmuzhiyunTime-ordered enqueueing is not purely for the purposes of 104*4882a593Smuzhiyunhigh-resolution clocks though, it also simplifies the handling of 105*4882a593Smuzhiyunabsolute timers based on a low-resolution CLOCK_REALTIME. The existing 106*4882a593Smuzhiyunimplementation needed to keep an extra list of all armed absolute 107*4882a593SmuzhiyunCLOCK_REALTIME timers along with complex locking. In case of 108*4882a593Smuzhiyunsettimeofday and NTP, all the timers (!) had to be dequeued, the 109*4882a593Smuzhiyuntime-changing code had to fix them up one by one, and all of them had to 110*4882a593Smuzhiyunbe enqueued again. The time-ordered enqueueing and the storage of the 111*4882a593Smuzhiyunexpiry time in absolute time units removes all this complex and poorly 112*4882a593Smuzhiyunscaling code from the posix-timer implementation - the clock can simply 113*4882a593Smuzhiyunbe set without having to touch the rbtree. This also makes the handling 114*4882a593Smuzhiyunof posix-timers simpler in general. 115*4882a593Smuzhiyun 116*4882a593SmuzhiyunThe locking and per-CPU behavior of hrtimers was mostly taken from the 117*4882a593Smuzhiyunexisting timer wheel code, as it is mature and well suited. Sharing code 118*4882a593Smuzhiyunwas not really a win, due to the different data structures. Also, the 119*4882a593Smuzhiyunhrtimer functions now have clearer behavior and clearer names - such as 120*4882a593Smuzhiyunhrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly 121*4882a593Smuzhiyunequivalent to del_timer() and del_timer_sync()] - so there's no direct 122*4882a593Smuzhiyun1:1 mapping between them on the algorithmic level, and thus no real 123*4882a593Smuzhiyunpotential for code sharing either. 124*4882a593Smuzhiyun 125*4882a593SmuzhiyunBasic data types: every time value, absolute or relative, is in a 126*4882a593Smuzhiyunspecial nanosecond-resolution type: ktime_t. The kernel-internal 127*4882a593Smuzhiyunrepresentation of ktime_t values and operations is implemented via 128*4882a593Smuzhiyunmacros and inline functions, and can be switched between a "hybrid 129*4882a593Smuzhiyununion" type and a plain "scalar" 64bit nanoseconds representation (at 130*4882a593Smuzhiyuncompile time). The hybrid union type optimizes time conversions on 32bit 131*4882a593SmuzhiyunCPUs. This build-time-selectable ktime_t storage format was implemented 132*4882a593Smuzhiyunto avoid the performance impact of 64-bit multiplications and divisions 133*4882a593Smuzhiyunon 32bit CPUs. Such operations are frequently necessary to convert 134*4882a593Smuzhiyunbetween the storage formats provided by kernel and userspace interfaces 135*4882a593Smuzhiyunand the internal time format. (See include/linux/ktime.h for further 136*4882a593Smuzhiyundetails.) 137*4882a593Smuzhiyun 138*4882a593Smuzhiyunhrtimers - rounding of timer values 139*4882a593Smuzhiyun----------------------------------- 140*4882a593Smuzhiyun 141*4882a593Smuzhiyunthe hrtimer code will round timer events to lower-resolution clocks 142*4882a593Smuzhiyunbecause it has to. Otherwise it will do no artificial rounding at all. 143*4882a593Smuzhiyun 144*4882a593Smuzhiyunone question is, what resolution value should be returned to the user by 145*4882a593Smuzhiyunthe clock_getres() interface. This will return whatever real resolution 146*4882a593Smuzhiyuna given clock has - be it low-res, high-res, or artificially-low-res. 147*4882a593Smuzhiyun 148*4882a593Smuzhiyunhrtimers - testing and verification 149*4882a593Smuzhiyun----------------------------------- 150*4882a593Smuzhiyun 151*4882a593SmuzhiyunWe used the high-resolution clock subsystem ontop of hrtimers to verify 152*4882a593Smuzhiyunthe hrtimer implementation details in praxis, and we also ran the posix 153*4882a593Smuzhiyuntimer tests in order to ensure specification compliance. We also ran 154*4882a593Smuzhiyuntests on low-resolution clocks. 155*4882a593Smuzhiyun 156*4882a593SmuzhiyunThe hrtimer patch converts the following kernel functionality to use 157*4882a593Smuzhiyunhrtimers: 158*4882a593Smuzhiyun 159*4882a593Smuzhiyun - nanosleep 160*4882a593Smuzhiyun - itimers 161*4882a593Smuzhiyun - posix-timers 162*4882a593Smuzhiyun 163*4882a593SmuzhiyunThe conversion of nanosleep and posix-timers enabled the unification of 164*4882a593Smuzhiyunnanosleep and clock_nanosleep. 165*4882a593Smuzhiyun 166*4882a593SmuzhiyunThe code was successfully compiled for the following platforms: 167*4882a593Smuzhiyun 168*4882a593Smuzhiyun i386, x86_64, ARM, PPC, PPC64, IA64 169*4882a593Smuzhiyun 170*4882a593SmuzhiyunThe code was run-tested on the following platforms: 171*4882a593Smuzhiyun 172*4882a593Smuzhiyun i386(UP/SMP), x86_64(UP/SMP), ARM, PPC 173*4882a593Smuzhiyun 174*4882a593Smuzhiyunhrtimers were also integrated into the -rt tree, along with a 175*4882a593Smuzhiyunhrtimers-based high-resolution clock implementation, so the hrtimers 176*4882a593Smuzhiyuncode got a healthy amount of testing and use in practice. 177*4882a593Smuzhiyun 178*4882a593Smuzhiyun Thomas Gleixner, Ingo Molnar 179