=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler, (otherwise we'd have done it long ago) because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::

              A
            \ | [timeslice length]
             \ |
              \ |
               \ |
                \ |
                 \|___100msecs
                  |^ . _
                  |      ^ . _
                  |            ^ . _
  -*----------------------------------*-----> [nice level]
  -20               |                +19
                     |
                     |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small a
CPU utilization, but because it causes too frequent (once per
millisecond) rescheduling, which would thus trash the cache, etc.
(Remember, this was long ago when hardware was weaker and caches were
smaller, and people were running number-crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5 msecs, because that felt like
the right minimal granularity - and this translates to 5% CPU
utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

     int nice(int inc);

     asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy' enough, so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice +19 tasks still using up too
much CPU time), the scheduler was decoupled from 'time slice' and HZ
concepts (and granularity was made a separate concept from nice levels)
and thus it was possible to implement better and more consistent
nice +19 support: with the new scheduler nice +19 tasks get a
HZ-independent 1.5%, instead of the variable 3%-5%-9% range they got in
the old scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task (one will get 55% of the CPU, the other 45%). That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from: the 'relative
result' will always be the same.
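
To see why a multiplicative rule gives this property, here is a small
user-space sketch (not the scheduler's code). It assumes a constant
weight step of roughly 1.25 per nice level, which only approximates the
scheduler's internal per-nice weight table; the helper functions and the
exact factor below are illustrative assumptions, not kernel interfaces::

  /*
   * Illustrative user-space sketch only - NOT the scheduler's code.  It
   * models the multiplicative nice rule with a constant ~1.25 weight
   * ratio per nice level; the kernel uses a precomputed integer weight
   * table that this factor merely approximates.
   */
  #include <math.h>
  #include <stdio.h>

  /* Relative CPU weight of a task at a given nice level (nice 0 == 1.0). */
  static double nice_weight(int nice)
  {
          return pow(1.25, -nice);   /* each +1 of nice => ~20% less share */
  }

  /* CPU share task A gets when competing with task B on one CPU. */
  static double cpu_share(int nice_a, int nice_b)
  {
          double wa = nice_weight(nice_a), wb = nice_weight(nice_b);

          return wa / (wa + wb);
  }

  int main(void)
  {
          /* A one-level difference gives the same split at any absolute level: */
          printf("+10 vs +11: %4.1f%%\n", 100.0 * cpu_share(10, 11)); /* ~55.6% */
          printf(" -5 vs  -4: %4.1f%%\n", 100.0 * cpu_share(-5, -4)); /* ~55.6% */

          /* And a nice +19 task competing with a nice 0 task gets ~1.5%: */
          printf("+19 vs   0: %4.1f%%\n", 100.0 * cpu_share(19, 0));  /* ~1.4%  */
          return 0;
  }

Compiled with -lm, this prints roughly the 55%/45% split quoted above
for both pairs, and about 1.5% for the nice +19 case - whichever
absolute level you start from, a one-level difference yields the same
relative result.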

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.
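
In practice this means that a latency-sensitive (but not hard-real-time)
application can often get by with a moderately negative nice level
instead of SCHED_FIFO. The following user-space sketch shows that
approach; the -10 value is an arbitrary illustration, and lowering the
nice value still requires the appropriate privilege (root or
CAP_SYS_NICE)::

  /*
   * Sketch: prefer a negative nice level to SCHED_FIFO for a
   * latency-sensitive task.  Unlike SCHED_FIFO, a nice -10 task can
   * still be preempted and cannot starve the system for good.
   */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/resource.h>

  int main(void)
  {
          /* -10 is an illustrative choice; needs root or CAP_SYS_NICE. */
          if (setpriority(PRIO_PROCESS, 0, -10) != 0) {
                  fprintf(stderr, "setpriority: %s\n", strerror(errno));
                  return 1;
          }

          printf("running at nice %d\n", getpriority(PRIO_PROCESS, 0));

          /* ... audio/multimedia processing would go here ... */
          return 0;
  }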