=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler, (otherwise we'd have done it long ago) because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::

              A
            \ | [timeslice length]
             \ |
              \ |
               \ |
                \ |
                 \|___100msecs
                  |^ . _
                  |      ^ . _
                  |            ^ . _
  -*----------------------------------*-----> [nice level]
  -20               |                +19
                     |
                     |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small a
CPU utilization, but because it causes too frequent (once per
millisecond) rescheduling, which would thus trash the cache, etc.
(Remember, this was long ago when hardware was weaker and caches were
smaller, and people were running number-crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5 msecs, because that felt like
the right minimal granularity - and this translates to 5% CPU
utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

     int nice(int inc);

     asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy' enough, so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice +19 tasks still using up too
much CPU time), the scheduler was decoupled from 'time slice' and HZ
concepts (and granularity was made a separate concept from nice levels)
and thus it was possible to implement better and more consistent
nice +19 support: with the new scheduler nice +19 tasks get a
HZ-independent 1.5%, instead of the variable 3%-5%-9% range they got in
the old scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task (one will get 55% of the CPU, the other 45%). That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from: the 'relative
result' will always be the same.
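
To see why a multiplicative rule gives this property, here is a small
user-space sketch (not the scheduler's code). It assumes a constant
weight step of roughly 1.25 per nice level, which only approximates the
scheduler's internal per-nice weight table; the helper functions and the
exact factor below are illustrative assumptions, not kernel interfaces::

  /*
   * Illustrative user-space sketch only - NOT the scheduler's code.  It
   * models the multiplicative nice rule with a constant ~1.25 weight
   * ratio per nice level; the kernel uses a precomputed integer weight
   * table that this factor merely approximates.
   */
  #include <math.h>
  #include <stdio.h>

  /* Relative CPU weight of a task at a given nice level (nice 0 == 1.0). */
  static double nice_weight(int nice)
  {
          return pow(1.25, -nice);   /* each +1 of nice => ~20% less share */
  }

  /* CPU share task A gets when competing with task B on one CPU. */
  static double cpu_share(int nice_a, int nice_b)
  {
          double wa = nice_weight(nice_a), wb = nice_weight(nice_b);

          return wa / (wa + wb);
  }

  int main(void)
  {
          /* A one-level difference gives the same split at any absolute level: */
          printf("+10 vs +11: %4.1f%%\n", 100.0 * cpu_share(10, 11)); /* ~55.6% */
          printf(" -5 vs  -4: %4.1f%%\n", 100.0 * cpu_share(-5, -4)); /* ~55.6% */

          /* And a nice +19 task competing with a nice 0 task gets ~1.5%: */
          printf("+19 vs   0: %4.1f%%\n", 100.0 * cpu_share(19, 0));  /* ~1.4%  */
          return 0;
  }

Compiled with -lm, this prints roughly the 55%/45% split quoted above
for both pairs, and about 1.5% for the nice +19 case - whichever
absolute level you start from, a one-level difference yields the same
relative result.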

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.
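
In practice this means that a latency-sensitive (but not hard-real-time)
application can often get by with a moderately negative nice level
instead of SCHED_FIFO. The following user-space sketch shows that
approach; the -10 value is an arbitrary illustration, and lowering the
nice value still requires the appropriate privilege (root or
CAP_SYS_NICE)::

  /*
   * Sketch: prefer a negative nice level to SCHED_FIFO for a
   * latency-sensitive task.  Unlike SCHED_FIFO, a nice -10 task can
   * still be preempted and cannot starve the system for good.
   */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/resource.h>

  int main(void)
  {
          /* -10 is an illustrative choice; needs root or CAP_SYS_NICE. */
          if (setpriority(PRIO_PROCESS, 0, -10) != 0) {
                  fprintf(stderr, "setpriority: %s\n", strerror(errno));
                  return 1;
          }

          printf("running at nice %d\n", getpriority(PRIO_PROCESS, 0));

          /* ... audio/multimedia processing would go here ... */
          return 0;
  }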