xref: /OK3568_Linux_fs/kernel/Documentation/scheduler/sched-nice-design.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

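The linear rule in the graph can be modeled with a short sketch. This is a
simplified illustration only (the constants and the negative-side slope are
made up for the picture, not the kernel's actual values): nice 0 gets 100
msecs, and the positive-side slope is chosen so nice +19 lands exactly on
one jiffy.

```python
# Illustrative model of the linear timeslice rule sketched above.
# All constants are for illustration; they are not kernel values.
HZ = 100                    # ticks per second
JIFFY_MS = 1000 // HZ       # smallest timeslice: 1/HZ = 10 ms here
NICE0_SLICE_MS = 100        # nice 0 gets 100 msecs in the graph

def timeslice_ms(nice):
    """Timeslice in msecs for a given nice level, per the graph."""
    if nice >= 0:
        # linear from 100 ms at nice 0 down to exactly 1 jiffy at +19
        return NICE0_SLICE_MS - (NICE0_SLICE_MS - JIFFY_MS) * nice // 19
    # negative levels rise much more steeply (the '\' part of the graph)
    return NICE0_SLICE_MS - nice * 35

assert timeslice_ms(0) == 100
assert timeslice_ms(19) == JIFFY_MS   # nice +19 is _exactly_ 1 jiffy
```

Note how the two halves of the curve have different slopes, which is the
asymmetry the graph shows and which becomes the second complaint below.
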
So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling. (and would thus thrash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization, we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment);

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.

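The relative nature of the interface is easy to observe from user space;
for instance, Python's ``os.nice()`` wraps the same increment-based call
(the clamp at +19 in the assertion reflects the nice range's upper bound):

```python
import os

# os.nice() takes an *increment* and returns the resulting nice level,
# mirroring the relative nice(2)/sys_nice() interface described above.
current = os.nice(0)    # an increment of 0 just reads the current level
raised = os.nice(1)     # lower our own priority by one step

# The new level is relative to wherever we started, clamped at +19.
assert raised == min(current + 1, 19)
```

Raising one's own nice level (a positive increment) needs no privileges,
which is why this snippet runs as any user; lowering it back would not.
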
With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels) and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

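A sketch of where that ~1.5% comes from: in the new scheduler each nice
level maps to a load weight, and a task's CPU share is simply its weight
divided by the total runnable weight. Assuming the familiar table values
of 1024 for nice 0 and 15 for nice +19, a nice +19 task competing with a
nice 0 hog gets:

```python
# CPU share as a ratio of load weights -- no HZ or jiffies involved,
# which is what makes the result HZ-independent.
# 1024 (nice 0) and 15 (nice +19) are the assumed table values.
WEIGHT_NICE_0 = 1024
WEIGHT_NICE_19 = 15

share = WEIGHT_NICE_19 / (WEIGHT_NICE_19 + WEIGHT_NICE_0)
assert abs(share - 0.015) < 0.002    # roughly 1.5% of the CPU
```

Because the share is a pure weight ratio, changing HZ changes scheduling
granularity but not this percentage.
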
To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.

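The multiplicative rule can be sketched with a simplified model in which
each nice step scales a task's weight by a constant factor of about 1.25
(an approximation of the real weight table, not its exact entries):

```python
# Simplified multiplicative model: one nice step ~= a 1.25x weight step.
def weight(nice):
    return 1024 / (1.25 ** nice)

def split(nice_a, nice_b):
    """CPU share of task A when running against task B."""
    wa, wb = weight(nice_a), weight(nice_b)
    return wa / (wa + wb)

# A one-level gap gives the same ~55%/45% split at any absolute level:
assert abs(split(10, 11) - split(-5, -4)) < 1e-9
assert abs(split(10, 11) - 5 / 9) < 1e-6     # 1.25/2.25 ~= 55.6%
```

Only the *difference* between the two nice levels enters the ratio, which
is exactly the "relative result" property the text describes.
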
The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.
113