xref: /OK3568_Linux_fs/kernel/Documentation/scheduler/sched-design-CFS.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
=============
CFS Scheduler
=============


1.  OVERVIEW
============

CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23.  It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.

80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent  :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed.  For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.

On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime."  The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above.  In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.


2.  FEW IMPLEMENTATION DETAILS
==============================

In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.

[ small detail: on "ideal" hardware, at any time all tasks would have the same
  p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
  would ever get "out of balance" from the "ideal" share of CPU time.  ]

CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far).  CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.

Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.


3.  THE RBTREE
==============

CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).

CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
increasing value tracking the smallest vruntime among all tasks in the
runqueue.  The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.

The total number of running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.

CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key.  CFS picks the "leftmost" task from this tree and sticks to
it.  As the system progresses forwards, the executed tasks are put into the
tree more and more to the right --- slowly but surely giving a chance for every
task to become the "leftmost task" and thus get on the CPU within a
deterministic amount of time.

Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime.  Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and thrash the cache), then the new leftmost task is
picked and the current task is preempted.


4.  SOME FEATURES OF CFS
========================

CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail.  Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever.  There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):

   /proc/sys/kernel/sched_min_granularity_ns

which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads.  It defaults to a setting suitable
for desktop workloads.  SCHED_BATCH is handled by the CFS scheduler module too.
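
As a sketch of how the tunable is inspected and changed (requires root and a
kernel built with CONFIG_SCHED_DEBUG, as noted above; the value written here
is purely illustrative):

```shell
# Read the current minimum granularity, in nanoseconds.
cat /proc/sys/kernel/sched_min_granularity_ns

# Bias towards "server" behavior: allow longer uninterrupted slices
# at the cost of scheduling latency.
echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns
```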

Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c and massive_intr.c all work fine, do not impact
interactivity, and produce the expected behavior.

The CFS scheduler has much stronger handling of nice levels and SCHED_BATCH
than the previous vanilla scheduler: both types of workloads are isolated much
more aggressively.

SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used.  The balancing code got quite a bit simpler as a
result.


5.  SCHEDULING POLICIES
=======================

CFS implements three scheduling policies:

  - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
    policy that is used for regular tasks.

  - SCHED_BATCH: Does not preempt nearly as often as regular tasks
    would, thereby allowing tasks to run longer and make better use of
    caches, but at the cost of interactivity.  This is well suited for
    batch jobs.

  - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
    idle-timer scheduler, in order to avoid priority-inversion problems
    which would deadlock the machine.

SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
POSIX.

The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.


6.  SCHEDULING CLASSES
======================

The new CFS scheduler has been designed in such a way as to introduce
"Scheduling Classes," an extensible hierarchy of scheduler modules.  These
modules encapsulate scheduling policy details and are handled by the scheduler
core without the core code assuming too much about them.

sched/fair.c implements the CFS scheduler described above.

sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did.  It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.

Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.

This is the (partial) list of the hooks:

 - enqueue_task(...)

   Called when a task enters a runnable state.
   It puts the scheduling entity (task) into the red-black tree and
   increments the nr_running variable.

 - dequeue_task(...)

   When a task is no longer runnable, this function is called to keep the
   corresponding scheduling entity out of the red-black tree.  It decrements
   the nr_running variable.

 - yield_task(...)

   This function is basically just a dequeue followed by an enqueue, unless the
   compat_yield sysctl is turned on; in that case, it places the scheduling
   entity at the right-most end of the red-black tree.

 - check_preempt_curr(...)

   This function checks if a task that entered the runnable state should
   preempt the currently running task.

 - pick_next_task(...)

   This function chooses the most appropriate task eligible to run next.

 - set_curr_task(...)

   This function is called when a task changes its scheduling class or changes
   its task group.

 - task_tick(...)

   This function is mostly called from time tick functions; it might lead to a
   process switch.  This drives the running preemption.



7.  GROUP SCHEDULER EXTENSIONS TO CFS
=====================================

Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task.  Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group.  For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.

CONFIG_CGROUP_SCHED strives to achieve exactly that.  It lets tasks be grouped
and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.

CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.

   These options need CONFIG_CGROUPS to be defined, and let the administrator
   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
   Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about
   this filesystem.

When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for
each group created using the pseudo filesystem.  See the example steps below to
create task groups and modify their CPU share using the "cgroups" pseudo
filesystem::

	# mount -t tmpfs cgroup_root /sys/fs/cgroup
	# mkdir /sys/fs/cgroup/cpu
	# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
	# cd /sys/fs/cgroup/cpu

	# mkdir multimedia	# create "multimedia" group of tasks
	# mkdir browser		# create "browser" group of tasks

	# #Configure the multimedia group to receive twice the CPU
	# #bandwidth of the browser group

	# echo 2048 > multimedia/cpu.shares
	# echo 1024 > browser/cpu.shares

	# firefox &	# Launch firefox and move it to "browser" group
	# echo <firefox_pid> > browser/tasks

	# #Launch gmplayer (or your favourite movie player)
	# echo <movie_player_pid> > multimedia/tasks
250