=============
CFS Scheduler
=============


1. OVERVIEW
============

CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.

80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed. For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.

On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime." The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above. In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.



2. A FEW IMPLEMENTATION DETAILS
================================

In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value.
This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.

[ small detail: on "ideal" hardware, at any time all tasks would have the same
  p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
  would ever get "out of balance" from the "ideal" share of CPU time. ]

CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far). CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.

Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.



3. THE RBTREE
==============

CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).

CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
increasing value tracking the smallest vruntime among all tasks in the
runqueue.
The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.

The total number of running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.

CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to
it. As the system progresses forwards, the executed tasks are put into the
tree more and more to the right --- slowly but surely giving a chance for every
task to become the "leftmost task" and thus get on the CPU within a
deterministic amount of time.

Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime. Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and thrash the cache), then the new leftmost task is
picked and the current task is preempted.



4. SOME FEATURES OF CFS
========================

CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever. There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):

   /proc/sys/kernel/sched_min_granularity_ns

which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads. It defaults to a setting suitable
for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.

Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c and massive_intr.c all work fine, do not impact
interactivity, and produce the expected behavior.

The CFS scheduler handles nice levels and SCHED_BATCH much more strongly than
the previous vanilla scheduler: both types of workloads are isolated much more
aggressively.

SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used. The balancing code got quite a bit simpler as a
result.



5. SCHEDULING POLICIES
=======================

CFS implements three scheduling policies:

 - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
   policy that is used for regular tasks.

 - SCHED_BATCH: Does not preempt nearly as often as regular tasks
   would, thereby allowing tasks to run longer and make better use of
   caches, but at the cost of interactivity. This is well suited for
   batch jobs.

 - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
   idle-timer scheduler, in order to avoid getting into priority-inversion
   problems which would deadlock the machine.

SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
POSIX.

The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.



6. SCHEDULING CLASSES
======================

The new CFS scheduler has been designed in such a way to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules. These modules
encapsulate scheduling policy details and are handled by the scheduler core
without the core code assuming too much about them.

sched/fair.c implements the CFS scheduler described above.

sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.

Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.

This is the (partial) list of the hooks:

 - enqueue_task(...)

   Called when a task enters a runnable state.
   It puts the scheduling entity (task) into the red-black tree and
   increments the nr_running variable.

 - dequeue_task(...)

   When a task is no longer runnable, this function is called to keep the
   corresponding scheduling entity out of the red-black tree. It decrements
   the nr_running variable.

 - yield_task(...)

   This function is basically just a dequeue followed by an enqueue, unless the
   compat_yield sysctl is turned on; in that case, it places the scheduling
   entity at the right-most end of the red-black tree.

 - check_preempt_curr(...)

   This function checks if a task that entered the runnable state should
   preempt the currently running task.

 - pick_next_task(...)

   This function chooses the most appropriate task eligible to run next.

 - set_curr_task(...)

   This function is called when a task changes its scheduling class or changes
   its task group.

 - task_tick(...)

   This function is mostly called from time-tick functions; it might lead to a
   process switch. This drives the running preemption.



7. GROUP SCHEDULER EXTENSIONS TO CFS
=====================================

Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.

CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks be
grouped and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.

CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.

   These options need CONFIG_CGROUPS to be defined, and let the administrator
   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.
See
   Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about
   this filesystem.

When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for
each group created using the pseudo filesystem. See the example steps below to
create task groups and modify their CPU share using the "cgroups" pseudo
filesystem::

	# mount -t tmpfs cgroup_root /sys/fs/cgroup
	# mkdir /sys/fs/cgroup/cpu
	# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
	# cd /sys/fs/cgroup/cpu

	# mkdir multimedia	# create "multimedia" group of tasks
	# mkdir browser		# create "browser" group of tasks

	# #Configure the multimedia group to receive twice the CPU bandwidth
	# #that of the browser group

	# echo 2048 > multimedia/cpu.shares
	# echo 1024 > browser/cpu.shares

	# firefox &	# Launch firefox and move it to "browser" group
	# echo <firefox_pid> > browser/tasks

	# #Launch gmplayer (or your favourite movie player)
	# echo <movie_player_pid> > multimedia/tasks