=============
CFS Scheduler
=============


1. OVERVIEW
============

CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.

80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed. For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.

On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime." The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above. In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.



2. A FEW IMPLEMENTATION DETAILS
================================

In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value.
This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.

[ small detail: on "ideal" hardware, at any time all tasks would have the same
  p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
  would ever get "out of balance" from the "ideal" share of CPU time. ]

CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far). CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.

Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.



3. THE RBTREE
==============

CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).

CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
increasing value tracking the smallest vruntime among all tasks in the
runqueue.
The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.

The total number of running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.

CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to
it. As the system progresses forwards, the executed tasks are put into the
tree more and more to the right --- slowly but surely giving a chance for every
task to become the "leftmost task" and thus get on the CPU within a
deterministic amount of time.

Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime. Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and thrash the cache), then the new leftmost task is
picked and the current task is preempted.



4. SOME FEATURES OF CFS
========================

CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever. There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):

   /proc/sys/kernel/sched_min_granularity_ns

which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads. It defaults to a setting suitable
for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.

Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c and massive_intr.c all work fine, do not impact
interactivity, and produce the expected behavior.

The CFS scheduler handles nice levels and SCHED_BATCH much more strongly than
the previous vanilla scheduler: both types of workloads are isolated much more
aggressively.

SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used. The balancing code got quite a bit simpler as a
result.



5. SCHEDULING POLICIES
=======================

CFS implements three scheduling policies:

 - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
   policy that is used for regular tasks.

 - SCHED_BATCH: Does not preempt nearly as often as regular tasks
   would, thereby allowing tasks to run longer and make better use of
   caches, but at the cost of interactivity. This is well suited for
   batch jobs.

 - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
   idle-timer scheduler, in order to avoid getting into priority-inversion
   problems which would deadlock the machine.

SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
POSIX.

The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.



6. SCHEDULING CLASSES
======================

The new CFS scheduler has been designed in such a way to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules. These modules
encapsulate scheduling policy details and are handled by the scheduler core
without the core code assuming too much about them.

sched/fair.c implements the CFS scheduler described above.

sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.

Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.

This is the (partial) list of the hooks:

 - enqueue_task(...)

   Called when a task enters a runnable state.
   It puts the scheduling entity (task) into the red-black tree and
   increments the nr_running variable.

 - dequeue_task(...)

   When a task is no longer runnable, this function is called to keep the
   corresponding scheduling entity out of the red-black tree. It decrements
   the nr_running variable.

 - yield_task(...)

   This function is basically just a dequeue followed by an enqueue, unless the
   compat_yield sysctl is turned on; in that case, it places the scheduling
   entity at the right-most end of the red-black tree.

 - check_preempt_curr(...)

   This function checks if a task that entered the runnable state should
   preempt the currently running task.

 - pick_next_task(...)

   This function chooses the most appropriate task eligible to run next.

 - set_curr_task(...)

   This function is called when a task changes its scheduling class or changes
   its task group.

 - task_tick(...)

   This function is mostly called from time-tick functions; it might lead to a
   process switch. This drives the running preemption.



7. GROUP SCHEDULER EXTENSIONS TO CFS
=====================================

Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.

CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks be
grouped and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.

CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.

   These options need CONFIG_CGROUPS to be defined, and let the administrator
   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.
See
   Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about
   this filesystem.

When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for
each group created using the pseudo filesystem. See the example steps below to
create task groups and modify their CPU share using the "cgroups" pseudo
filesystem::

	# mount -t tmpfs cgroup_root /sys/fs/cgroup
	# mkdir /sys/fs/cgroup/cpu
	# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
	# cd /sys/fs/cgroup/cpu

	# mkdir multimedia	# create "multimedia" group of tasks
	# mkdir browser		# create "browser" group of tasks

	# #Configure the multimedia group to receive twice the CPU bandwidth
	# #that of the browser group

	# echo 2048 > multimedia/cpu.shares
	# echo 1024 > browser/cpu.shares

	# firefox &	# Launch firefox and move it to "browser" group
	# echo <firefox_pid> > browser/tasks

	# #Launch gmplayer (or your favourite movie player)
	# echo <movie_player_pid> > multimedia/tasks