==========================
Real-Time group scheduling
==========================

.. CONTENTS

   0. WARNING
   1. Overview
     1.1 The problem
     1.2 The solution
   2. The interface
     2.1 System-wide settings
     2.2 Default behaviour
     2.3 Basis for grouping tasks
   3. Future plans


0. WARNING
==========

 Fiddling with these settings can result in an unstable system. The knobs are
 root-only and assume root knows what they are doing.

Most notable:

 * very small values in sched_rt_period_us can result in an unstable
   system when the period is smaller than either the available hrtimer
   resolution, or the time it takes to handle the budget refresh itself.

 * very small values in sched_rt_runtime_us can result in an unstable
   system when the runtime is so small the system has difficulty making
   forward progress (NOTE: the migration thread and kstopmachine both
   are real-time processes).

1. Overview
===========


1.1 The problem
---------------

Realtime scheduling is all about determinism: a group has to be able to rely on
the amount of bandwidth (e.g. CPU time) being constant.
In order to schedule
multiple groups of realtime tasks, each group must be assigned a fixed portion
of the CPU time available. Without a minimum guarantee a realtime group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. That leaves us with just the single fixed portion.

1.2 The solution
----------------

CPU time is divided by means of specifying how much time can be spent running
in a given period. We allocate this "run time" to each realtime group, and the
other realtime groups will not be permitted to use it.

Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.

Let's consider an example: a fixed frame rate realtime renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving it with around 80% CPU
time dedicated for the graphics. We can then give this group a run time of 0.8
* 0.04s = 0.032s.

This way the graphics group will have a 0.04s period with a 0.032s run time
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
needs only about 3% CPU time to do so, it can make do with 0.03 * 0.005s =
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
of 0.00015s.

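The arithmetic for the graphics and audio groups can be reproduced with a
short Python sketch. The helper name ``runtime_us`` is hypothetical and only
converts a period and a CPU share into the microsecond values the interface
expects; it does not talk to the kernel.

```python
def runtime_us(period_s: float, cpu_share: float) -> int:
    """Return the run time in microseconds for a period (s) and a CPU share."""
    return round(period_s * cpu_share * 1_000_000)


# Graphics group: 25 frames a second, about 80% of the CPU per frame period.
graphics_period_s = 1 / 25
graphics_runtime_us = runtime_us(graphics_period_s, 0.80)

# Audio group: DMA buffer refill every 0.005s, about 3% of the CPU.
audio_period_s = 0.005
audio_runtime_us = runtime_us(audio_period_s, 0.03)

print(graphics_runtime_us, audio_runtime_us)
```

The two results correspond to the 0.032s and 0.00015s run times derived above,
expressed in microseconds.
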

The remaining CPU time will be used for user input and other tasks. Because
realtime tasks have explicitly allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.

NOTE: the above example is not fully implemented yet. We still
lack an EDF scheduler to make non-uniform periods usable.


2. The interface
================


2.1 System-wide settings
------------------------

The system-wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:
  The scheduling period that is equivalent to 100% CPU bandwidth.

/proc/sys/kernel/sched_rt_runtime_us:
  A global limit on how much time realtime scheduling may use. Even without
  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
  available to all realtime groups.

  * Time is specified in us because the interface is s32. This gives an
    operating range from 1us to about 35 minutes.
  * sched_rt_period_us takes values from 1 to INT_MAX.
  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
  * A run time of -1 specifies runtime == period, i.e. no limit.

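As an illustration of the ranges listed above, the following Python sketch
checks a period/runtime pair and returns the resulting share of CPU time. The
function names ``rt_share`` and ``read_knob`` are hypothetical, and the checks
mirror only the ranges stated in this document, not every check the kernel
may perform.

```python
from pathlib import Path

INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer (s32)


def rt_share(period_us: int, runtime_us: int) -> float:
    """Return the fraction of each period available to realtime tasks."""
    if not 1 <= period_us <= INT_MAX:
        raise ValueError("sched_rt_period_us must be in [1, INT_MAX]")
    if not -1 <= runtime_us <= INT_MAX - 1:
        raise ValueError("sched_rt_runtime_us must be in [-1, INT_MAX - 1]")
    if runtime_us == -1:  # -1 means runtime == period, i.e. no limit
        return 1.0
    return runtime_us / period_us


def read_knob(name: str) -> int:
    """Read one of the sched_rt_* knobs; needs a Linux /proc file system."""
    return int(Path("/proc/sys/kernel", name).read_text())


# Upper end of the operating range in minutes (the "about 35 minutes" above).
max_minutes = INT_MAX / 1_000_000 / 60
```

On a Linux machine the current setting could then be inspected with
``rt_share(read_knob("sched_rt_period_us"), read_knob("sched_rt_runtime_us"))``.
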


2.2 Default behaviour
---------------------

The default values are 1000000 (1s) for sched_rt_period_us and 950000 (0.95s)
for sched_rt_runtime_us. This leaves 0.05s of every second to be used by
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
realtime task will not lock up the machine but leave a little time to recover
it. Setting the runtime to -1 restores the old behaviour (no limit).

By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.

Realtime group scheduling means you have to assign a portion of total CPU
bandwidth to the group before it will accept realtime tasks. Therefore you will
not be able to run realtime tasks as any user other than root until you have
done that, even if the user has the rights to run processes with realtime
priority!


2.3 Basis for grouping tasks
----------------------------

Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.

This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group.

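A minimal sketch of setting the per-group knob from Python follows. The helper
``set_rt_runtime`` is hypothetical; it simply writes a value to the
``cpu.rt_runtime_us`` file inside a directory given by the caller. Where the
cgroup file system is mounted is an assumption left to the caller.

```python
from pathlib import Path


def set_rt_runtime(cgroup_dir: str, runtime_us: int) -> Path:
    """Write runtime_us to <cgroup_dir>/cpu.rt_runtime_us and return the path."""
    knob = Path(cgroup_dir) / "cpu.rt_runtime_us"
    # The knob takes a plain decimal number of microseconds.
    knob.write_text(f"{int(runtime_us)}\n")
    return knob
```

On a real system this would be called as root with the directory of an
existing control group, for example a hypothetical
``/sys/fs/cgroup/cpu/mygroup``; that path is an assumption and depends on how
cgroups are mounted.
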

For more information on working with control groups, you should read
Documentation/admin-guide/cgroup-v1/cgroups.rst as well.

Group settings are checked against the following limits in order to keep the
configuration schedulable:

   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

For now, this can be simplified to just the following (but see Future plans):

   \Sum_{i} runtime_{i} <= global_runtime


3. Future plans
===============

There is work in progress to make the scheduling period for each group
("<cgroup>/cpu.rt_period_us") configurable as well.

The constraint on the period is that a subgroup's period must be smaller than
or equal to its parent's. But realistically it's not very useful _yet_
as it's prone to starvation without deadline scheduling.

Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.

* group A: period=100000us, runtime=50000us

    - this runs for 0.05s once every 0.1s

* group B: period= 50000us, runtime=25000us

    - this runs for 0.025s twice every 0.1s (or once every 0.05s).

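The figures for the two groups can be checked with a few lines of Python. This
is only an arithmetic sketch: the helper names ``bandwidth`` and
``slices_per_window`` are hypothetical, and nothing here models how the
scheduler actually orders the groups.

```python
from fractions import Fraction


def bandwidth(period_us: int, runtime_us: int) -> Fraction:
    """Share of CPU time a group is allowed, as an exact fraction."""
    return Fraction(runtime_us, period_us)


def slices_per_window(period_us: int, window_us: int) -> Fraction:
    """How many periods of the group fit into the given time window."""
    return Fraction(window_us, period_us)


group_a = {"period_us": 100_000, "runtime_us": 50_000}
group_b = {"period_us": 50_000, "runtime_us": 25_000}

for name, g in (("A", group_a), ("B", group_b)):
    share = bandwidth(g["period_us"], g["runtime_us"])
    slices = slices_per_window(g["period_us"], 100_000)
    print(f"group {name}: {float(share):.0%} bandwidth, "
          f"{g['runtime_us'] / 1e6:g}s x {slices} per 0.1s")
```

Both groups come out at the same bandwidth; only the size and frequency of
their time slices differ.
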

With these settings, a while (1) loop in A currently runs for the full period
of B and can starve B's tasks (assuming they are of lower priority) for a whole
period.

The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
full deadline scheduling to the Linux kernel. Deadline scheduling the above
groups and treating end of the period as a deadline will ensure that they both
get their allocated time.

Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge as the current Linux PI infrastructure is geared towards
the limited static priority levels 0-99. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now)).

This means the whole PI machinery will have to be reworked - and that is one of
the most complex pieces of code we have.