==========================
Real-Time group scheduling
==========================

.. CONTENTS

   0. WARNING
   1. Overview
     1.1 The problem
     1.2 The solution
   2. The interface
     2.1 System-wide settings
     2.2 Default behaviour
     2.3 Basis for grouping tasks
   3. Future plans


0. WARNING
==========

 Fiddling with these settings can result in an unstable system. The knobs are
 root only and assume that root knows what they are doing.

Most notably:

 * very small values in sched_rt_period_us can result in an unstable
   system when the period is smaller than either the available hrtimer
   resolution, or the time it takes to handle the budget refresh itself.

 * very small values in sched_rt_runtime_us can result in an unstable
   system when the runtime is so small the system has difficulty making
   forward progress (NOTE: the migration thread and kstopmachine are both
   real-time processes).

1. Overview
===========


1.1 The problem
---------------

Realtime scheduling is all about determinism: a group has to be able to rely on
the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
multiple groups of realtime tasks, each group must be assigned a fixed portion
of the CPU time available.  Without a minimum guarantee a realtime group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. That leaves us with just a single fixed portion.

1.2 The solution
----------------

CPU time is divided by means of specifying how much time can be spent running
in a given period. We allocate this "run time" to each realtime group, and the
other realtime groups are not permitted to use it.

Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.

Let's consider an example: a fixed-frame-rate realtime renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving it with around 80% CPU
time dedicated for the graphics. We can then give this group a run time of 0.8
* 0.04s = 0.032s.

This way the graphics group will have a 0.04s period with a 0.032s run time
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
needs only about 3% CPU time to do so, it can make do with 0.03 * 0.005s =
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
of 0.00015s.

The remaining CPU time will be used for user input and other tasks. Because
realtime tasks have explicitly allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.
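The arithmetic of the example above can be sketched in a few lines of Python.
This is only an illustration: the helper name ``budget_us`` is hypothetical,
and the values are expressed in microseconds to match the units used by the
interface described later.

```python
US_PER_SECOND = 1_000_000


def budget_us(period_us: int, cpu_percent: int) -> int:
    """Return the run time (in microseconds) for a CPU share of one period."""
    return period_us * cpu_percent // 100


# Graphics group: 25 frames per second, roughly 80% of the CPU.
graphics_period_us = US_PER_SECOND // 25  # 40000us = 0.04s
graphics_runtime_us = budget_us(graphics_period_us, 80)

# Audio group: refill the DMA buffer every 0.005s, roughly 3% of the CPU.
audio_period_us = 5_000
audio_runtime_us = budget_us(audio_period_us, 3)

print(graphics_period_us, graphics_runtime_us)  # 40000 32000
print(audio_period_us, audio_runtime_us)        # 5000 150
```

Integer arithmetic is used on purpose so that the results are exact
microsecond counts rather than rounded floating point values.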

NOTE: the above example is not fully implemented yet. We still
lack an EDF scheduler to make non-uniform periods usable.


2. The interface
================


2.1 System-wide settings
------------------------

The system-wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:
  The scheduling period that is equivalent to 100% CPU bandwidth.

/proc/sys/kernel/sched_rt_runtime_us:
  A global limit on how much time realtime scheduling may use.  Even without
  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
  available to all realtime groups.

  * Time is specified in us because the interface is s32. This gives an
    operating range from 1us to about 35 minutes.
  * sched_rt_period_us takes values from 1 to INT_MAX.
  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
  * A run time of -1 specifies runtime == period, i.e. no limit.
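As a rough check of the ranges listed above, the following sketch computes the
upper end of the operating range and validates candidate values. The function
names are hypothetical, and the checks only mirror the ranges documented here;
they are not a copy of the kernel's own validation.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer (s32)


def valid_period_us(value: int) -> bool:
    """sched_rt_period_us takes values from 1 to INT_MAX."""
    return 1 <= value <= INT_MAX


def valid_runtime_us(value: int) -> bool:
    """sched_rt_runtime_us takes values from -1 to INT_MAX - 1."""
    return -1 <= value <= INT_MAX - 1


# INT_MAX microseconds expressed in minutes.
max_minutes = INT_MAX / 1_000_000 / 60
print(f"{max_minutes:.2f}")  # 35.79
```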


2.2 Default behaviour
---------------------

The default values are 1000000 (1s) for sched_rt_period_us and 950000 (0.95s)
for sched_rt_runtime_us.  This leaves 0.05s per period to be used by
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
realtime task will not lock up the machine but leaves a little time to recover
it.  By setting runtime to -1 you'd get the old behaviour back.
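The time left over for SCHED_OTHER under a given pair of settings can be
computed as in the sketch below. The helper names are hypothetical; only the
two /proc file names come from this document, and reading them naturally works
only on a Linux system.

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/kernel")


def read_sysctl_us(name: str) -> int:
    """Read sched_rt_period_us or sched_rt_runtime_us (Linux only)."""
    return int((SYSCTL_DIR / name).read_text().strip())


def reserved_for_other_us(period_us: int, runtime_us: int) -> int:
    """Time per period that realtime tasks may not use.

    A runtime of -1 means runtime == period, so nothing is reserved.
    """
    if runtime_us == -1:
        return 0
    return period_us - runtime_us


print(reserved_for_other_us(1_000_000, 950_000))  # 50000
print(reserved_for_other_us(1_000_000, -1))       # 0
```

On a running system the current values could be passed in with
``reserved_for_other_us(read_sysctl_us("sched_rt_period_us"),
read_sysctl_us("sched_rt_runtime_us"))``.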

By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.

Realtime group scheduling means you have to assign a portion of total CPU
bandwidth to the group before it will accept realtime tasks. Therefore you will
not be able to run realtime tasks as any user other than root until you have
done that, even if the user has the rights to run processes with realtime
priority!


2.3 Basis for grouping tasks
----------------------------

Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.

This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group.

For more information on working with control groups, you should read
Documentation/admin-guide/cgroup-v1/cgroups.rst as well.

Group settings are checked against the following limits in order to keep the
configuration schedulable:

   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

For now, this can be simplified to just the following (but see Future plans):

   \Sum_{i} runtime_{i} <= global_runtime
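The simplified limit above amounts to a single comparison. The sketch below
uses a hypothetical function name and made-up group run times; it illustrates
the inequality only and is not the kernel's actual admission code.

```python
def is_schedulable(group_runtimes_us, global_runtime_us):
    """Check \\Sum_{i} runtime_{i} <= global_runtime."""
    return sum(group_runtimes_us) <= global_runtime_us


GLOBAL_RUNTIME_US = 950_000  # default sched_rt_runtime_us

# Root group reduced to 650000us, new group given 300000us: fits exactly.
print(is_schedulable([650_000, 300_000], GLOBAL_RUNTIME_US))  # True

# Root group left at 700000us: the sum exceeds the global run time.
print(is_schedulable([700_000, 300_000], GLOBAL_RUNTIME_US))  # False
```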


3. Future plans
===============

There is work in progress to make the scheduling period for each group
("<cgroup>/cpu.rt_period_us") configurable as well.

The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically it's not very useful _yet_
as it's prone to starvation without deadline scheduling.

Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.

* group A: period=100000us, runtime=50000us

	- this runs for 0.05s once every 0.1s

* group B: period= 50000us, runtime=25000us

	- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).

This means that currently a while (1) loop in A will run for the full period of
B and can starve B's tasks (assuming they are of lower priority) for a whole
period.
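The starvation described above can be reproduced with a toy model. This is a
deliberately simplified sketch, not the kernel's throttling code: time advances
in fixed ticks, both groups always want to run, budgets are refilled at the
start of each period, and the first group in the list has the higher priority.
All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    period_us: int
    runtime_us: int
    budget_us: int = 0
    received_us: list = field(default_factory=list)  # CPU time per period


def simulate_fixed_priority(groups, horizon_us, tick_us=1000):
    """Run the highest-priority group that still has budget in each tick."""
    for now in range(0, horizon_us, tick_us):
        for g in groups:
            if now % g.period_us == 0:  # start of a new period: refill
                g.budget_us = g.runtime_us
                g.received_us.append(0)
        for g in groups:  # list order is priority order
            if g.budget_us >= tick_us:
                g.budget_us -= tick_us
                g.received_us[-1] += tick_us
                break
    return {g.name: g.received_us for g in groups}


a = Group("A", period_us=100_000, runtime_us=50_000)
b = Group("B", period_us=50_000, runtime_us=25_000)
result = simulate_fixed_priority([a, b], horizon_us=100_000)
print(result)  # {'A': [50000], 'B': [0, 25000]}
```

In this model B receives no CPU time at all during its first period, because
A's budget lasts exactly as long as that period.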

The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
full deadline scheduling to the Linux kernel. Deadline scheduling the above
groups, treating the end of each period as a deadline, will ensure that they
both get their allocated time.
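To illustrate why deadlines help, the toy model below picks, in each tick, the
group whose current period ends first. It is a sketch under the same
simplifying assumptions as any tick-based model (both groups always want to
run, budgets refill at each period start) and says nothing about how
SCHED_EDF would actually be implemented. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    period_us: int
    runtime_us: int
    budget_us: int = 0
    deadline_us: int = 0
    received_us: list = field(default_factory=list)  # CPU time per period


def simulate_edf(groups, horizon_us, tick_us=1000):
    """Run the group with the earliest deadline that still has budget."""
    for now in range(0, horizon_us, tick_us):
        for g in groups:
            if now % g.period_us == 0:  # start of a new period: refill
                g.budget_us = g.runtime_us
                g.deadline_us = now + g.period_us  # end of period = deadline
                g.received_us.append(0)
        runnable = [g for g in groups if g.budget_us >= tick_us]
        if runnable:
            chosen = min(runnable, key=lambda grp: grp.deadline_us)
            chosen.budget_us -= tick_us
            chosen.received_us[-1] += tick_us
    return {g.name: g.received_us for g in groups}


a = Group("A", period_us=100_000, runtime_us=50_000)
b = Group("B", period_us=50_000, runtime_us=25_000)
result = simulate_edf([a, b], horizon_us=100_000)
print(result)  # {'A': [50000], 'B': [25000, 25000]}
```

Here both groups receive their full run time in every period, since B's
earlier deadline lets it run before A during the first 0.05s.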

Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge as the current Linux PI infrastructure is geared towards
the limited static priority levels 0-99. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now)).

This means the whole PI machinery will have to be reworked - and that is one of
the most complex pieces of code we have.