========================
Deadline Task Scheduling
========================

.. CONTENTS

    0. WARNING
    1. Overview
    2. Scheduling algorithm
      2.1 Main algorithm
      2.2 Bandwidth reclaiming
      2.3 Energy-aware scheduling
    3. Scheduling Real-Time Tasks
      3.1 Definitions
      3.2 Schedulability Analysis for Uniprocessor Systems
      3.3 Schedulability Analysis for Multiprocessor Systems
      3.4 Relationship with SCHED_DEADLINE Parameters
    4. Bandwidth management
      4.1 System-wide settings
      4.2 Task interface
      4.3 Default behavior
      4.4 Behavior of sched_yield()
    5. Tasks CPU affinity
      5.1 SCHED_DEADLINE and cpusets HOWTO
    6. Future plans
    A. Test suite
    B. Minimal main()


0. WARNING
==========

 Fiddling with these settings can result in unpredictable or even unstable
 system behavior. As for -rt (group) scheduling, it is assumed that root users
 know what they're doing.


1. Overview
===========

 The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
 basically an implementation of the Earliest Deadline First (EDF) scheduling
 algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
 that makes it possible to isolate the behavior of tasks from each other.


2. Scheduling algorithm
=======================

2.1 Main algorithm
------------------

 SCHED_DEADLINE [18] uses three parameters, named "runtime", "period", and
 "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
 "runtime" microseconds of execution time every "period" microseconds, and
 these "runtime" microseconds are available within "deadline" microseconds
 from the beginning of the period.  In order to implement this behavior,
 every time the task wakes up, the scheduler computes a "scheduling deadline"
 consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
 scheduled using EDF[1] on these scheduling deadlines (the task with the
 earliest scheduling deadline is selected for execution). Notice that the
 task actually receives "runtime" time units within "deadline" if a proper
 "admission control" strategy (see Section "4. Bandwidth management") is used
 (clearly, if the system is overloaded this guarantee cannot be respected).

 Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
 that each task runs for at most its runtime every period, avoiding any
 interference between different tasks (bandwidth isolation), while the EDF[1]
 algorithm selects the task with the earliest scheduling deadline as the one
 to be executed next. Thanks to this feature, tasks that do not strictly comply
 with the "traditional" real-time task model (see Section 3) can effectively
 use the new policy.

 In more detail, the CBS algorithm assigns scheduling deadlines to
 tasks in the following way:

  - Each SCHED_DEADLINE task is characterized by the "runtime",
    "deadline", and "period" parameters;

  - The state of the task is described by a "scheduling deadline", and
    a "remaining runtime". These two parameters are initially set to 0;

  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
    the scheduler checks if::

                 remaining runtime                  runtime
        ----------------------------------    >    ---------
        scheduling deadline - current time           period

    then, if the scheduling deadline is smaller than the current time, or
    this condition holds, the scheduling deadline and the
    remaining runtime are re-initialized as::

         scheduling deadline = current time + deadline
         remaining runtime = runtime

    otherwise, the scheduling deadline and the remaining runtime are
    left unchanged (a C sketch of this wakeup rule follows the list);

  - When a SCHED_DEADLINE task executes for an amount of time t, its
    remaining runtime is decreased as::

         remaining runtime = remaining runtime - t

    (technically, the runtime is decreased at every tick, or when the
    task is descheduled / preempted);

  - When the remaining runtime becomes less than or equal to 0, the task is
    said to be "throttled" (also known as "depleted" in real-time literature)
    and cannot be scheduled until its scheduling deadline. The "replenishment
    time" for this task (see next item) is set to be equal to the current
    value of the scheduling deadline;

  - When the current time is equal to the replenishment time of a
    throttled task, the scheduling deadline and the remaining runtime are
    updated as::

         scheduling deadline = scheduling deadline + period
         remaining runtime = remaining runtime + runtime

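 The wakeup rule above can be sketched in C as follows. This is an
 illustrative model only, not the kernel's implementation (the real code
 lives in kernel/sched/deadline.c); the type and function names are made up,
 and the check is written in cross-multiplied form so that no division is
 needed::

   /*
    * Illustrative sketch of the CBS wakeup rule; all times are in the
    * same unit (e.g., nanoseconds).
    */
   struct dl_state {
	long long runtime;		/* reservation runtime */
	long long deadline;		/* relative deadline */
	long long period;		/* reservation period */
	long long sched_deadline;	/* current scheduling deadline */
	long long remaining_runtime;	/* current remaining runtime */
   };

   static void dl_wakeup(struct dl_state *dl, long long now)
   {
	/*
	 * Cross-multiplied form of
	 *   remaining runtime / (scheduling deadline - current time)
	 *       > runtime / period
	 * (safe because sched_deadline >= now when this term is evaluated).
	 */
	if (dl->sched_deadline < now ||
	    dl->remaining_runtime * dl->period >
	    (dl->sched_deadline - now) * dl->runtime) {
		dl->sched_deadline = now + dl->deadline;
		dl->remaining_runtime = dl->runtime;
	}
	/* otherwise, both values are left unchanged */
   }
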
 The SCHED_FLAG_DL_OVERRUN flag in sched_attr's sched_flags field allows a task
 to get informed about runtime overruns through the delivery of SIGXCPU
 signals.

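 For example, a task can request these notifications with something like the
 sketch below, which assumes the struct sched_attr and the sched_setattr()
 wrapper from Appendix B; the flag value comes from
 include/uapi/linux/sched.h::

   #include <signal.h>

   #define SCHED_FLAG_DL_OVERRUN	0x04	/* from include/uapi/linux/sched.h */

   static void overrun_handler(int sig)
   {
	/* the task consumed more than "runtime" in the current period */
   }

   /* sketch: ask to be signalled on runtime overruns */
   static int enable_overrun_signal(struct sched_attr *attr)
   {
	signal(SIGXCPU, overrun_handler);
	attr->sched_flags |= SCHED_FLAG_DL_OVERRUN;
	return sched_setattr(0, attr, 0);
   }
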
2.2 Bandwidth reclaiming
------------------------

 Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
 Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and is enabled
 when the flag SCHED_FLAG_RECLAIM is set.

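 A task opts in per reservation; a minimal sketch, again reusing the
 Appendix B helpers (the flag value comes from include/uapi/linux/sched.h)::

   #define SCHED_FLAG_RECLAIM	0x02	/* from include/uapi/linux/sched.h */

   /* sketch: opt the calling task into GRUB bandwidth reclaiming */
   static int enable_reclaiming(struct sched_attr *attr)
   {
	attr->sched_flags |= SCHED_FLAG_RECLAIM;
	return sched_setattr(0, attr, 0);
   }
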
 The following diagram illustrates the state names for tasks handled by GRUB::

                             ------------
                 (d)        |   Active   |
              ------------->|            |
              |             | Contending |
              |              ------------
              |                A      |
          ----------           |      |
         |          |          |      |
         | Inactive |          |(b)   | (a)
         |          |          |      |
          ----------           |      |
              A                |      V
              |              ------------
              |             |   Active   |
              --------------|     Non    |
                 (c)        | Contending |
                             ------------

 A task can be in one of the following states:

  - ActiveContending: if it is ready for execution (or executing);

  - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
    time;

  - Inactive: if it is blocked and has surpassed the 0-lag time.

 State transitions:

  (a) When a task blocks, it does not become immediately inactive since its
      bandwidth cannot be immediately reclaimed without breaking the
      real-time guarantees. It therefore enters a transitional state called
      ActiveNonContending. The scheduler arms the "inactive timer" to fire at
      the 0-lag time, when the task's bandwidth can be reclaimed without
      breaking the real-time guarantees.

      The 0-lag time for a task entering the ActiveNonContending state is
      computed as::

                        (runtime * dl_period)
             deadline - ---------------------
                             dl_runtime

      where runtime is the remaining runtime, while dl_runtime and dl_period
      are the reservation parameters (a worked instance of this formula
      follows the list).

  (b) If the task wakes up before the inactive timer fires, the task re-enters
      the ActiveContending state and the "inactive timer" is canceled.
      In addition, if the task wakes up on a different runqueue, then
      the task's utilization must be removed from the previous runqueue's active
      utilization and must be added to the new runqueue's active utilization.
      In order to avoid races between a task waking up on a runqueue while the
      "inactive timer" is running on a different CPU, the "dl_non_contending"
      flag is used to indicate that a task is not on a runqueue but is active
      (so, the flag is set when the task blocks and is cleared when the
      "inactive timer" fires or when the task wakes up).

  (c) When the "inactive timer" fires, the task enters the Inactive state and
      its utilization is removed from the runqueue's active utilization.

  (d) When an inactive task wakes up, it enters the ActiveContending state and
      its utilization is added to the active utilization of the runqueue where
      it has been enqueued.
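
 As a worked instance of the 0-lag formula in (a): with the reservation used
 in the example below (dl_runtime = 4, dl_period = 8), a task that blocks at
 time 2 with 2 units of remaining runtime and a scheduling deadline at time 8
 has its 0-lag time at::

     8 - (2 * 8) / 4 = 8 - 4 = 4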

 For each runqueue, the GRUB algorithm keeps track of two different bandwidths:

  - Active bandwidth (running_bw): this is the sum of the bandwidths of all
    tasks in active state (i.e., ActiveContending or ActiveNonContending);

  - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
    "belonging" to the runqueue, including the tasks in Inactive state.


 The algorithm reclaims the bandwidth of the tasks in Inactive state.
 It does so by decrementing the runtime of the executing task Ti at a pace
 equal to

           dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt

 where:

  - Ui is the bandwidth of task Ti;
  - Umax is the maximum reclaimable utilization (subject to RT throttling
    limits);
  - Uinact is the (per runqueue) inactive utilization, computed as
    (this_bw - running_bw);
  - Uextra is the (per runqueue) extra reclaimable utilization
    (subject to RT throttling limits).

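 In C, the reclaiming rule can be sketched as below; the function name and
 the use of floating point are purely illustrative (the kernel's actual
 implementation, grub_reclaim() in kernel/sched/deadline.c, uses fixed-point
 arithmetic)::

   /* Illustrative sketch: how much runtime is consumed by dt units of
    * execution under GRUB reclaiming; utilizations are in [0, 1]. */
   static double grub_dq(double ui, double umax,
			 double uinact, double uextra, double dt)
   {
	double rate = ui / umax;

	if (1.0 - uinact - uextra > rate)
		rate = 1.0 - uinact - uextra;

	return -rate * dt;	/* added to the remaining runtime */
   }
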

 Let's now see a trivial example of two deadline tasks with runtime equal
 to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::

         A            Task T1
         |
         |                               |
         |                               |
         |--------                       |----
         |       |                       V
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


         A            Task T2
         |
         |                               |
         |                               |
         |       ------------------------|
         |       |                       V
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


         A            running_bw
         |
       1 -----------------               ------
         |               |               |
      0.5-               -----------------
         |                               |
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


  - Time t = 0:

    Both tasks are ready for execution and therefore in ActiveContending state.
    Suppose Task T1 is the first task to start execution.
    Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.

  - Time t = 2:

    Suppose that task T1 blocks.
    Task T1 therefore enters the ActiveNonContending state. Since its remaining
    runtime is equal to 2, its 0-lag time is equal to t = 4.
    Task T2 starts executing, with runtime still decreased as dq = -1 dt since
    there are no inactive tasks.

  - Time t = 4:

    This is the 0-lag time for Task T1. Since it hasn't woken up in the
    meantime, it enters the Inactive state. Its bandwidth is removed from
    running_bw.
    Task T2 continues its execution. However, its runtime is now decreased as
    dq = -0.5 dt because Uinact = 0.5.
    Task T2 therefore reclaims the bandwidth unused by Task T1.

  - Time t = 8:

    Task T1 wakes up. It enters the ActiveContending state again, and the
    running_bw is incremented.


2.3 Energy-aware scheduling
---------------------------

 When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
 GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
 value that still allows the deadlines to be met. This behavior is currently
 implemented only for ARM architectures.

 Particular care must be taken when the time needed to change frequency is of
 the same order of magnitude as the reservation period. In such cases, setting
 a fixed CPU frequency results in fewer deadline misses.


3. Scheduling Real-Time Tasks
=============================


 ..  BIG FAT WARNING ******************************************************

 .. warning::

   This section contains a (not-thorough) summary on classical deadline
   scheduling theory, and how it applies to SCHED_DEADLINE.
   The reader can "safely" skip to Section 4 if only interested in seeing
   how the scheduling policy can be used. Anyway, we strongly recommend
   coming back here and continuing to read (once the urge for testing is
   satisfied :P) to be sure of fully understanding all technical details.

 .. ************************************************************************

 There are no limitations on what kind of task can exploit this new
 scheduling discipline, even if it must be said that it is particularly
 suited for periodic or sporadic real-time tasks that need guarantees on their
 timing behavior, e.g., multimedia, streaming, and control applications.

3.1 Definitions
---------------

 A typical real-time task is composed of a repetition of computation phases
 (task instances, or jobs) which are activated in a periodic or sporadic
 fashion.
 Each job J_j (where J_j is the j^th job of the task) is characterized by an
 arrival time r_j (the time when the job starts), an amount of computation
 time c_j needed to finish the job, and a job absolute deadline d_j, which
 is the time within which the job should be finished. The maximum execution
 time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
 A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
 sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
 d_j = r_j + D, where D is the task's relative deadline.
 Summing up, a real-time task can be described as

	Task = (WCET, D, P)

 The utilization of a real-time task is defined as the ratio between its
 WCET and its period (or minimum inter-arrival time), and represents
 the fraction of CPU time needed to execute the task.

 If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
 to the number of CPUs), then the scheduler is unable to respect all the
 deadlines.
 Note that total utilization is defined as the sum of the utilizations
 WCET_i/P_i over all the real-time tasks in the system. When considering
 multiple real-time tasks, the parameters of the i-th task are indicated
 with the "_i" suffix.
 Moreover, if the total utilization is larger than M, then we risk starving
 non-real-time tasks by real-time tasks.
 If, instead, the total utilization is smaller than M, then non-real-time
 tasks will not be starved and the system might be able to respect all the
 deadlines.
 As a matter of fact, in this case it is possible to provide an upper bound
 for tardiness (defined as the maximum between 0 and the difference
 between the finishing time of a job and its absolute deadline).
 More precisely, it can be proven that using a global EDF scheduler the
 maximum tardiness of each task is smaller than or equal to

	((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max

 where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
 is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
 utilization[12].

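 As a (made-up) numerical instance of this bound: for M = 2 CPUs,
 WCET_max = 10 ms, WCET_min = 5 ms and U_max = 0.5, the maximum tardiness is
 at most::

	((2 − 1) · 10 − 5)/(2 − (2 − 2) · 0.5) + 10 = 5/2 + 10 = 12.5 ms
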
3.2 Schedulability Analysis for Uniprocessor Systems
----------------------------------------------------

 If M=1 (uniprocessor system), or in case of partitioned scheduling (each
 real-time task is statically assigned to one and only one CPU), it is
 possible to formally check if all the deadlines are respected.
 If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
 of all the tasks executing on a CPU if and only if the total utilization
 of the tasks running on such a CPU is smaller than or equal to 1.
 If D_i != P_i for some task, then it is possible to define the density of
 a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
 of all the tasks running on a CPU if the sum of the densities of the tasks
 running on such a CPU is smaller than or equal to 1:

	sum(WCET_i / min{D_i, P_i}) <= 1

 It is important to notice that this condition is only sufficient, and not
 necessary: there are task sets that are schedulable, but do not respect the
 condition. For example, consider the task set {Task_1,Task_2} composed of
 Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
 EDF is clearly able to schedule the two tasks without missing any deadline
 (Task_1 is scheduled as soon as it is released, and finishes just in time
 to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
 its response time cannot be larger than 50ms + 10ms = 60ms) even if

	50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1

 Of course it is possible to test the exact schedulability of tasks with
 D_i != P_i (checking a condition that is both sufficient and necessary),
 but this cannot be done by comparing the total utilization or density with
 a constant. Instead, the so-called "processor demand" approach can be used,
 computing the total amount of CPU time h(t) needed by all the tasks to
 respect all of their deadlines in a time interval of size t, and comparing
 such a time with the interval size t. If h(t) is smaller than t (that is,
 the amount of time needed by the tasks in a time interval of size t is
 smaller than the size of the interval) for all the possible values of t, then
 EDF is able to schedule the tasks respecting all of their deadlines. Since
 performing this check for all possible values of t is impossible, it has been
 proven[4,5,6] that it is sufficient to perform the test for values of t
 between 0 and a maximum value L. The cited papers contain all of the
 mathematical details and explain how to compute h(t) and L.
 In any case, this kind of analysis is too complex as well as too
 time-consuming to be performed on-line. Hence, as explained in Section 4,
 Linux uses an admission test based on the tasks' utilizations.

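 For reference, the demand h(t) for sporadic tasks with relative deadlines
 can be sketched in C as follows; the type and function names are made up,
 and computing the bound L is out of scope here (see [4,5,6])::

   /* Illustrative sketch of the processor demand h(t) described above;
    * all times share one unit (e.g., microseconds). */
   struct rt_task {
	unsigned long long wcet;	/* worst case execution time */
	unsigned long long d;		/* relative deadline */
	unsigned long long p;		/* period / min. inter-arrival time */
   };

   static unsigned long long h(const struct rt_task *tasks, int n,
			       unsigned long long t)
   {
	unsigned long long demand = 0;
	int i;

	for (i = 0; i < n; i++)
		if (t >= tasks[i].d)
			demand += ((t - tasks[i].d) / tasks[i].p + 1)
				  * tasks[i].wcet;

	return demand;
   }
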
3.3 Schedulability Analysis for Multiprocessor Systems
------------------------------------------------------

 On multiprocessor systems with global EDF scheduling (non partitioned
 systems), a sufficient test for schedulability cannot be based on the
 utilizations or densities: it can be shown that even if D_i = P_i task
 sets with utilizations slightly larger than 1 can miss deadlines regardless
 of the number of CPUs.

 Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
 CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
 and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
 arbitrarily small worst case execution time (indicated as "e" here) and a
 period smaller than the one of the first task. Hence, if all the tasks
 activate at the same time t, global EDF schedules these M tasks first
 (because their absolute deadlines are equal to t + P - 1, hence they are
 smaller than the absolute deadline of Task_1, which is t + P). As a
 result, Task_1 can be scheduled only at time t + e, and will finish at
 time t + e + P, after its absolute deadline. The total utilization of the
 task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
 values of e this can become very close to 1. This is known as "Dhall's
 effect"[7]. Note: the example in the original paper by Dhall has been
 slightly simplified here (for example, Dhall more correctly computed
 lim_{e->0}U).

 More complex schedulability tests for global EDF have been developed in
 real-time literature[8,9], but they are not based on a simple comparison
 between total utilization (or density) and a fixed constant. If all tasks
 have D_i = P_i, a sufficient schedulability condition can be expressed in
 a simple way:

	sum(WCET_i / P_i) <= M - (M - 1) · U_max

 where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
 M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
 just confirms Dhall's effect. A more complete survey of the literature
 about schedulability tests for multi-processor real-time scheduling can be
 found in [11].

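 For example, on M = 4 CPUs with U_max = 0.5, this test guarantees
 schedulability under global EDF for any such task set with total utilization
 up to::

	4 - (4 - 1) · 0.5 = 2.5
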
 As seen, enforcing that the total utilization is smaller than M does not
 guarantee that global EDF schedules the tasks without missing any deadline
 (in other words, global EDF is not an optimal scheduling algorithm). However,
 a total utilization smaller than M is enough to guarantee that non-real-time
 tasks are not starved and that the tardiness of real-time tasks has an upper
 bound[12] (as previously noted). Different bounds on the maximum tardiness
 experienced by real-time tasks have been developed in various papers[13,14],
 but the theoretical result that is important for SCHED_DEADLINE is that if
 the total utilization is smaller than or equal to M then the response times of
 the tasks are limited.

3.4 Relationship with SCHED_DEADLINE Parameters
-----------------------------------------------

 Finally, it is important to understand the relationship between the
 SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
 deadline and period) and the real-time task parameters (WCET, D, P)
 described in this section. Note that a task's temporal constraints are
 represented by the absolute deadlines d_j = r_j + D described above, while
 SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
 Section 2).
 If an admission test is used to guarantee that the scheduling deadlines
 are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
 guaranteeing that all the jobs' deadlines of a task are respected.
 In order to do this, a task must be scheduled by setting:

  - runtime >= WCET
  - deadline = D
  - period <= P

 IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
 and the absolute deadlines (d_j) coincide, so a proper admission control
 allows respecting the jobs' absolute deadlines for this task (this is what is
 called the "hard schedulability property" and is an extension of Lemma 1 of
 [2]). Notice that if runtime > deadline the admission control will surely
 reject this task, as it is not possible to respect its temporal constraints.

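 This mapping can be sketched in C as below, reusing the struct sched_attr
 layout and the SCHED_DEADLINE policy number from Appendix B (the helper name
 is made up; values are in nanoseconds)::

   /* Sketch: fill a sched_attr from the (WCET, D, P) task model above */
   static void task_model_to_attr(struct sched_attr *attr,
				  unsigned long long wcet,
				  unsigned long long d,
				  unsigned long long p)
   {
	attr->sched_policy   = SCHED_DEADLINE;
	attr->sched_runtime  = wcet;	/* runtime >= WCET */
	attr->sched_deadline = d;	/* deadline = D */
	attr->sched_period   = p;	/* period <= P */
   }
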
 References:

  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for
      multiprogramming in a hard-real-time environment. Journal of the
      Association for Computing Machinery, 20(1), 1973.
  2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
      Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
  4 - J. Y. Leung and M.L. Merril. A Note on Preemptive Scheduling of
      Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
      no. 3, pp. 115-118, 1980.
  5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
      Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
      11th IEEE Real-time Systems Symposium, 1990.
  6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
      Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
      One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
      1990.
  7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
      research, vol. 26, no. 1, pp 127-140, 1978.
  8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
      Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
  9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
      IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
      pp 760-768, 2005.
  10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
       Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
       vol. 25, no. 2–3, pp. 187–205, 2003.
  11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
       Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
       http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
  12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
       Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
       no. 2, pp 133-189, 2008.
  13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
       Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
       the 26th IEEE Real-Time Systems Symposium, 2005.
  14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
       Global EDF. Proceedings of the 22nd Euromicro Conference on
       Real-Time Systems, 2010.
  15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
       constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time
       Systems, 2000.
  16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
       SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
       Dusseldorf, Germany, 2014.
  17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel
       or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
       Computing, 2016.
  18 - J. Lelli, C. Scordino, L. Abeni, D. Faggioli, Deadline scheduling in the
       Linux kernel, Software: Practice and Experience, 46(6): 821-839, June
       2016.
  19 - C. Scordino, L. Abeni, J. Lelli, Energy-Aware Real-Time Scheduling in
       the Linux Kernel, 33rd ACM/SIGAPP Symposium On Applied Computing (SAC
       2018), Pau, France, April 2018.


4. Bandwidth management
=======================

 As previously mentioned, in order for -deadline scheduling to be
 effective and useful (that is, to be able to provide "runtime" time units
 within "deadline"), it is important to have some method to keep the allocation
 of the available fractions of CPU time to the various tasks under control.
 This is usually called "admission control" and if it is not performed, then
 no guarantee can be given on the actual scheduling of the -deadline tasks.

 As already stated in Section 3, a necessary condition to be respected to
 correctly schedule a set of real-time tasks is that the total utilization
 is smaller than M. When talking about -deadline tasks, this requires that
 the sum of the ratio between runtime and period for all tasks is smaller
 than M. Notice that the ratio runtime/period is equivalent to the utilization
 of a "traditional" real-time task, and is also often referred to as
 "bandwidth".
 The interface used to control the CPU bandwidth that can be allocated
 to -deadline tasks is similar to the one already used for -rt
 tasks with real-time group scheduling (a.k.a. RT-throttling - see
 Documentation/scheduler/sched-rt-group.rst), and is based on readable/writable
 control files located in procfs (for system-wide settings).
 Notice that per-group settings (controlled through cgroupfs) are still not
 defined for -deadline tasks, because more discussion is needed in order to
 figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
 level.

 A main difference between deadline bandwidth management and RT-throttling
 is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
 and thus we don't need a higher level throttling mechanism to enforce the
 desired bandwidth. In other words, this means that interface parameters are
 only used at admission control time (i.e., when the user calls
 sched_setattr()). Scheduling is then performed considering actual tasks'
 parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks
 respecting their needs in terms of granularity. Therefore, using this simple
 interface we can put a cap on total utilization of -deadline tasks (i.e.,
 \Sum (runtime_i / period_i) < global_dl_utilization_cap).

4.1 System-wide settings
------------------------

 The system-wide settings are configured under the /proc virtual file system.

 For now the -rt knobs are used for -deadline admission control and the
 -deadline runtime is accounted against the -rt runtime. We realize that this
 isn't entirely desirable; however, it is better to have a small interface for
 now, and be able to change it easily later. The ideal situation (see Section
 6, Future plans) is to run -rt tasks from a -deadline server; in which case
 the -rt bandwidth is a direct subset of dl_bw.

 This means that, for a root_domain comprising M CPUs, -deadline tasks
 can be created while the sum of their bandwidths stays below:

   M * (sched_rt_runtime_us / sched_rt_period_us)

 It is also possible to disable this bandwidth management logic, and
 thus be free to oversubscribe the system up to any arbitrary level.
 This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
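
 For example, with the default values described in Section 4.3
 (sched_rt_runtime_us = 950000 and sched_rt_period_us = 1000000), a
 root_domain spanning 4 CPUs admits -deadline tasks as long as::

   \Sum (runtime_i / period_i) < 4 * (950000 / 1000000) = 3.8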


4.2 Task interface
------------------

 Specifying a periodic/sporadic task that executes for a given amount of
 runtime at each instance, and that is scheduled according to the urgency of
 its own timing constraints needs, in general, a way of declaring:

  - a (maximum/typical) instance execution time,
  - a minimum interval between consecutive instances,
  - a time constraint by which each instance must be completed.

 Therefore:

  * a new struct sched_attr, containing all the necessary fields, is
    provided;
  * the new scheduling-related syscalls that manipulate it, i.e.,
    sched_setattr() and sched_getattr(), are implemented.

 For debugging purposes, the leftover runtime and absolute deadline of a
 SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
 dl.runtime and dl.deadline, both values in ns). A programmatic way to
 retrieve these values from production code is under discussion.


4.3 Default behavior
--------------------

 The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
 950000. With rt_period equal to 1000000, by default, this means that -deadline
 tasks can use at most 95% of the CPU time, multiplied by the number of CPUs
 that compose the root_domain, for each root_domain.
 This means that non -deadline tasks will receive at least 5% of the CPU time,
 and that -deadline tasks will receive their runtime with a guaranteed
 worst-case delay with respect to the "deadline" parameter. If "deadline" =
 "period" and the cpuset mechanism is used to implement partitioned scheduling
 (see Section 5), then this simple setting of the bandwidth management is able
 to deterministically guarantee that -deadline tasks will receive their runtime
 in a period.

 Finally, notice that in order not to jeopardize the admission control a
 -deadline task cannot fork.


4.4 Behavior of sched_yield()
-----------------------------

 When a SCHED_DEADLINE task calls sched_yield(), it gives up its
 remaining runtime and is immediately throttled, until the next
 period, when its runtime will be replenished (a special flag
 dl_yielded is set and used to correctly handle throttling and runtime
 replenishment after a call to sched_yield()).

 This behavior of sched_yield() allows the task to wake up exactly at
 the beginning of the next period. Also, this may be useful in the
 future with bandwidth reclaiming mechanisms, where sched_yield() will
 make the leftover runtime available for reclamation by other
 SCHED_DEADLINE tasks.
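
 A typical use is a periodic task that completes its job early and yields the
 rest of its runtime until the next replenishment; a minimal sketch of the
 pattern, where do_one_job() is a placeholder for the real work (compare with
 the busy loop in the Appendix B example)::

   static void *run_periodic(void *data)
   {
	while (!done) {
		do_one_job();	/* complete this instance's work */
		sched_yield();	/* throttled until the next period */
	}
	return NULL;
   }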


5. Tasks CPU affinity
=====================

 -deadline tasks cannot have an affinity mask smaller than the entire
 root_domain they are created on. However, affinities can be specified
 through the cpuset facility (Documentation/admin-guide/cgroup-v1/cpusets.rst).

5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

 An example of a simple configuration (pin a -deadline task to CPU0)
 follows (rt-app is used to create a -deadline task)::

   mkdir /dev/cpuset
   mount -t cgroup -o cpuset cpuset /dev/cpuset
   cd /dev/cpuset
   mkdir cpu0
   echo 0 > cpu0/cpuset.cpus
   echo 0 > cpu0/cpuset.mems
   echo 1 > cpuset.cpu_exclusive
   echo 0 > cpuset.sched_load_balance
   echo 1 > cpu0/cpuset.cpu_exclusive
   echo 1 > cpu0/cpuset.mem_exclusive
   echo $$ > cpu0/tasks
   rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
				  # task affinity

6. Future plans
===============

 Still missing:

  - programmatic way to retrieve current runtime and absolute deadline
  - refinements to deadline inheritance, especially regarding the possibility
    of retaining bandwidth isolation among non-interacting tasks. This is
    being studied from both theoretical and practical points of view, and
    hopefully we should be able to produce some demonstrative code soon;
  - (c)group based bandwidth management, and maybe scheduling;
  - access control for non-root users (and related security concerns to
    address): what is the best way to allow unprivileged use of the mechanisms,
    and how can non-root users be prevented from "cheating" the system?

 As already discussed, we are also planning to merge this work with the EDF
 throttling patches [https://lkml.org/lkml/2010/2/23/239] but we are still in
 the preliminary phases of the merge and we really seek feedback that would
 help us decide on the direction it should take.

Appendix A. Test suite
======================

 The SCHED_DEADLINE policy can be easily tested using two applications that
 are part of a wider Linux Scheduler validation suite. The suite is
 available as a GitHub repository: https://github.com/scheduler-tools.

 The first testing application is called rt-app and can be used to
 start multiple threads with specific parameters. rt-app supports
 SCHED_{OTHER,FIFO,RR,DEADLINE} scheduling policies and their related
 parameters (e.g., niceness, priority, runtime/deadline/period). rt-app
 is a valuable tool, as it can be used to synthetically recreate certain
 workloads (maybe mimicking real use-cases) and evaluate how the scheduler
 behaves under such workloads. In this way, results are easily reproducible.
 rt-app is available at: https://github.com/scheduler-tools/rt-app.

 Thread parameters can be specified from the command line, with something like
 this::

  # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5

 The above creates 2 threads. The first one, scheduled by SCHED_DEADLINE,
 executes for 10ms every 100ms. The second one, scheduled at SCHED_FIFO
 priority 10, executes for 20ms every 150ms. The test will run for a total
 of 5 seconds.

 More interestingly, configurations can be described with a JSON file that
 can be passed as input to rt-app with something like this::

  # rt-app my_config.json

 The parameters that can be specified with the second method are a superset
 of the command line options. Please refer to the rt-app documentation for
 more details (`<rt-app-sources>/doc/*.json`).

 The second testing application is a modification of schedtool, called
 schedtool-dl, which can be used to set up SCHED_DEADLINE parameters for a
 certain pid/application. schedtool-dl is available at:
 https://github.com/scheduler-tools/schedtool-dl.git.

 The usage is straightforward::

  # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app

 With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
 of 10ms every 100ms (note that parameters are expressed in nanoseconds).
 You can also use schedtool to create a reservation for an already running
 application, given that you know its pid::

  # schedtool -E -t 10000000:100000000 my_app_pid

Appendix B. Minimal main()
==========================

 We provide in what follows a simple (ugly) self-contained code snippet
 showing how SCHED_DEADLINE reservations can be created by a real-time
 application developer::

   #define _GNU_SOURCE
   #include <unistd.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <time.h>
   #include <linux/unistd.h>
   #include <linux/kernel.h>
   #include <linux/types.h>
   #include <sys/syscall.h>
   #include <pthread.h>

   #define gettid() syscall(__NR_gettid)

   #define SCHED_DEADLINE	6

   /* XXX use the proper syscall numbers */
   #ifdef __x86_64__
   #define __NR_sched_setattr		314
   #define __NR_sched_getattr		315
   #endif

   #ifdef __i386__
   #define __NR_sched_setattr		351
   #define __NR_sched_getattr		352
   #endif

   #ifdef __arm__
   #define __NR_sched_setattr		380
   #define __NR_sched_getattr		381
   #endif

   static volatile int done;

   struct sched_attr {
	__u32 size;

	__u32 sched_policy;
	__u64 sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;

	/* SCHED_DEADLINE (nsec) */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
   };

   int sched_setattr(pid_t pid,
		  const struct sched_attr *attr,
		  unsigned int flags)
   {
	return syscall(__NR_sched_setattr, pid, attr, flags);
   }

   int sched_getattr(pid_t pid,
		  struct sched_attr *attr,
		  unsigned int size,
		  unsigned int flags)
   {
	return syscall(__NR_sched_getattr, pid, attr, size, flags);
   }

   void *run_deadline(void *data)
   {
	struct sched_attr attr;
	int x = 0;
	int ret;
	unsigned int flags = 0;

	printf("deadline thread started [%ld]\n", gettid());

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 10ms/30ms reservation */
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 10 * 1000 * 1000;
	attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	while (!done) {
		x++;
	}

	printf("deadline thread dies [%ld]\n", gettid());
	return NULL;
   }

   int main (int argc, char **argv)
   {
	pthread_t thread;

	printf("main thread [%ld]\n", gettid());

	pthread_create(&thread, NULL, run_deadline, NULL);

	sleep(10);

	done = 1;
	pthread_join(thread, NULL);

	printf("main dies [%ld]\n", gettid());
	return 0;
   }