========================
Deadline Task Scheduling
========================

.. CONTENTS

    0. WARNING
    1. Overview
    2. Scheduling algorithm
      2.1 Main algorithm
      2.2 Bandwidth reclaiming
      2.3 Energy-aware scheduling
    3. Scheduling Real-Time Tasks
      3.1 Definitions
      3.2 Schedulability Analysis for Uniprocessor Systems
      3.3 Schedulability Analysis for Multiprocessor Systems
      3.4 Relationship with SCHED_DEADLINE Parameters
    4. Bandwidth management
      4.1 System-wide settings
      4.2 Task interface
      4.3 Default behavior
      4.4 Behavior of sched_yield()
    5. Tasks CPU affinity
      5.1 SCHED_DEADLINE and cpusets HOWTO
    6. Future plans
    A. Test suite
    B. Minimal main()


0. WARNING
==========

 Fiddling with these settings can result in unpredictable or even unstable
 system behavior. As for -rt (group) scheduling, it is assumed that root users
 know what they're doing.


1. Overview
===========

 The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
 basically an implementation of the Earliest Deadline First (EDF) scheduling
 algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
 that makes it possible to isolate the behavior of tasks from each other.


2. Scheduling algorithm
=======================

2.1 Main algorithm
------------------

 SCHED_DEADLINE [18] uses three parameters, named "runtime", "period", and
 "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
 "runtime" microseconds of execution time every "period" microseconds, and
 these "runtime" microseconds are available within "deadline" microseconds
 from the beginning of the period.  In order to implement this behavior,
 every time the task wakes up, the scheduler computes a "scheduling deadline"
 consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
 scheduled using EDF[1] on these scheduling deadlines (the task with the
 earliest scheduling deadline is selected for execution). Notice that the
 task actually receives "runtime" time units within "deadline" if a proper
 "admission control" strategy (see Section "4. Bandwidth management") is used
 (clearly, if the system is overloaded this guarantee cannot be respected).

 Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
 that each task runs for at most its runtime every period, avoiding any
 interference between different tasks (bandwidth isolation), while the EDF[1]
 algorithm selects the task with the earliest scheduling deadline as the one
 to be executed next. Thanks to this feature, tasks that do not strictly comply
 with the "traditional" real-time task model (see Section 3) can effectively
 use the new policy.

 In more detail, the CBS algorithm assigns scheduling deadlines to
 tasks in the following way:

  - Each SCHED_DEADLINE task is characterized by the "runtime",
    "deadline", and "period" parameters;

  - The state of the task is described by a "scheduling deadline", and
    a "remaining runtime". These two parameters are initially set to 0;

  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
    the scheduler checks if::

                 remaining runtime                  runtime
        ----------------------------------    >    ---------
        scheduling deadline - current time           period

    then, if the scheduling deadline is smaller than the current time, or
    this condition holds, the scheduling deadline and the
    remaining runtime are re-initialized as::

         scheduling deadline = current time + deadline
         remaining runtime = runtime

    otherwise, the scheduling deadline and the remaining runtime are
    left unchanged (a C sketch of this wakeup rule follows the list);

  - When a SCHED_DEADLINE task executes for an amount of time t, its
    remaining runtime is decreased as::

         remaining runtime = remaining runtime - t

    (technically, the runtime is decreased at every tick, or when the
    task is descheduled / preempted);

  - When the remaining runtime becomes less than or equal to 0, the task is
    said to be "throttled" (also known as "depleted" in real-time literature)
    and cannot be scheduled until its scheduling deadline. The "replenishment
    time" for this task (see next item) is set to be equal to the current
    value of the scheduling deadline;

  - When the current time is equal to the replenishment time of a
    throttled task, the scheduling deadline and the remaining runtime are
    updated as::

         scheduling deadline = scheduling deadline + period
         remaining runtime = remaining runtime + runtime

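 The wakeup rule above can be sketched in C as follows. This is an
 illustrative model only, not the kernel's implementation (the real code
 lives in kernel/sched/deadline.c); the type and function names are made up,
 and the check is written in cross-multiplied form so that no division is
 needed::

   /*
    * Illustrative sketch of the CBS wakeup rule; all times are in the
    * same unit (e.g., nanoseconds).
    */
   struct dl_state {
	long long runtime;		/* reservation runtime */
	long long deadline;		/* relative deadline */
	long long period;		/* reservation period */
	long long sched_deadline;	/* current scheduling deadline */
	long long remaining_runtime;	/* current remaining runtime */
   };

   static void dl_wakeup(struct dl_state *dl, long long now)
   {
	/*
	 * Cross-multiplied form of
	 *   remaining runtime / (scheduling deadline - current time)
	 *       > runtime / period
	 * (safe because sched_deadline >= now when this term is evaluated).
	 */
	if (dl->sched_deadline < now ||
	    dl->remaining_runtime * dl->period >
	    (dl->sched_deadline - now) * dl->runtime) {
		dl->sched_deadline = now + dl->deadline;
		dl->remaining_runtime = dl->runtime;
	}
	/* otherwise, both values are left unchanged */
   }
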
 The SCHED_FLAG_DL_OVERRUN flag in sched_attr's sched_flags field allows a task
 to get informed about runtime overruns through the delivery of SIGXCPU
 signals.

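 For example, a task can request these notifications with something like the
 sketch below, which assumes the struct sched_attr and the sched_setattr()
 wrapper from Appendix B; the flag value comes from
 include/uapi/linux/sched.h::

   #include <signal.h>

   #define SCHED_FLAG_DL_OVERRUN	0x04	/* from include/uapi/linux/sched.h */

   static void overrun_handler(int sig)
   {
	/* the task consumed more than "runtime" in the current period */
   }

   /* sketch: ask to be signalled on runtime overruns */
   static int enable_overrun_signal(struct sched_attr *attr)
   {
	signal(SIGXCPU, overrun_handler);
	attr->sched_flags |= SCHED_FLAG_DL_OVERRUN;
	return sched_setattr(0, attr, 0);
   }
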
2.2 Bandwidth reclaiming
------------------------

 Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
 Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and is enabled
 when the flag SCHED_FLAG_RECLAIM is set.

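 A task opts in per reservation; a minimal sketch, again reusing the
 Appendix B helpers (the flag value comes from include/uapi/linux/sched.h)::

   #define SCHED_FLAG_RECLAIM	0x02	/* from include/uapi/linux/sched.h */

   /* sketch: opt the calling task into GRUB bandwidth reclaiming */
   static int enable_reclaiming(struct sched_attr *attr)
   {
	attr->sched_flags |= SCHED_FLAG_RECLAIM;
	return sched_setattr(0, attr, 0);
   }
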
 The following diagram illustrates the state names for tasks handled by GRUB::

                             ------------
                 (d)        |   Active   |
              ------------->|            |
              |             | Contending |
              |              ------------
              |                A      |
          ----------           |      |
         |          |          |      |
         | Inactive |          |(b)   | (a)
         |          |          |      |
          ----------           |      |
              A                |      V
              |              ------------
              |             |   Active   |
              --------------|     Non    |
                 (c)        | Contending |
                             ------------

 A task can be in one of the following states:

  - ActiveContending: if it is ready for execution (or executing);

  - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
    time;

  - Inactive: if it is blocked and has surpassed the 0-lag time.

 State transitions:

  (a) When a task blocks, it does not become immediately inactive since its
      bandwidth cannot be immediately reclaimed without breaking the
      real-time guarantees. It therefore enters a transitional state called
      ActiveNonContending. The scheduler arms the "inactive timer" to fire at
      the 0-lag time, when the task's bandwidth can be reclaimed without
      breaking the real-time guarantees.

      The 0-lag time for a task entering the ActiveNonContending state is
      computed as::

                        (runtime * dl_period)
             deadline - ---------------------
                             dl_runtime

      where runtime is the remaining runtime, while dl_runtime and dl_period
      are the reservation parameters (a worked instance of this formula
      follows the list).

  (b) If the task wakes up before the inactive timer fires, the task re-enters
      the ActiveContending state and the "inactive timer" is canceled.
      In addition, if the task wakes up on a different runqueue, then
      the task's utilization must be removed from the previous runqueue's active
      utilization and must be added to the new runqueue's active utilization.
      In order to avoid races between a task waking up on a runqueue while the
      "inactive timer" is running on a different CPU, the "dl_non_contending"
      flag is used to indicate that a task is not on a runqueue but is active
      (so, the flag is set when the task blocks and is cleared when the
      "inactive timer" fires or when the task wakes up).

  (c) When the "inactive timer" fires, the task enters the Inactive state and
      its utilization is removed from the runqueue's active utilization.

  (d) When an inactive task wakes up, it enters the ActiveContending state and
      its utilization is added to the active utilization of the runqueue where
      it has been enqueued.
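
 As a worked instance of the 0-lag formula in (a): with the reservation used
 in the example below (dl_runtime = 4, dl_period = 8), a task that blocks at
 time 2 with 2 units of remaining runtime and a scheduling deadline at time 8
 has its 0-lag time at::

     8 - (2 * 8) / 4 = 8 - 4 = 4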

 For each runqueue, the GRUB algorithm keeps track of two different bandwidths:

  - Active bandwidth (running_bw): this is the sum of the bandwidths of all
    tasks in active state (i.e., ActiveContending or ActiveNonContending);

  - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
    "belonging" to the runqueue, including the tasks in Inactive state.


 The algorithm reclaims the bandwidth of the tasks in Inactive state.
 It does so by decrementing the runtime of the executing task Ti at a pace
 equal to

           dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt

 where:

  - Ui is the bandwidth of task Ti;
  - Umax is the maximum reclaimable utilization (subject to RT throttling
    limits);
  - Uinact is the (per runqueue) inactive utilization, computed as
    (this_bw - running_bw);
  - Uextra is the (per runqueue) extra reclaimable utilization
    (subject to RT throttling limits).

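 In C, the reclaiming rule can be sketched as below; the function name and
 the use of floating point are purely illustrative (the kernel's actual
 implementation, grub_reclaim() in kernel/sched/deadline.c, uses fixed-point
 arithmetic)::

   /* Illustrative sketch: how much runtime is consumed by dt units of
    * execution under GRUB reclaiming; utilizations are in [0, 1]. */
   static double grub_dq(double ui, double umax,
			 double uinact, double uextra, double dt)
   {
	double rate = ui / umax;

	if (1.0 - uinact - uextra > rate)
		rate = 1.0 - uinact - uextra;

	return -rate * dt;	/* added to the remaining runtime */
   }
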

 Let's now see a trivial example of two deadline tasks with runtime equal
 to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::

         A            Task T1
         |
         |                               |
         |                               |
         |--------                       |----
         |       |                       V
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


         A            Task T2
         |
         |                               |
         |                               |
         |       ------------------------|
         |       |                       V
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


         A            running_bw
         |
       1 -----------------               ------
         |               |               |
      0.5-               -----------------
         |                               |
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


  - Time t = 0:

    Both tasks are ready for execution and therefore in ActiveContending state.
    Suppose Task T1 is the first task to start execution.
    Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.

  - Time t = 2:

    Suppose that task T1 blocks.
    Task T1 therefore enters the ActiveNonContending state. Since its remaining
    runtime is equal to 2, its 0-lag time is equal to t = 4.
    Task T2 starts executing, with runtime still decreased as dq = -1 dt since
    there are no inactive tasks.

  - Time t = 4:

    This is the 0-lag time for Task T1. Since it hasn't woken up in the
    meantime, it enters the Inactive state. Its bandwidth is removed from
    running_bw.
    Task T2 continues its execution. However, its runtime is now decreased as
    dq = -0.5 dt because Uinact = 0.5.
    Task T2 therefore reclaims the bandwidth unused by Task T1.

  - Time t = 8:

    Task T1 wakes up. It enters the ActiveContending state again, and the
    running_bw is incremented.


2.3 Energy-aware scheduling
---------------------------

 When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
 GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
 value that still allows the deadlines to be met. This behavior is currently
 implemented only for ARM architectures.

 Particular care must be taken when the time needed to change frequency is of
 the same order of magnitude as the reservation period. In such cases, setting
 a fixed CPU frequency results in fewer deadline misses.


3. Scheduling Real-Time Tasks
=============================


 ..  BIG FAT WARNING ******************************************************

 .. warning::

   This section contains a (not-thorough) summary on classical deadline
   scheduling theory, and how it applies to SCHED_DEADLINE.
   The reader can "safely" skip to Section 4 if only interested in seeing
   how the scheduling policy can be used. Anyway, we strongly recommend
   coming back here and continuing to read (once the urge for testing is
   satisfied :P) to be sure of fully understanding all technical details.

 .. ************************************************************************

 There are no limitations on what kind of task can exploit this new
 scheduling discipline, even if it must be said that it is particularly
 suited for periodic or sporadic real-time tasks that need guarantees on their
 timing behavior, e.g., multimedia, streaming, and control applications.

3.1 Definitions
---------------

 A typical real-time task is composed of a repetition of computation phases
 (task instances, or jobs) which are activated in a periodic or sporadic
 fashion.
 Each job J_j (where J_j is the j^th job of the task) is characterized by an
 arrival time r_j (the time when the job starts), an amount of computation
 time c_j needed to finish the job, and a job absolute deadline d_j, which
 is the time within which the job should be finished. The maximum execution
 time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
 A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
 sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
 d_j = r_j + D, where D is the task's relative deadline.
 Summing up, a real-time task can be described as

	Task = (WCET, D, P)

 The utilization of a real-time task is defined as the ratio between its
 WCET and its period (or minimum inter-arrival time), and represents
 the fraction of CPU time needed to execute the task.

 If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
 to the number of CPUs), then the scheduler is unable to respect all the
 deadlines.
 Note that total utilization is defined as the sum of the utilizations
 WCET_i/P_i over all the real-time tasks in the system. When considering
 multiple real-time tasks, the parameters of the i-th task are indicated
 with the "_i" suffix.
 Moreover, if the total utilization is larger than M, then we risk starving
 non-real-time tasks by real-time tasks.
 If, instead, the total utilization is smaller than M, then non-real-time
 tasks will not be starved and the system might be able to respect all the
 deadlines.
 As a matter of fact, in this case it is possible to provide an upper bound
 for tardiness (defined as the maximum between 0 and the difference
 between the finishing time of a job and its absolute deadline).
 More precisely, it can be proven that using a global EDF scheduler the
 maximum tardiness of each task is smaller than or equal to

	((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max

 where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
 is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
 utilization[12].

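 As a (made-up) numerical instance of this bound: for M = 2 CPUs,
 WCET_max = 10 ms, WCET_min = 5 ms and U_max = 0.5, the maximum tardiness is
 at most::

	((2 − 1) · 10 − 5)/(2 − (2 − 2) · 0.5) + 10 = 5/2 + 10 = 12.5 ms
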
3.2 Schedulability Analysis for Uniprocessor Systems
----------------------------------------------------

 If M=1 (uniprocessor system), or in case of partitioned scheduling (each
 real-time task is statically assigned to one and only one CPU), it is
 possible to formally check if all the deadlines are respected.
 If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
 of all the tasks executing on a CPU if and only if the total utilization
 of the tasks running on such a CPU is smaller than or equal to 1.
 If D_i != P_i for some task, then it is possible to define the density of
 a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
 of all the tasks running on a CPU if the sum of the densities of the tasks
 running on such a CPU is smaller than or equal to 1:

	sum(WCET_i / min{D_i, P_i}) <= 1

 It is important to notice that this condition is only sufficient, and not
 necessary: there are task sets that are schedulable, but do not respect the
 condition. For example, consider the task set {Task_1,Task_2} composed of
 Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
 EDF is clearly able to schedule the two tasks without missing any deadline
 (Task_1 is scheduled as soon as it is released, and finishes just in time
 to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
 its response time cannot be larger than 50ms + 10ms = 60ms) even if

	50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1

 Of course it is possible to test the exact schedulability of tasks with
 D_i != P_i (checking a condition that is both sufficient and necessary),
 but this cannot be done by comparing the total utilization or density with
 a constant. Instead, the so-called "processor demand" approach can be used,
 computing the total amount of CPU time h(t) needed by all the tasks to
 respect all of their deadlines in a time interval of size t, and comparing
 such a time with the interval size t. If h(t) is smaller than t (that is,
 the amount of time needed by the tasks in a time interval of size t is
 smaller than the size of the interval) for all the possible values of t, then
 EDF is able to schedule the tasks respecting all of their deadlines. Since
 performing this check for all possible values of t is impossible, it has been
 proven[4,5,6] that it is sufficient to perform the test for values of t
 between 0 and a maximum value L. The cited papers contain all of the
 mathematical details and explain how to compute h(t) and L.
 In any case, this kind of analysis is too complex as well as too
 time-consuming to be performed on-line. Hence, as explained in Section 4,
 Linux uses an admission test based on the tasks' utilizations.

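 For reference, the demand h(t) for sporadic tasks with relative deadlines
 can be sketched in C as follows; the type and function names are made up,
 and computing the bound L is out of scope here (see [4,5,6])::

   /* Illustrative sketch of the processor demand h(t) described above;
    * all times share one unit (e.g., microseconds). */
   struct rt_task {
	unsigned long long wcet;	/* worst case execution time */
	unsigned long long d;		/* relative deadline */
	unsigned long long p;		/* period / min. inter-arrival time */
   };

   static unsigned long long h(const struct rt_task *tasks, int n,
			       unsigned long long t)
   {
	unsigned long long demand = 0;
	int i;

	for (i = 0; i < n; i++)
		if (t >= tasks[i].d)
			demand += ((t - tasks[i].d) / tasks[i].p + 1)
				  * tasks[i].wcet;

	return demand;
   }
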
3.3 Schedulability Analysis for Multiprocessor Systems
------------------------------------------------------

 On multiprocessor systems with global EDF scheduling (non partitioned
 systems), a sufficient test for schedulability cannot be based on the
 utilizations or densities: it can be shown that even if D_i = P_i task
 sets with utilizations slightly larger than 1 can miss deadlines regardless
 of the number of CPUs.

 Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
 CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
 and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
 arbitrarily small worst case execution time (indicated as "e" here) and a
 period smaller than the one of the first task. Hence, if all the tasks
 activate at the same time t, global EDF schedules these M tasks first
 (because their absolute deadlines are equal to t + P - 1, hence they are
 smaller than the absolute deadline of Task_1, which is t + P). As a
 result, Task_1 can be scheduled only at time t + e, and will finish at
 time t + e + P, after its absolute deadline. The total utilization of the
 task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
 values of e this can become very close to 1. This is known as "Dhall's
 effect"[7]. Note: the example in the original paper by Dhall has been
 slightly simplified here (for example, Dhall more correctly computed
 lim_{e->0}U).

 More complex schedulability tests for global EDF have been developed in
 real-time literature[8,9], but they are not based on a simple comparison
 between total utilization (or density) and a fixed constant. If all tasks
 have D_i = P_i, a sufficient schedulability condition can be expressed in
 a simple way:

	sum(WCET_i / P_i) <= M - (M - 1) · U_max

 where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
 M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
 just confirms Dhall's effect. A more complete survey of the literature
 about schedulability tests for multi-processor real-time scheduling can be
 found in [11].

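 For example, on M = 4 CPUs with U_max = 0.5, this test guarantees
 schedulability under global EDF for any such task set with total utilization
 up to::

	4 - (4 - 1) · 0.5 = 2.5
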
 As seen, enforcing that the total utilization is smaller than M does not
 guarantee that global EDF schedules the tasks without missing any deadline
 (in other words, global EDF is not an optimal scheduling algorithm). However,
 a total utilization smaller than M is enough to guarantee that non-real-time
 tasks are not starved and that the tardiness of real-time tasks has an upper
 bound[12] (as previously noted). Different bounds on the maximum tardiness
 experienced by real-time tasks have been developed in various papers[13,14],
 but the theoretical result that is important for SCHED_DEADLINE is that if
 the total utilization is smaller than or equal to M then the response times of
 the tasks are limited.

3.4 Relationship with SCHED_DEADLINE Parameters
-----------------------------------------------

 Finally, it is important to understand the relationship between the
 SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
 deadline and period) and the real-time task parameters (WCET, D, P)
 described in this section. Note that a task's temporal constraints are
 represented by the absolute deadlines d_j = r_j + D described above, while
 SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
 Section 2).
 If an admission test is used to guarantee that the scheduling deadlines
 are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
 guaranteeing that all the jobs' deadlines of a task are respected.
 In order to do this, a task must be scheduled by setting:

  - runtime >= WCET
  - deadline = D
  - period <= P

 IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
 and the absolute deadlines (d_j) coincide, so a proper admission control
 allows respecting the jobs' absolute deadlines for this task (this is what is
 called the "hard schedulability property" and is an extension of Lemma 1 of
 [2]). Notice that if runtime > deadline the admission control will surely
 reject this task, as it is not possible to respect its temporal constraints.

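 This mapping can be sketched in C as below, reusing the struct sched_attr
 layout and the SCHED_DEADLINE policy number from Appendix B (the helper name
 is made up; values are in nanoseconds)::

   /* Sketch: fill a sched_attr from the (WCET, D, P) task model above */
   static void task_model_to_attr(struct sched_attr *attr,
				  unsigned long long wcet,
				  unsigned long long d,
				  unsigned long long p)
   {
	attr->sched_policy   = SCHED_DEADLINE;
	attr->sched_runtime  = wcet;	/* runtime >= WCET */
	attr->sched_deadline = d;	/* deadline = D */
	attr->sched_period   = p;	/* period <= P */
   }
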
 References:

  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for
      multiprogramming in a hard-real-time environment. Journal of the
      Association for Computing Machinery, 20(1), 1973.
  2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
      Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
  4 - J. Y. Leung and M.L. Merril. A Note on Preemptive Scheduling of
      Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
      no. 3, pp. 115-118, 1980.
  5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
      Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
      11th IEEE Real-time Systems Symposium, 1990.
  6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
      Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
      One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
      1990.
  7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
      research, vol. 26, no. 1, pp 127-140, 1978.
  8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
      Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
  9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
      IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
      pp 760-768, 2005.
  10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
       Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
       vol. 25, no. 2–3, pp. 187–205, 2003.
  11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
       Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
       http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
  12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
       Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
       no. 2, pp 133-189, 2008.
  13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
       Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
       the 26th IEEE Real-Time Systems Symposium, 2005.
  14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
       Global EDF. Proceedings of the 22nd Euromicro Conference on
       Real-Time Systems, 2010.
  15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
       constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time
       Systems, 2000.
  16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
       SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
       Dusseldorf, Germany, 2014.
  17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel
       or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
       Computing, 2016.
  18 - J. Lelli, C. Scordino, L. Abeni, D. Faggioli, Deadline scheduling in the
       Linux kernel, Software: Practice and Experience, 46(6): 821-839, June
       2016.
  19 - C. Scordino, L. Abeni, J. Lelli, Energy-Aware Real-Time Scheduling in
       the Linux Kernel, 33rd ACM/SIGAPP Symposium On Applied Computing (SAC
       2018), Pau, France, April 2018.


4. Bandwidth management
=======================

 As previously mentioned, in order for -deadline scheduling to be
 effective and useful (that is, to be able to provide "runtime" time units
 within "deadline"), it is important to have some method to keep the allocation
 of the available fractions of CPU time to the various tasks under control.
 This is usually called "admission control" and if it is not performed, then
 no guarantee can be given on the actual scheduling of the -deadline tasks.

 As already stated in Section 3, a necessary condition to be respected to
 correctly schedule a set of real-time tasks is that the total utilization
 is smaller than M. When talking about -deadline tasks, this requires that
 the sum of the ratio between runtime and period for all tasks is smaller
 than M. Notice that the ratio runtime/period is equivalent to the utilization
 of a "traditional" real-time task, and is also often referred to as
 "bandwidth".
 The interface used to control the CPU bandwidth that can be allocated
 to -deadline tasks is similar to the one already used for -rt
 tasks with real-time group scheduling (a.k.a. RT-throttling - see
 Documentation/scheduler/sched-rt-group.rst), and is based on readable/writable
 control files located in procfs (for system-wide settings).
 Notice that per-group settings (controlled through cgroupfs) are still not
 defined for -deadline tasks, because more discussion is needed in order to
 figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
 level.

 A main difference between deadline bandwidth management and RT-throttling
 is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
 and thus we don't need a higher level throttling mechanism to enforce the
 desired bandwidth. In other words, this means that interface parameters are
 only used at admission control time (i.e., when the user calls
 sched_setattr()). Scheduling is then performed considering actual tasks'
 parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks
 respecting their needs in terms of granularity. Therefore, using this simple
 interface we can put a cap on total utilization of -deadline tasks (i.e.,
 \Sum (runtime_i / period_i) < global_dl_utilization_cap).

4.1 System-wide settings
------------------------

 The system-wide settings are configured under the /proc virtual file system.

 For now the -rt knobs are used for -deadline admission control and the
 -deadline runtime is accounted against the -rt runtime. We realize that this
 isn't entirely desirable; however, it is better to have a small interface for
 now, and be able to change it easily later. The ideal situation (see Section
 6, Future plans) is to run -rt tasks from a -deadline server; in which case
 the -rt bandwidth is a direct subset of dl_bw.

 This means that, for a root_domain comprising M CPUs, -deadline tasks
 can be created while the sum of their bandwidths stays below:

   M * (sched_rt_runtime_us / sched_rt_period_us)

 It is also possible to disable this bandwidth management logic, and
 thus be free to oversubscribe the system up to any arbitrary level.
 This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
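
 For example, with the default values described in Section 4.3
 (sched_rt_runtime_us = 950000 and sched_rt_period_us = 1000000), a
 root_domain spanning 4 CPUs admits -deadline tasks as long as::

   \Sum (runtime_i / period_i) < 4 * (950000 / 1000000) = 3.8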


4.2 Task interface
------------------

 Specifying a periodic/sporadic task that executes for a given amount of
 runtime at each instance, and that is scheduled according to the urgency of
 its own timing constraints needs, in general, a way of declaring:

  - a (maximum/typical) instance execution time,
  - a minimum interval between consecutive instances,
  - a time constraint by which each instance must be completed.

 Therefore:

  * a new struct sched_attr, containing all the necessary fields, is
    provided;
  * the new scheduling-related syscalls that manipulate it, i.e.,
    sched_setattr() and sched_getattr(), are implemented.

 For debugging purposes, the leftover runtime and absolute deadline of a
 SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
 dl.runtime and dl.deadline, both values in ns). A programmatic way to
 retrieve these values from production code is under discussion.


4.3 Default behavior
--------------------

 The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
 950000. With rt_period equal to 1000000, by default, this means that -deadline
 tasks can use at most 95% of the CPU time, multiplied by the number of CPUs
 that compose the root_domain, for each root_domain.
 This means that non -deadline tasks will receive at least 5% of the CPU time,
 and that -deadline tasks will receive their runtime with a guaranteed
 worst-case delay with respect to the "deadline" parameter. If "deadline" =
 "period" and the cpuset mechanism is used to implement partitioned scheduling
 (see Section 5), then this simple setting of the bandwidth management is able
 to deterministically guarantee that -deadline tasks will receive their runtime
 in a period.

 Finally, notice that in order not to jeopardize the admission control a
 -deadline task cannot fork.


4.4 Behavior of sched_yield()
-----------------------------

 When a SCHED_DEADLINE task calls sched_yield(), it gives up its
 remaining runtime and is immediately throttled, until the next
 period, when its runtime will be replenished (a special flag
 dl_yielded is set and used to correctly handle throttling and runtime
 replenishment after a call to sched_yield()).

 This behavior of sched_yield() allows the task to wake up exactly at
 the beginning of the next period. Also, this may be useful in the
 future with bandwidth reclaiming mechanisms, where sched_yield() will
 make the leftover runtime available for reclamation by other
 SCHED_DEADLINE tasks.
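
 A typical use is a periodic task that completes its job early and yields the
 rest of its runtime until the next replenishment; a minimal sketch of the
 pattern, where do_one_job() is a placeholder for the real work (compare with
 the busy loop in the Appendix B example)::

   static void *run_periodic(void *data)
   {
	while (!done) {
		do_one_job();	/* complete this instance's work */
		sched_yield();	/* throttled until the next period */
	}
	return NULL;
   }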


5. Tasks CPU affinity
=====================

 -deadline tasks cannot have an affinity mask smaller than the entire
 root_domain they are created on. However, affinities can be specified
 through the cpuset facility (Documentation/admin-guide/cgroup-v1/cpusets.rst).

5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

 An example of a simple configuration (pin a -deadline task to CPU0)
 follows (rt-app is used to create a -deadline task)::

   mkdir /dev/cpuset
   mount -t cgroup -o cpuset cpuset /dev/cpuset
   cd /dev/cpuset
   mkdir cpu0
   echo 0 > cpu0/cpuset.cpus
   echo 0 > cpu0/cpuset.mems
   echo 1 > cpuset.cpu_exclusive
   echo 0 > cpuset.sched_load_balance
   echo 1 > cpu0/cpuset.cpu_exclusive
   echo 1 > cpu0/cpuset.mem_exclusive
   echo $$ > cpu0/tasks
   rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
				  # task affinity

6. Future plans
===============

 Still missing:

  - programmatic way to retrieve current runtime and absolute deadline
  - refinements to deadline inheritance, especially regarding the possibility
    of retaining bandwidth isolation among non-interacting tasks. This is
    being studied from both theoretical and practical points of view, and
    hopefully we should be able to produce some demonstrative code soon;
  - (c)group based bandwidth management, and maybe scheduling;
  - access control for non-root users (and related security concerns to
    address): what is the best way to allow unprivileged use of the mechanisms,
    and how can non-root users be prevented from "cheating" the system?

 As already discussed, we are also planning to merge this work with the EDF
 throttling patches [https://lkml.org/lkml/2010/2/23/239] but we are still in
 the preliminary phases of the merge and we really seek feedback that would
 help us decide on the direction it should take.

Appendix A. Test suite
======================

 The SCHED_DEADLINE policy can be easily tested using two applications that
 are part of a wider Linux Scheduler validation suite. The suite is
 available as a GitHub repository: https://github.com/scheduler-tools.

 The first testing application is called rt-app and can be used to
 start multiple threads with specific parameters. rt-app supports
 SCHED_{OTHER,FIFO,RR,DEADLINE} scheduling policies and their related
 parameters (e.g., niceness, priority, runtime/deadline/period). rt-app
 is a valuable tool, as it can be used to synthetically recreate certain
 workloads (maybe mimicking real use-cases) and evaluate how the scheduler
 behaves under such workloads. In this way, results are easily reproducible.
 rt-app is available at: https://github.com/scheduler-tools/rt-app.

 Thread parameters can be specified from the command line, with something like
 this::

  # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5

 The above creates 2 threads. The first one, scheduled by SCHED_DEADLINE,
 executes for 10ms every 100ms. The second one, scheduled at SCHED_FIFO
 priority 10, executes for 20ms every 150ms. The test will run for a total
 of 5 seconds.

 More interestingly, configurations can be described with a JSON file that
 can be passed as input to rt-app with something like this::

  # rt-app my_config.json

 The parameters that can be specified with the second method are a superset
 of the command line options. Please refer to the rt-app documentation for
 more details (`<rt-app-sources>/doc/*.json`).

 The second testing application is a modification of schedtool, called
 schedtool-dl, which can be used to set up SCHED_DEADLINE parameters for a
 certain pid/application. schedtool-dl is available at:
 https://github.com/scheduler-tools/schedtool-dl.git.

 The usage is straightforward::

  # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app

 With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
 of 10ms every 100ms (note that parameters are expressed in nanoseconds).
 You can also use schedtool to create a reservation for an already running
 application, given that you know its pid::

  # schedtool -E -t 10000000:100000000 my_app_pid

Appendix B. Minimal main()
==========================

 We provide in what follows a simple (ugly) self-contained code snippet
 showing how SCHED_DEADLINE reservations can be created by a real-time
 application developer::

   #define _GNU_SOURCE
   #include <unistd.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <time.h>
   #include <linux/unistd.h>
   #include <linux/kernel.h>
   #include <linux/types.h>
   #include <sys/syscall.h>
   #include <pthread.h>

   #define gettid() syscall(__NR_gettid)

   #define SCHED_DEADLINE	6

   /* XXX use the proper syscall numbers */
   #ifdef __x86_64__
   #define __NR_sched_setattr		314
   #define __NR_sched_getattr		315
   #endif

   #ifdef __i386__
   #define __NR_sched_setattr		351
   #define __NR_sched_getattr		352
   #endif

   #ifdef __arm__
   #define __NR_sched_setattr		380
   #define __NR_sched_getattr		381
   #endif

   static volatile int done;

   struct sched_attr {
	__u32 size;

	__u32 sched_policy;
	__u64 sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;

	/* SCHED_DEADLINE (nsec) */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
   };

   int sched_setattr(pid_t pid,
		  const struct sched_attr *attr,
		  unsigned int flags)
   {
	return syscall(__NR_sched_setattr, pid, attr, flags);
   }

   int sched_getattr(pid_t pid,
		  struct sched_attr *attr,
		  unsigned int size,
		  unsigned int flags)
   {
	return syscall(__NR_sched_getattr, pid, attr, size, flags);
   }

   void *run_deadline(void *data)
   {
	struct sched_attr attr;
	int x = 0;
	int ret;
	unsigned int flags = 0;

	printf("deadline thread started [%ld]\n", gettid());

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 10ms/30ms reservation */
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 10 * 1000 * 1000;
	attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	while (!done) {
		x++;
	}

	printf("deadline thread dies [%ld]\n", gettid());
	return NULL;
   }

   int main (int argc, char **argv)
   {
	pthread_t thread;

	printf("main thread [%ld]\n", gettid());

	pthread_create(&thread, NULL, run_deadline, NULL);

	sleep(10);

	done = 1;
	pthread_join(thread, NULL);

	printf("main dies [%ld]\n", gettid());
	return 0;
   }