========================
Deadline Task Scheduling
========================

.. CONTENTS

    0. WARNING
    1. Overview
    2. Scheduling algorithm
      2.1 Main algorithm
      2.2 Bandwidth reclaiming
      2.3 Energy-aware scheduling
    3. Scheduling Real-Time Tasks
      3.1 Definitions
      3.2 Schedulability Analysis for Uniprocessor Systems
      3.3 Schedulability Analysis for Multiprocessor Systems
      3.4 Relationship with SCHED_DEADLINE Parameters
    4. Bandwidth management
      4.1 System-wide settings
      4.2 Task interface
      4.3 Default behavior
      4.4 Behavior of sched_yield()
    5. Tasks CPU affinity
      5.1 SCHED_DEADLINE and cpusets HOWTO
    6. Future plans
    A. Test suite
    B. Minimal main()


0. WARNING
==========

 Fiddling with these settings can result in unpredictable or even unstable
 system behavior. As for -rt (group) scheduling, it is assumed that root users
 know what they are doing.


1. Overview
===========

 The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
 basically an implementation of the Earliest Deadline First (EDF) scheduling
 algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
 that makes it possible to isolate the behavior of tasks from each other.


2. Scheduling algorithm
=======================

2.1 Main algorithm
------------------

 SCHED_DEADLINE [18] uses three parameters, named "runtime", "period", and
 "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
 "runtime" microseconds of execution time every "period" microseconds, and
 these "runtime" microseconds are available within "deadline" microseconds
 from the beginning of the period. In order to implement this behavior,
 every time the task wakes up, the scheduler computes a "scheduling deadline"
 consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
 scheduled using EDF[1] on these scheduling deadlines (the task with the
 earliest scheduling deadline is selected for execution). Notice that the
 task actually receives "runtime" time units within "deadline" only if a
 proper "admission control" strategy (see Section "4. Bandwidth management")
 is used (clearly, if the system is overloaded this guarantee cannot be
 respected).
 Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
 that each task runs for at most its runtime every period, avoiding any
 interference between different tasks (bandwidth isolation), while the EDF[1]
 algorithm selects the task with the earliest scheduling deadline as the one
 to be executed next. Thanks to this feature, tasks that do not strictly
 comply with the "traditional" real-time task model (see Section 3) can
 effectively use the new policy.

 In more detail, the CBS algorithm assigns scheduling deadlines to
 tasks in the following way:

  - Each SCHED_DEADLINE task is characterized by the "runtime",
    "deadline", and "period" parameters;

  - The state of the task is described by a "scheduling deadline", and
    a "remaining runtime".
    These two parameters are initially set to 0;

  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
    the scheduler checks if::

                 remaining runtime                  runtime
        ----------------------------------    >    ---------
        scheduling deadline - current time           period

    then, if the scheduling deadline is smaller than the current time, or
    this condition is verified, the scheduling deadline and the
    remaining runtime are re-initialized as

        scheduling deadline = current time + deadline
        remaining runtime = runtime

    otherwise, the scheduling deadline and the remaining runtime are
    left unchanged;

  - When a SCHED_DEADLINE task executes for an amount of time t, its
    remaining runtime is decreased as::

        remaining runtime = remaining runtime - t

    (technically, the runtime is decreased at every tick, or when the
    task is descheduled / preempted);

  - When the remaining runtime becomes less than or equal to 0, the task is
    said to be "throttled" (also known as "depleted" in real-time literature)
    and cannot be scheduled until its scheduling deadline.
    The "replenishment time" for this task (see next item) is set to be equal
    to the current value of the scheduling deadline;

  - When the current time is equal to the replenishment time of a
    throttled task, the scheduling deadline and the remaining runtime are
    updated as::

        scheduling deadline = scheduling deadline + period
        remaining runtime = remaining runtime + runtime

 The SCHED_FLAG_DL_OVERRUN flag in sched_attr's sched_flags field allows a
 task to get informed about runtime overruns through the delivery of SIGXCPU
 signals.


2.2 Bandwidth reclaiming
------------------------

 Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
 Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
 when flag SCHED_FLAG_RECLAIM is set.
 The following diagram illustrates the state names for tasks handled by GRUB::

                             ------------
                 (d)        |   Active   |
              ------------->|            |
              |             | Contending |
              |              ------------
              |                A      |
          ----------           |      |
         |          |          |      |
         | Inactive |          |(b)   | (a)
         |          |          |      |
          ----------           |      |
              A                |      V
              |              ------------
              |             |   Active   |
              --------------|     Non    |
                 (c)        | Contending |
                             ------------

 A task can be in one of the following states:

  - ActiveContending: if it is ready for execution (or executing);

  - ActiveNonContending: if it just blocked and has not yet surpassed the
    0-lag time;

  - Inactive: if it is blocked and has surpassed the 0-lag time.

 State transitions:

  (a) When a task blocks, it does not become immediately inactive since its
      bandwidth cannot be immediately reclaimed without breaking the
      real-time guarantees. It therefore enters a transitional state called
      ActiveNonContending. The scheduler arms the "inactive timer" to fire at
      the 0-lag time, when the task's bandwidth can be reclaimed without
      breaking the real-time guarantees.
      The 0-lag time for a task entering the ActiveNonContending state is
      computed as::

                        (runtime * dl_period)
             deadline - ---------------------
                             dl_runtime

      where runtime is the remaining runtime, while dl_runtime and dl_period
      are the reservation parameters.

  (b) If the task wakes up before the inactive timer fires, the task re-enters
      the ActiveContending state and the "inactive timer" is canceled.
      In addition, if the task wakes up on a different runqueue, then
      the task's utilization must be removed from the previous runqueue's
      active utilization and must be added to the new runqueue's active
      utilization.
      In order to avoid races between a task waking up on a runqueue while the
      "inactive timer" is running on a different CPU, the "dl_non_contending"
      flag is used to indicate that a task is not on a runqueue but is active
      (so, the flag is set when the task blocks and is cleared when the
      "inactive timer" fires or when the task wakes up).

  (c) When the "inactive timer" fires, the task enters the Inactive state and
      its utilization is removed from the runqueue's active utilization.

  (d) When an inactive task wakes up, it enters the ActiveContending state and
      its utilization is added to the active utilization of the runqueue where
      it has been enqueued.
 For each runqueue, the GRUB algorithm keeps track of two different
 bandwidths:

  - Active bandwidth (running_bw): this is the sum of the bandwidths of all
    tasks in active state (i.e., ActiveContending or ActiveNonContending);

  - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
    "belonging" to the runqueue, including the tasks in Inactive state.


 The algorithm reclaims the bandwidth of the tasks in Inactive state.
 It does so by decrementing the runtime of the executing task Ti at a pace
 equal to

           dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt

 where:

  - Ui is the bandwidth of task Ti;
  - Umax is the maximum reclaimable utilization (subject to RT throttling
    limits);
  - Uinact is the (per runqueue) inactive utilization, computed as
    (this_bw - running_bw);
  - Uextra is the (per runqueue) extra reclaimable utilization
    (subject to RT throttling limits).
 Let's now see a trivial example of two deadline tasks with runtime equal
 to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::

         A            Task T1
         |
         |                               |
         |                               |
         |--------                       |----
         |       |                       V
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


         A            Task T2
         |
         |                               |
         |                               |
         |        ------------------------|
         |        |                       V
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


         A            running_bw
         |
       1 -----------------               ------
         |               |              |
      0.5-               -----------------
         |                               |
         |---|---|---|---|---|---|---|---|--------->t
         0   1   2   3   4   5   6   7   8


  - Time t = 0:

    Both tasks are ready for execution and therefore in ActiveContending
    state. Suppose Task T1 is the first task to start execution.
    Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.

  - Time t = 2:

    Suppose that Task T1 blocks.
    Task T1 therefore enters the ActiveNonContending state. Since its
    remaining runtime is equal to 2, its 0-lag time is equal to t = 4.
    Task T2 starts execution, with its runtime still decreased as dq = -1 dt
    since there are no inactive tasks.

  - Time t = 4:

    This is the 0-lag time for Task T1. Since it didn't wake up in the
    meantime, it enters the Inactive state. Its bandwidth is removed from
    running_bw.
    Task T2 continues its execution. However, its runtime is now decreased as
    dq = -0.5 dt because Uinact = 0.5.
    Task T2 therefore reclaims the bandwidth unused by Task T1.

  - Time t = 8:

    Task T1 wakes up. It enters the ActiveContending state again, and the
    running_bw is incremented.


2.3 Energy-aware scheduling
---------------------------

 When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
 GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
 value that still allows the deadlines to be met. This behavior is currently
 implemented only for ARM architectures.

 Particular care must be taken when the time needed for changing frequency
 is of the same order of magnitude as the reservation period. In such cases,
 setting a fixed CPU frequency results in fewer deadline misses.


3. Scheduling Real-Time Tasks
=============================

 ..  BIG FAT WARNING ******************************************************

 .. warning::

   This section contains a (not-thorough) summary on classical deadline
   scheduling theory, and how it applies to SCHED_DEADLINE.
   The reader can "safely" skip to Section 4 if only interested in seeing
   how the scheduling policy can be used. Anyway, we strongly recommend
   coming back here and continuing reading (once the urge for testing is
   satisfied :P) to be sure of fully understanding all technical details.

 .. ************************************************************************

 There are no limitations on what kind of task can exploit this new
 scheduling discipline, even if it must be said that it is particularly
 suited for periodic or sporadic real-time tasks that need guarantees on
 their timing behavior, e.g., multimedia, streaming, control applications,
 etc.

3.1 Definitions
------------------------

 A typical real-time task is composed of a repetition of computation phases
 (task instances, or jobs) which are activated in a periodic or sporadic
 fashion.
 Each job J_j (where J_j is the j^th job of the task) is characterized by an
 arrival time r_j (the time when the job starts), an amount of computation
 time c_j needed to finish the job, and a job absolute deadline d_j, which
 is the time within which the job should be finished.
 The maximum execution time max{c_j} is called "Worst Case Execution Time"
 (WCET) for the task.
 A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
 sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
 d_j = r_j + D, where D is the task's relative deadline.
 Summing up, a real-time task can be described as

        Task = (WCET, D, P)

 The utilization of a real-time task is defined as the ratio between its
 WCET and its period (or minimum inter-arrival time), and represents
 the fraction of CPU time needed to execute the task.

 If the total utilization U = sum(WCET_i / P_i) is larger than M (with M
 equal to the number of CPUs), then the scheduler is unable to respect all
 the deadlines.
 Note that total utilization is defined as the sum of the utilizations
 WCET_i / P_i over all the real-time tasks in the system. When considering
 multiple real-time tasks, the parameters of the i-th task are indicated
 with the "_i" suffix.
 Moreover, if the total utilization is larger than M, then we risk starving
 non-real-time tasks by real-time tasks.
 If, instead, the total utilization is smaller than M, then non-real-time
 tasks will not be starved and the system might be able to respect all the
 deadlines.
 As a matter of fact, in this case it is possible to provide an upper bound
 for tardiness (defined as the maximum between 0 and the difference
 between the finishing time of a job and its absolute deadline).
 More precisely, it can be proven that using a global EDF scheduler the
 maximum tardiness of each task is smaller than or equal to

        ((M − 1) · WCET_max − WCET_min) / (M − (M − 2) · U_max) + WCET_max

 where WCET_max = max{WCET_i} is the maximum WCET, WCET_min = min{WCET_i}
 is the minimum WCET, and U_max = max{WCET_i / P_i} is the maximum
 utilization[12].

3.2 Schedulability Analysis for Uniprocessor Systems
----------------------------------------------------

 If M=1 (uniprocessor system), or in case of partitioned scheduling (each
 real-time task is statically assigned to one and only one CPU), it is
 possible to formally check if all the deadlines are respected.
 If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
 of all the tasks executing on a CPU if and only if the total utilization
 of the tasks running on such a CPU is smaller than or equal to 1.
 If D_i != P_i for some task, then it is possible to define the density of
 a task as WCET_i / min{D_i, P_i}, and EDF is able to respect all the
 deadlines of all the tasks running on a CPU if the sum of the densities
 of the tasks running on such a CPU is smaller than or equal to 1:

        sum(WCET_i / min{D_i, P_i}) <= 1

 It is important to notice that this condition is only sufficient, and not
 necessary: there are task sets that are schedulable, but do not respect the
 condition. For example, consider the task set {Task_1, Task_2} composed of
 Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
 EDF is clearly able to schedule the two tasks without missing any deadline
 (Task_1 is scheduled as soon as it is released, and finishes just in time
 to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
 its response time cannot be larger than 50ms + 10ms = 60ms) even if

        50 / min{50,100} + 10 / min{100,100} = 50 / 50 + 10 / 100 = 1.1

 Of course it is possible to test the exact schedulability of tasks with
 D_i != P_i (checking a condition that is both sufficient and necessary),
 but this cannot be done by comparing the total utilization or density with
 a constant. Instead, the so called "processor demand" approach can be used,
 computing the total amount of CPU time h(t) needed by all the tasks to
 respect all of their deadlines in a time interval of size t, and comparing
 such a time with the interval size t.
 If h(t) is smaller than t (that is, the amount of time needed by the tasks
 in a time interval of size t is smaller than the size of the interval) for
 all the possible values of t, then EDF is able to schedule the tasks
 respecting all of their deadlines. Since performing this check for all
 possible values of t is impossible, it has been proven[4,5,6] that it is
 sufficient to perform the test for values of t between 0 and a maximum
 value L. The cited papers contain all of the mathematical details and
 explain how to compute h(t) and L.
 In any case, this kind of analysis is too complex as well as too
 time-consuming to be performed on-line. Hence, as explained in Section 4,
 Linux uses an admission test based on the tasks' utilizations.

3.3 Schedulability Analysis for Multiprocessor Systems
------------------------------------------------------

 On multiprocessor systems with global EDF scheduling (non partitioned
 systems), a sufficient test for schedulability can not be based on the
 utilizations or densities: it can be shown that even if D_i = P_i task
 sets with utilizations slightly larger than 1 can miss deadlines regardless
 of the number of CPUs.

 Consider a set {Task_1, ..., Task_{M+1}} of M+1 tasks on a system with M
 CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
 and WCET equal to P.
 The remaining M tasks Task_i=(e,P-1,P-1) have an arbitrarily small worst
 case execution time (indicated as "e" here) and a period smaller than the
 one of the first task. Hence, if all the tasks activate at the same time t,
 global EDF schedules these M tasks first (because their absolute deadlines
 are equal to t + P - 1, hence they are smaller than the absolute deadline
 of Task_1, which is t + P). As a result, Task_1 can be scheduled only at
 time t + e, and will finish at time t + e + P, after its absolute deadline.
 The total utilization of the task set is U = M · e / (P - 1) + P / P =
 M · e / (P - 1) + 1, and for small values of e this can become very close
 to 1. This is known as "Dhall's effect"[7]. Note: the example in the
 original paper by Dhall has been slightly simplified here (for example,
 Dhall more correctly computed lim_{e->0} U).

 More complex schedulability tests for global EDF have been developed in
 real-time literature[8,9], but they are not based on a simple comparison
 between total utilization (or density) and a fixed constant. If all tasks
 have D_i = P_i, a sufficient schedulability condition can be expressed in
 a simple way:

        sum(WCET_i / P_i) <= M - (M - 1) · U_max

 where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
 M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
 just confirms Dhall's effect.
 A more complete survey of the literature about schedulability tests for
 multi-processor real-time scheduling can be found in [11].

 As seen, enforcing that the total utilization is smaller than M does not
 guarantee that global EDF schedules the tasks without missing any deadline
 (in other words, global EDF is not an optimal scheduling algorithm).
 However, a total utilization smaller than M is enough to guarantee that non
 real-time tasks are not starved and that the tardiness of real-time tasks
 has an upper bound[12] (as previously noted). Different bounds on the
 maximum tardiness experienced by real-time tasks have been developed in
 various papers[13,14], but the theoretical result that is important for
 SCHED_DEADLINE is that if the total utilization is smaller than or equal to
 M then the response times of the tasks are limited.

3.4 Relationship with SCHED_DEADLINE Parameters
-----------------------------------------------

 Finally, it is important to understand the relationship between the
 SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
 deadline and period) and the real-time task parameters (WCET, D, P)
 described in this section. Note that the tasks' temporal constraints are
 represented by their absolute deadlines d_j = r_j + D described above, while
 SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
 Section 2).
 If an admission test is used to guarantee that the scheduling deadlines
 are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
 guaranteeing that all the jobs' deadlines of a task are respected.
 In order to do this, a task must be scheduled by setting:

  - runtime >= WCET
  - deadline = D
  - period <= P

 IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
 and the absolute deadlines (d_j) coincide, so a proper admission control
 allows respecting the jobs' absolute deadlines for this task (this is what
 is called "hard schedulability property" and is an extension of Lemma 1 of
 [2]).
 Notice that if runtime > deadline the admission control will surely reject
 this task, as it is not possible to respect its temporal constraints.

 References:

  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming
      in a hard-real-time environment. Journal of the Association for
      Computing Machinery, 20(1), 1973.
  2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
      Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
  4 - J. Y. Leung and M. L. Merril.
      A Note on Preemptive Scheduling of Periodic, Real-Time Tasks.
      Information Processing Letters, vol. 11, no. 3, pp. 115-118, 1980.
  5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
      Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
      11th IEEE Real-time Systems Symposium, 1990.
  6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
      Concerning the Preemptive Scheduling of Periodic Real-Time Tasks on
      One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp. 301-324,
      1990.
  7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem.
      Operations Research, vol. 26, no. 1, pp. 127-140, 1978.
  8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
      Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium,
      2003.
  9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
      IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
      pp. 760-768, 2005.
  10 - J. Goossens, S. Funk and S. Baruah. Priority-Driven Scheduling of
       Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
       vol. 25, no. 2-3, pp. 187-205, 2003.
  11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
       Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
       http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
  12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
       Scheduling on a Multiprocessor. Real-Time Systems Journal, vol.
32, 524*4882a593Smuzhiyun no. 2, pp 133-189, 2008. 525*4882a593Smuzhiyun 13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft 526*4882a593Smuzhiyun Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of 527*4882a593Smuzhiyun the 26th IEEE Real-Time Systems Symposium, 2005. 528*4882a593Smuzhiyun 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for 529*4882a593Smuzhiyun Global EDF. Proceedings of the 22nd Euromicro Conference on 530*4882a593Smuzhiyun Real-Time Systems, 2010. 531*4882a593Smuzhiyun 15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in 532*4882a593Smuzhiyun constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time 533*4882a593Smuzhiyun Systems, 2000. 534*4882a593Smuzhiyun 16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for 535*4882a593Smuzhiyun SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), 536*4882a593Smuzhiyun Dusseldorf, Germany, 2014. 537*4882a593Smuzhiyun 17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel 538*4882a593Smuzhiyun or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied 539*4882a593Smuzhiyun Computing, 2016. 540*4882a593Smuzhiyun 18 - J. Lelli, C. Scordino, L. Abeni, D. Faggioli, Deadline scheduling in the 541*4882a593Smuzhiyun Linux kernel, Software: Practice and Experience, 46(6): 821-839, June 542*4882a593Smuzhiyun 2016. 543*4882a593Smuzhiyun 19 - C. Scordino, L. Abeni, J. Lelli, Energy-Aware Real-Time Scheduling in 544*4882a593Smuzhiyun the Linux Kernel, 33rd ACM/SIGAPP Symposium On Applied Computing (SAC 545*4882a593Smuzhiyun 2018), Pau, France, April 2018. 546*4882a593Smuzhiyun 547*4882a593Smuzhiyun 548*4882a593Smuzhiyun4. 
Bandwidth management 549*4882a593Smuzhiyun======================= 550*4882a593Smuzhiyun 551*4882a593Smuzhiyun As previously mentioned, in order for -deadline scheduling to be 552*4882a593Smuzhiyun effective and useful (that is, to be able to provide "runtime" time units 553*4882a593Smuzhiyun within "deadline"), it is important to have some method to keep the allocation 554*4882a593Smuzhiyun of the available fractions of CPU time to the various tasks under control. 555*4882a593Smuzhiyun This is usually called "admission control" and if it is not performed, then 556*4882a593Smuzhiyun no guarantee can be given on the actual scheduling of the -deadline tasks. 557*4882a593Smuzhiyun 558*4882a593Smuzhiyun As already stated in Section 3, a necessary condition to be respected to 559*4882a593Smuzhiyun correctly schedule a set of real-time tasks is that the total utilization 560*4882a593Smuzhiyun is smaller than M. When talking about -deadline tasks, this requires that 561*4882a593Smuzhiyun the sum of the ratio between runtime and period for all tasks is smaller 562*4882a593Smuzhiyun than M. Notice that the ratio runtime/period is equivalent to the utilization 563*4882a593Smuzhiyun of a "traditional" real-time task, and is also often referred to as 564*4882a593Smuzhiyun "bandwidth". 565*4882a593Smuzhiyun The interface used to control the CPU bandwidth that can be allocated 566*4882a593Smuzhiyun to -deadline tasks is similar to the one already used for -rt 567*4882a593Smuzhiyun tasks with real-time group scheduling (a.k.a. RT-throttling - see 568*4882a593Smuzhiyun Documentation/scheduler/sched-rt-group.rst), and is based on readable/ 569*4882a593Smuzhiyun writable control files located in procfs (for system wide settings). 
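The admission test just described amounts to simple arithmetic on the task set. The following user-space sketch illustrates it; struct dl_params and dl_admission_ok() are hypothetical names invented here for illustration, not a kernel or libc API, and the kernel's actual implementation differs:

```c
#include <stddef.h>

/*
 * Hypothetical illustration of the utilization-based admission test:
 * the sum of runtime_i / period_i over all -deadline tasks must stay
 * below the cap M * (sched_rt_runtime_us / sched_rt_period_us).
 */
struct dl_params {
	unsigned long long runtime_us;	/* budget per period */
	unsigned long long period_us;	/* reservation period */
};

int dl_admission_ok(const struct dl_params *tasks, size_t n,
		    unsigned int m_cpus,
		    unsigned long long rt_runtime_us,
		    unsigned long long rt_period_us)
{
	double total = 0.0;
	size_t i;

	/* Total bandwidth requested by the task set. */
	for (i = 0; i < n; i++)
		total += (double)tasks[i].runtime_us /
			 (double)tasks[i].period_us;

	/* Admit only if it fits below the global cap. */
	return total <= (double)m_cpus *
			((double)rt_runtime_us / (double)rt_period_us);
}
```

For example, two tasks with bandwidths 0.1 and 0.13 are admitted on a single CPU with the default 95% cap, while two tasks with bandwidths 0.6 and 0.5 are rejected on one CPU but admitted on two.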
 Notice that per-group settings (controlled through cgroupfs) are still not
 defined for -deadline tasks, because more discussion is needed in order to
 figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
 level.

 A main difference between deadline bandwidth management and RT-throttling
 is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
 and thus we don't need a higher level throttling mechanism to enforce the
 desired bandwidth. In other words, this means that interface parameters are
 only used at admission control time (i.e., when the user calls
 sched_setattr()). Scheduling is then performed considering actual tasks'
 parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks
 respecting their needs in terms of granularity. Therefore, using this simple
 interface we can put a cap on the total utilization of -deadline tasks (i.e.,
 \Sum (runtime_i / period_i) < global_dl_utilization_cap).

4.1 System wide settings
------------------------

 The system wide settings are configured under the /proc virtual file system.

 For now the -rt knobs are used for -deadline admission control and the
 -deadline runtime is accounted against the -rt runtime. We realize that this
 isn't entirely desirable; however, it is better to have a small interface for
 now, and be able to change it easily later. The ideal situation (see 5.) is to
 run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
 direct subset of dl_bw.

 This means that, for a root_domain comprising M CPUs, -deadline tasks
 can be created while the sum of their bandwidths stays below:

   M * (sched_rt_runtime_us / sched_rt_period_us)

 It is also possible to disable this bandwidth management logic, and
 thus be free to oversubscribe the system up to any arbitrary level.
 This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.


4.2 Task interface
------------------

 Specifying a periodic/sporadic task that executes for a given amount of
 runtime at each instance, and that is scheduled according to the urgency of
 its own timing constraints needs, in general, a way of declaring:

  - a (maximum/typical) instance execution time,
  - a minimum interval between consecutive instances,
  - a time constraint by which each instance must be completed.

 Therefore:

  * a new struct sched_attr, containing all the necessary fields, is
    provided;
  * the new scheduling related syscalls that manipulate it, i.e.,
    sched_setattr() and sched_getattr(), are implemented.
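The three declarations above map directly onto the SCHED_DEADLINE fields of struct sched_attr, which are expressed in nanoseconds. A minimal sketch of that mapping follows; dl_spec_from_model() and struct dl_spec are hypothetical names invented here for illustration, assuming a task model given in microseconds:

```c
/*
 * Hypothetical helper (not a kernel API): translate the task model of
 * Section 3.4 -- WCET, relative deadline D and minimum inter-arrival
 * time P, here given in microseconds -- into the nanosecond values
 * that sched_setattr() expects.
 */
struct dl_spec {
	unsigned long long runtime_ns;
	unsigned long long deadline_ns;
	unsigned long long period_ns;
};

struct dl_spec dl_spec_from_model(unsigned long long wcet_us,
				  unsigned long long d_us,
				  unsigned long long p_us)
{
	struct dl_spec s;

	s.runtime_ns  = wcet_us * 1000ULL;	/* runtime >= WCET */
	s.deadline_ns = d_us * 1000ULL;		/* deadline = D */
	s.period_ns   = p_us * 1000ULL;		/* period <= P */
	return s;
}
```

These are the values that would be placed in sched_runtime, sched_deadline and sched_period before calling sched_setattr() (see Appendix B for a complete program).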
 For debugging purposes, the leftover runtime and absolute deadline of a
 SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
 dl.runtime and dl.deadline, both values in ns). A programmatic way to
 retrieve these values from production code is under discussion.


4.3 Default behavior
--------------------

 The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
 950000. With rt_period equal to 1000000, by default, this means that -deadline
 tasks can use at most 95%, multiplied by the number of CPUs that compose the
 root_domain, for each root_domain.
 This means that non -deadline tasks will receive at least 5% of the CPU time,
 and that -deadline tasks will receive their runtime with a guaranteed
 worst-case delay with respect to the "deadline" parameter. If "deadline" =
 "period" and the cpuset mechanism is used to implement partitioned scheduling
 (see Section 5), then this simple setting of the bandwidth management is able
 to deterministically guarantee that -deadline tasks will receive their runtime
 in a period.

 Finally, notice that in order not to jeopardize the admission control a
 -deadline task cannot fork.
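As a quick sanity check of the arithmetic above, the default cap works out to 0.95 per CPU; dl_default_cap() below is a hypothetical name invented here for illustration:

```c
/*
 * Hypothetical helper (not a kernel API): the default utilization cap
 * available to -deadline tasks on a root_domain of m_cpus CPUs, from
 * the defaults rt_runtime = 950000 and rt_period = 1000000.
 */
double dl_default_cap(unsigned int m_cpus)
{
	const double rt_runtime_us = 950000.0;
	const double rt_period_us  = 1000000.0;

	/* 95% of each CPU, times the number of CPUs. */
	return (double)m_cpus * (rt_runtime_us / rt_period_us);
}
```

So on a 4-CPU root_domain the total -deadline bandwidth admitted by default is 3.8 CPUs' worth, leaving at least 5% of each CPU for other tasks.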
4.4 Behavior of sched_yield()
-----------------------------

 When a SCHED_DEADLINE task calls sched_yield(), it gives up its
 remaining runtime and is immediately throttled, until the next
 period, when its runtime will be replenished (a special flag
 dl_yielded is set and used to correctly handle throttling and runtime
 replenishment after a call to sched_yield()).

 This behavior of sched_yield() allows the task to wake up exactly at
 the beginning of the next period. Also, this may be useful in the
 future with bandwidth reclaiming mechanisms, where sched_yield() will
 make the leftover runtime available for reclamation by other
 SCHED_DEADLINE tasks.


5. Tasks CPU affinity
=====================

 -deadline tasks cannot have an affinity mask smaller than the entire
 root_domain they are created on. However, affinities can be specified
 through the cpuset facility (Documentation/admin-guide/cgroup-v1/cpusets.rst).
5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

 An example of a simple configuration (pin a -deadline task to CPU0)
 follows (rt-app is used to create a -deadline task)::

   mkdir /dev/cpuset
   mount -t cgroup -o cpuset cpuset /dev/cpuset
   cd /dev/cpuset
   mkdir cpu0
   echo 0 > cpu0/cpuset.cpus
   echo 0 > cpu0/cpuset.mems
   echo 1 > cpuset.cpu_exclusive
   echo 0 > cpuset.sched_load_balance
   echo 1 > cpu0/cpuset.cpu_exclusive
   echo 1 > cpu0/cpuset.mem_exclusive
   echo $$ > cpu0/tasks
   rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
                                  # task affinity

6. Future plans
===============

 Still missing:

  - programmatic way to retrieve current runtime and absolute deadline
  - refinements to deadline inheritance, especially regarding the possibility
    of retaining bandwidth isolation among non-interacting tasks. This is
    being studied from both theoretical and practical points of view, and
    hopefully we should be able to produce some demonstrative code soon;
  - (c)group based bandwidth management, and maybe scheduling;
  - access control for non-root users (and related security concerns to
    address): what is the best way to allow unprivileged use of the
    mechanisms, and how do we prevent non-root users from "cheating" the
    system?

 As already discussed, we are also planning to merge this work with the EDF
 throttling patches [https://lkml.org/lkml/2010/2/23/239] but we are still in
 the preliminary phases of the merge and we really seek feedback that would
 help us decide on the direction it should take.

Appendix A. Test suite
======================

 The SCHED_DEADLINE policy can be easily tested using two applications that
 are part of a wider Linux Scheduler validation suite. The suite is
 available as a GitHub repository: https://github.com/scheduler-tools.

 The first testing application is called rt-app and can be used to
 start multiple threads with specific parameters. rt-app supports
 SCHED_{OTHER,FIFO,RR,DEADLINE} scheduling policies and their related
 parameters (e.g., niceness, priority, runtime/deadline/period). rt-app
 is a valuable tool, as it can be used to synthetically recreate certain
 workloads (maybe mimicking real use-cases) and evaluate how the scheduler
 behaves under such workloads. In this way, results are easily reproducible.
 rt-app is available at: https://github.com/scheduler-tools/rt-app.

 Thread parameters can be specified from the command line, with something like
 this::

  # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5

 The above creates 2 threads. The first one, scheduled by SCHED_DEADLINE,
 executes for 10ms every 100ms. The second one, scheduled at SCHED_FIFO
 priority 10, executes for 20ms every 150ms. The test will run for a total
 of 5 seconds.

 More interestingly, configurations can be described with a json file that
 can be passed as input to rt-app with something like this::

  # rt-app my_config.json

 The parameters that can be specified with the second method are a superset
 of the command line options. Please refer to the rt-app documentation for
 more details (`<rt-app-sources>/doc/*.json`).

 The second testing application is a modification of schedtool, called
 schedtool-dl, which can be used to set up SCHED_DEADLINE parameters for a
 certain pid/application. schedtool-dl is available at:
 https://github.com/scheduler-tools/schedtool-dl.git.
 The usage is straightforward::

  # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app

 With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
 of 10ms every 100ms (note that parameters are expressed in nanoseconds).
 You can also use schedtool to create a reservation for an already running
 application, given that you know its pid::

  # schedtool -E -t 10000000:100000000 my_app_pid

Appendix B. Minimal main()
==========================

 We provide in what follows a simple (ugly) self-contained code snippet
 showing how SCHED_DEADLINE reservations can be created by a real-time
 application developer::

   #define _GNU_SOURCE
   #include <unistd.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <time.h>
   #include <linux/unistd.h>
   #include <linux/kernel.h>
   #include <linux/types.h>
   #include <sys/syscall.h>
   #include <pthread.h>

   #define gettid() syscall(__NR_gettid)

   #define SCHED_DEADLINE 6

   /* XXX use the proper syscall numbers */
   #ifdef __x86_64__
   #define __NR_sched_setattr 314
   #define __NR_sched_getattr 315
   #endif

   #ifdef __i386__
   #define __NR_sched_setattr 351
   #define __NR_sched_getattr 352
   #endif

   #ifdef __arm__
   #define __NR_sched_setattr 380
   #define __NR_sched_getattr 381
   #endif

   static volatile int done;

   struct sched_attr {
        __u32 size;

        __u32 sched_policy;
        __u64 sched_flags;

        /* SCHED_NORMAL, SCHED_BATCH */
        __s32 sched_nice;

        /* SCHED_FIFO, SCHED_RR */
        __u32 sched_priority;

        /* SCHED_DEADLINE (nsec) */
        __u64 sched_runtime;
        __u64 sched_deadline;
        __u64 sched_period;
   };

   int sched_setattr(pid_t pid,
                     const struct sched_attr *attr,
                     unsigned int flags)
   {
        return syscall(__NR_sched_setattr, pid, attr, flags);
   }

   int sched_getattr(pid_t pid,
                     struct sched_attr *attr,
                     unsigned int size,
                     unsigned int flags)
   {
        return syscall(__NR_sched_getattr, pid, attr, size, flags);
   }

   void *run_deadline(void *data)
   {
        struct sched_attr attr;
        int x = 0;
        int ret;
        unsigned int flags = 0;

        printf("deadline thread started [%ld]\n", gettid());

        attr.size = sizeof(attr);
        attr.sched_flags = 0;
        attr.sched_nice = 0;
        attr.sched_priority = 0;

        /* This creates a 10ms/30ms reservation */
        attr.sched_policy = SCHED_DEADLINE;
        attr.sched_runtime = 10 * 1000 * 1000;
        attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;

        ret = sched_setattr(0, &attr, flags);
        if (ret < 0) {
             done = 0;
             perror("sched_setattr");
             exit(-1);
        }

        while (!done) {
             x++;
        }

        printf("deadline thread dies [%ld]\n", gettid());
        return NULL;
   }

   int main (int argc, char **argv)
   {
        pthread_t thread;

        printf("main thread [%ld]\n", gettid());

        pthread_create(&thread, NULL, run_deadline, NULL);

        sleep(10);

        done = 1;
        pthread_join(thread, NULL);

        printf("main dies [%ld]\n", gettid());
        return 0;
   }