=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given number of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` from
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
brevity.
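
As a rough illustration of the capacity formula from 1.1, the following Python
sketch computes normalized capacities for two hypothetical CPUs. The names
``CpuSpec`` and ``normalized_capacities``, the sample values, and the use of
1024 as the normalization scale are assumptions made for this example; this is
not kernel code.

```python
from dataclasses import dataclass

# Normalization scale (assumed here); the highest-capacity CPU gets this value.
CAPACITY_SCALE = 1024


@dataclass
class CpuSpec:
    """Hypothetical description of a CPU, for illustration only."""
    work_per_hz: int   # relative work done per cycle (arbitrary unit)
    max_freq_mhz: int  # highest reachable frequency


def normalized_capacities(cpus):
    """Compute work_per_hz * max_freq per CPU, normalized to the best CPU."""
    raw = [c.work_per_hz * c.max_freq_mhz for c in cpus]
    best = max(raw)
    return [r * CAPACITY_SCALE // best for r in raw]


# CPU0 does twice the work per cycle and reaches a 1.5x higher frequency.
cpus = [CpuSpec(work_per_hz=2, max_freq_mhz=3000),
        CpuSpec(work_per_hz=1, max_freq_mhz=2000)]
print(normalized_capacities(cpus))
```

With these sample values, the second CPU ends up at roughly a third of the
first CPU's capacity, since both its work per cycle and its maximum frequency
are lower.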

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

  CPU0 work ^
            |     ____                ____                ____
            |    |    |              |    |              |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     _________           _________           ____
            |    |         |         |         |         |
            +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e.
same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

  CPU0 work ^
            |     ____                ____                ____
            |    |    |              |    |              |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

                            workload on CPU1
  CPU1 work ^
            |     ______________      ______________      ____
            |    |              |    |              |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regard to CPU capacity.
Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on these.

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above, with F being the CPU's maximum
frequency, yields a frequency invariant task utilization of 25% in both cases.

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

  CPU0 work ^
            |     ____                ____                ____
            |    |    |              |    |              |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     ______________      ______________      ____
            |    |              |    |              |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the example above yields a CPU
invariant task utilization of 25%.

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal.
The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm and arm64 architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/arm/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out which frequency each CPU
has been running at. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities.
Should that be the case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that
  spans all unique CPU capacity values.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
  cpusets  cs0         cs1

Which could be created via:

.. code-block:: sh

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy; it describes an SMP island and should be treated as such.
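
The difference between the system-wide key and the per-hierarchy flag can be
illustrated with a toy Python model of the cpuset setup above. The helper
``spans_multiple_capacities`` and the capacity values 512 and 1024 (standing in
for C/2 and C) are assumptions of this sketch; the kernel derives both pieces of
state while building the sched domains, which this model does not attempt to
reproduce.

```python
def spans_multiple_capacities(cpus, capacity_of):
    """Return True if the given CPUs cover more than one capacity value."""
    return len({capacity_of[cpu] for cpu in cpus}) > 1


# CPUs 0-3 have capacity C/2, CPUs 4-7 have capacity C (with C = 1024 here).
capacity_of = {cpu: 512 for cpu in range(0, 4)}
capacity_of.update({cpu: 1024 for cpu in range(4, 8)})

cs0 = range(0, 2)  # CPUs 0-1
cs1 = range(2, 8)  # CPUs 2-7

# Stand-in for the system-wide sched_asym_cpucapacity static key.
system_key = spans_multiple_capacities(capacity_of.keys(), capacity_of)
# Stand-ins for SD_ASYM_CPUCAPACITY being present in each hierarchy.
cs0_flag = spans_multiple_capacities(cs0, capacity_of)
cs1_flag = spans_multiple_capacities(cs1, capacity_of)

print(system_key, cs0_flag, cs1_flag)
```

In this model the system-wide check is true while the check restricted to cs0 is
false, which is why testing the static key alone is not sufficient for code that
targets CPUs 0-1.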

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need to achieve more
work than what its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above.
On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Load balancing
~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all.
Consider::

  w == wakeup event

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

                         workload on CPU0
  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time
                w                   w                   w

                         workload on CPU1
  CPU work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->
                w

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any more
wakeup events to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks).
When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(cpu)

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(task_cpu(p))

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.
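
The per-class wakeup criteria from section 5 can be summarized in a short Python
sketch. The function names and the sample values (expressed on an assumed
0..1024 scale) are hypothetical and chosen for illustration; the sketch only
mirrors the inequalities as written in this document, evaluated against a
candidate CPU's capacity, and ignores priority, bandwidth and deadline
constraints as well as any other detail of the real scheduler.

```python
def clamp(value, lo, hi):
    """Restrict value to the [lo, hi] range."""
    return max(lo, min(value, hi))


def cfs_fits(task_util, uclamp_min, uclamp_max, cpu_capacity):
    """CFS: the clamped task utilization must be below the CPU capacity."""
    return clamp(task_util, uclamp_min, uclamp_max) < cpu_capacity


def rt_fits(uclamp_min, cpu_capacity):
    """RT: the task's uclamp.min must not exceed the CPU capacity."""
    return uclamp_min <= cpu_capacity


def dl_fits(task_bandwidth, cpu_capacity):
    """DL: the task's bandwidth must be below the CPU capacity."""
    return task_bandwidth < cpu_capacity


BIG, LITTLE = 1024, 341  # sample capacities, e.g. C and roughly C/3

# A busy loop with a low uclamp.max is allowed on the LITTLE CPU.
print(cfs_fits(task_util=1024, uclamp_min=0, uclamp_max=300, cpu_capacity=LITTLE))
# A small task with a high uclamp.min only fits on the big CPU.
print(cfs_fits(task_util=102, uclamp_min=800, uclamp_max=1024, cpu_capacity=LITTLE))
print(cfs_fits(task_util=102, uclamp_min=800, uclamp_max=1024, cpu_capacity=BIG))
```

The two CFS calls with a raised uclamp.min show how a small periodic task stops
fitting on the lower-capacity CPU once its clamped utilization exceeds that
CPU's capacity.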