=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given number of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` from
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

                            workload on CPU1
 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time
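
The capacity values of 1.3.1 and 1.3.2 can be reproduced numerically. The
following is a minimal sketch rather than kernel code: ``relative_capacities()``
is a hypothetical helper, the inputs are unitless ratios (W = 1, F = 1), and
the value 1024 is assumed to stand in for C, mirroring the kernel's
``SCHED_CAPACITY_SCALE``:

.. code-block:: python

  from fractions import Fraction

  SCALE = 1024  # assumption: stands in for C, like SCHED_CAPACITY_SCALE

  def relative_capacities(work_per_hz, max_freq):
      """capacity(cpu) = work_per_hz(cpu) * max_freq(cpu), normalized so
      that the most performant CPU gets SCALE."""
      raw = [w * f for w, f in zip(work_per_hz, max_freq)]
      top = max(raw)
      return [int(SCALE * r / top) for r in raw]

  # 1.3.1: CPU1 does half the work per Hz, both CPUs share one frequency.
  identical_opps = relative_capacities(
      [Fraction(1), Fraction(1, 2)], [Fraction(1), Fraction(1)])

  # 1.3.2: same CPUs, but CPU1 tops out at 2/3 of CPU0's frequency.
  different_opps = relative_capacities(
      [Fraction(1), Fraction(1, 2)], [Fraction(1), Fraction(2, 3)])

With these inputs, CPU1 ends up at half of the scale in the first case and at
a third of it (rounded down) in the second.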

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regard to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on these.

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above, taking F to be the CPU's
maximum frequency, yields a frequency invariant task utilization of 25% in both
cases.
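
As a quick check of the arithmetic, the sketch below applies the formula to the
two traces. It assumes F is the CPU's maximum frequency and picks an arbitrary
value for it; ``task_util_freq_inv()`` is an illustrative function, not a kernel
symbol:

.. code-block:: python

  def task_util_freq_inv(duty_cycle, curr_freq, max_freq):
      """Scale the observed duty cycle by the current/maximum frequency."""
      return duty_cycle * curr_freq / max_freq

  F = 1_000_000  # arbitrary maximum frequency in kHz (assumption)

  at_f = task_util_freq_inv(0.25, F, F)            # trace at F
  at_half_f = task_util_freq_inv(0.50, F // 2, F)  # trace at F/2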

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the example above yields a CPU
invariant task utilization of 25%.
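
The same check can be done for CPU invariance. This sketch uses exact fractions
to avoid rounding noise and assumes C = 1024; the function name is illustrative
only:

.. code-block:: python

  from fractions import Fraction

  def task_util_cpu_inv(duty_cycle, capacity, max_capacity):
      """Scale the observed duty cycle by the CPU's relative capacity."""
      return duty_cycle * capacity / max_capacity

  C = Fraction(1024)  # assumption: capacity of the biggest CPU

  on_cpu0 = task_util_cpu_inv(Fraction(25, 100), C, C)      # capacity C
  on_cpu1 = task_util_cpu_inv(Fraction(75, 100), C / 3, C)  # capacity C/3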

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.
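
To illustrate what "invariant" means here, the sketch below takes one
hypothetical periodic task (10% duty cycle on CPU0 at its maximum frequency, on
the system of 1.3.2) and applies the pseudo-formula to the duty cycles it would
show in four placements. The frequencies are arbitrary units chosen so that
max_freq(CPU1) = 2/3 * max_freq(CPU0); none of the names are kernel symbols:

.. code-block:: python

  from fractions import Fraction

  def task_util_inv(duty_cycle, curr_freq, max_freq, capacity, max_capacity):
      """Apply both the frequency and the CPU invariance factors."""
      return duty_cycle * curr_freq / max_freq * capacity / max_capacity

  C = Fraction(1024)  # assumption: capacity of CPU0

  # (observed duty cycle, curr_freq, max_freq, capacity of the CPU)
  placements = [
      (Fraction(1, 10), 6, 6, C),      # CPU0 at its maximum frequency
      (Fraction(2, 10), 3, 6, C),      # CPU0 at half its maximum frequency
      (Fraction(3, 10), 4, 4, C / 3),  # CPU1 at its maximum frequency
      (Fraction(6, 10), 2, 4, C / 3),  # CPU1 at half its maximum frequency
  ]

  utils = [task_util_inv(d, cur, mx, cap, C) for d, cur, mx, cap in placements]

All four placements collapse to the same value, which is the duty cycle the
task would have on CPU0 at its maximum frequency.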

Any mention of task utilization in the following sections will imply its
invariant form.

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.
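
To give a feel for what an averaged estimator looks like, here is a heavily
simplified, floating-point sketch of a geometrically decaying average. It is
not the PELT implementation: the only thing borrowed from PELT is the decay
rate (a contribution is halved after 32 periods); the 1024 scale, the function
names and the per-period "running fraction" input are assumptions made for the
example:

.. code-block:: python

  Y = 0.5 ** (1 / 32)  # decay per period, so that Y ** 32 == 0.5
  SCALE = 1024         # assumption: utilization of an always-running task

  def decayed_update(util, running_fraction):
      """Age the previous average by one period and add the new sample."""
      return util * Y + running_fraction * SCALE * (1 - Y)

  def simulate(pattern, util=0.0):
      """Feed one running fraction (0.0 to 1.0) per period."""
      for running_fraction in pattern:
          util = decayed_update(util, running_fraction)
      return util

  ramp = simulate([1.0] * 32)                      # new busy loop, 32 periods
  busy = simulate([1.0] * 1000)                    # long-running busy loop
  periodic = simulate([1.0, 0.0, 0.0, 0.0] * 500)  # 25% duty cycle

A freshly started busy loop only reaches half of the scale after 32 periods,
which is why the estimator lags behind the "true" utilization, while the 25%
duty cycle task settles in the vicinity of a quarter of the scale.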

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own, so this information
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm and arm64 architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/arm/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out the frequency each CPU has
been running at. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).
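
The counter-based approach can be sketched as follows. This is an illustration
under stated assumptions, not the code of any architecture: it supposes one
counter that ticks at the CPU's current frequency and one that ticks at a
known constant rate, and it expresses the result on a 1024 scale. The function
and parameter names are made up for the example:

.. code-block:: python

  SCALE = 1024  # assumption: value meaning "running at maximum frequency"

  def freq_scale_from_counters(delta_core, delta_const, const_freq, max_freq):
      """Estimate curr_frequency / max_frequency over a sampling window.

      delta_core:  ticks of the counter following the current frequency
      delta_const: ticks of the constant-rate counter in the same window
      const_freq:  rate of the constant counter (same unit as max_freq)
      """
      # Average frequency over the window: delta_core / delta_const * const_freq
      return SCALE * delta_core * const_freq // (delta_const * max_freq)

  # 10 ms window: constant counter at 100 MHz, CPU capable of 2 GHz but
  # having averaged 1 GHz.
  half_speed = freq_scale_from_counters(
      10_000_000, 1_000_000, 100_000_000, 2_000_000_000)

  # Same window with the CPU at its maximum frequency throughout.
  full_speed = freq_scale_from_counters(
      20_000_000, 1_000_000, 100_000_000, 2_000_000_000)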

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that
  spans all unique CPU capacity values.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
  cpusets   cs0         cs1

Which could be created via:

.. code-block:: sh

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY is not
set in that hierarchy, which describes an SMP island and should be treated as
such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)
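
The two-step check can be modelled on the cpuset example above. The following
is a toy model in Python, not scheduler code: the capacity values (512 for C/2,
1024 for C) are assumed, each cpuset is treated as one sched_domain hierarchy,
and the two booleans merely borrow the names of the static key and of the flag:

.. code-block:: python

  # CPUs 0-3 have capacity C/2, CPUs 4-7 have capacity C (C = 1024 assumed).
  CAPACITY = {cpu: 512 if cpu < 4 else 1024 for cpu in range(8)}

  # One sched_domain hierarchy per cpuset, as in the diagram above.
  HIERARCHIES = {"cs0": range(0, 2), "cs1": range(2, 8)}

  def spans_asymmetry(cpus):
      """True if the given CPUs have more than one capacity value."""
      return len({CAPACITY[cpu] for cpu in cpus}) > 1

  # System-wide: models the sched_asym_cpucapacity static key.
  sched_asym_cpucapacity = spans_asymmetry(CAPACITY)

  def sd_asym_cpucapacity(cpu):
      """Models SD_ASYM_CPUCAPACITY being present in cpu's hierarchy."""
      return any(cpu in cpus and spans_asymmetry(cpus)
                 for cpus in HIERARCHIES.values())

  def needs_asym_handling(cpu):
      """Canonical pattern: cheap global check first, then per-hierarchy."""
      return sched_asym_cpucapacity and sd_asym_cpucapacity(cpu)

In this model the key is enabled for the whole system, yet
``needs_asym_handling(0)`` is false because cs0 is an SMP island, while
``needs_asym_handling(3)`` is true because cs1 mixes both capacity values.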

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need to achieve more
work than what its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.
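
The effect of the clamp on the fitness test can be tried out with a few
numbers. This sketch evaluates only the inequality given above, on an assumed
1024 scale with a LITTLE CPU at 512 and a big CPU at 1024; any additional
headroom an implementation may apply is left out, and ``fits()`` is a
hypothetical name:

.. code-block:: python

  def clamp(value, low, high):
      return max(low, min(value, high))

  def fits(task_util, uclamp_min, uclamp_max, capacity):
      """Clamped capacity fitness criterion from the formula above."""
      return clamp(task_util, uclamp_min, uclamp_max) < capacity

  LITTLE, BIG = 512, 1024  # assumed capacities

  # Busy loop (100% utilization): too big for a LITTLE by default, but a low
  # uclamp.max lets it fit anywhere.
  busy_default = fits(1024, 0, 1024, LITTLE)
  busy_capped = fits(1024, 0, 256, LITTLE)

  # Small periodic task (about 10%): fits a LITTLE by default, but a high
  # uclamp.min leaves only the big CPU as a fitting candidate.
  small_default = fits(102, 0, 1024, LITTLE)
  small_boosted_little = fits(102, 768, 1024, LITTLE)
  small_boosted_big = fits(102, 768, 1024, BIG)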

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Load balancing
~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  w == wakeup event

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

                           workload on CPU0
  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time
                w                   w                   w

                           workload on CPU1
  CPU work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->
                w

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any more
wakeup events to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.
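
The decision described above can be reduced to a small model. This is only a
sketch of the condition, not of the load balancer: ``misfit_destination()`` is
a hypothetical helper, the capacities are assumed values for C and C/3, and
picking the highest-capacity candidate is an arbitrary choice made to keep the
example deterministic:

.. code-block:: python

  def misfit_destination(task_util, task_cpu, capacity):
      """Return a CPU to migrate a misfit task to, or None.

      capacity maps each CPU id to its capacity.
      """
      if task_util < capacity[task_cpu]:
          return None  # the task fits, nothing to do
      bigger = [cpu for cpu, cap in capacity.items()
                if cap > capacity[task_cpu]]
      if not bigger:
          return None  # misfit, but no CPU with more capacity exists
      return max(bigger, key=lambda cpu: capacity[cpu])

  capacity = {0: 1024, 1: 341}  # assumed values for C and C/3

  stuck_on_little = misfit_destination(512, 1, capacity)
  fine_on_big = misfit_destination(512, 0, capacity)
  nowhere_to_go = misfit_destination(1024, 0, capacity)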

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(cpu)

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.
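
The fallback behaviour can be sketched as a two-pass search. The model below
assumes the priority constraints have already produced an ordered list of
candidate CPUs; that list, the capacities and ``rt_select_cpu()`` itself are
hypothetical and only serve to show the capacity filter and its fallback:

.. code-block:: python

  def rt_select_cpu(uclamp_min, candidates, capacity):
      """Pick the first priority-wise candidate that is big enough.

      candidates: CPUs acceptable priority-wise, in order of preference.
      """
      for cpu in candidates:
          if uclamp_min <= capacity[cpu]:
              return cpu
      # No candidate satisfies the capacity criterion: ignore capacities and
      # fall back to the priority-based choice alone.
      return candidates[0] if candidates else None

  capacity = {0: 512, 1: 512, 2: 1024}  # assumed capacities

  boosted = rt_select_cpu(800, [0, 1, 2], capacity)
  boosted_no_big = rt_select_cpu(800, [0, 1], capacity)
  unboosted = rt_select_cpu(0, [1, 0, 2], capacity)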

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(task_cpu(p))

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.