=======================
Energy Aware Scheduling
=======================

1. Introduction
---------------

Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
the impact of its decisions on the energy consumed by CPUs. EAS relies on an
Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
with a minimal impact on throughput. This document aims at providing an
introduction on how EAS works, what the main design decisions behind it are,
and what is needed to get it to run.

Before going any further, please note that at the time of writing::

   /!\ EAS does not support platforms with symmetric CPU topologies /!\

EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
because this is where the potential for saving energy through scheduling is
the highest.

The actual EM used by EAS is _not_ maintained by the scheduler, but by a
dedicated framework. For details about this framework and what it provides,
please refer to its documentation (see Documentation/power/energy-model.rst).


2. Background and Terminology
-----------------------------

To make it clear from the start:
 - energy = [joule] (a resource, like the battery on battery-powered devices)
 - power = energy/time = [joule/second] = [watt]

The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize::

        performance [inst/s]
        --------------------
            power [W]

which is equivalent to minimizing::

        energy [J]
        -----------
        instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance.
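
The equivalence of the two formulations above can be checked with a few lines
of Python. This is only an illustrative sketch: the two operating points and
their numbers are made up and do not describe any real platform.

```python
# Hypothetical operating points: (instructions per second, power in watts).
points = {"low": (1.0e9, 0.5), "high": (2.0e9, 2.0)}

def perf_per_watt(inst_per_s, watts):
    # [inst/s] / [J/s] = [inst/J]
    return inst_per_s / watts

def energy_per_inst(inst_per_s, watts):
    # [J/s] / [inst/s] = [J/inst], the inverse of the ratio above
    return watts / inst_per_s

# Maximizing performance per watt selects the same point as
# minimizing energy per instruction.
best_by_efficiency = max(points, key=lambda k: perf_per_watt(*points[k]))
best_by_energy = min(points, key=lambda k: energy_per_inst(*points[k]))
```

Both selections pick the same operating point because one metric is the
inverse of the other.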

The idea behind introducing an EM is to allow the scheduler to evaluate the
implications of its decisions rather than blindly applying energy-saving
techniques that may have positive effects only on some platforms. At the same
time, the EM must be as simple as possible to minimize the scheduler latency
impact.

In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
for the scheduler to decide where a task should run (during wake-up), the EM
is used to break the tie between several good CPU candidates and pick the one
that is predicted to yield the best energy consumption without harming the
system's throughput. The predictions made by EAS rely on specific elements of
knowledge about the platform's topology, which include the 'capacity' of CPUs,
and their respective energy costs.


3. Topology information
-----------------------

EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
differentiate CPUs with different computing throughput. The 'capacity' of a CPU
represents the amount of work it can absorb when running at its highest
frequency compared to the most capable CPU of the system. Capacity values are
normalized in a 1024 range, and are comparable with the utilization signals of
tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
to capacity and utilization values, EAS is able to estimate how big/busy a
task/CPU is, and to take this into consideration when evaluating performance vs
energy trade-offs. The capacity of CPUs is provided via arch-specific code
through the arch_scale_cpu_capacity() callback.
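
As a rough illustration of the normalization described above, the Python
sketch below scales some made-up per-CPU throughput figures into the 1024
range and compares a utilization value against the result. The helper names
and the raw numbers are assumptions made for this example; the kernel obtains
capacities from arch-specific code, not from a function like this one.

```python
SCHED_CAPACITY_SCALE = 1024  # capacity assigned to the most capable CPU

def normalize_capacities(raw_throughput):
    """Scale raw per-CPU throughput at max frequency into the 1024 range."""
    top = max(raw_throughput.values())
    return {cpu: raw * SCHED_CAPACITY_SCALE // top
            for cpu, raw in raw_throughput.items()}

# Hypothetical benchmark scores at the highest frequency of each CPU.
capacity = normalize_capacities({0: 300, 1: 300, 2: 600, 3: 600})

def fits(task_util, cpu):
    """Utilization and capacity share one scale, so they compare directly."""
    return task_util <= capacity[cpu]
```

With these numbers the two slower CPUs get a capacity of 512, so a task with a
utilization of 600 can only be served by one of the two faster CPUs.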

The rest of platform knowledge used by EAS is directly read from the Energy
Model (EM) framework. The EM of a platform is composed of a power cost table
per 'performance domain' in the system (see Documentation/power/energy-model.rst
for further details about performance domains).

The scheduler manages references to the EM objects in the topology code when the
scheduling domains are built, or re-built. For each root domain (rd), the
scheduler maintains a singly linked list of all performance domains intersecting
the current rd->span. Each node in the list contains a pointer to a struct
em_perf_domain as provided by the EM framework.

The lists are attached to the root domains in order to cope with exclusive
cpuset configurations. Since the boundaries of exclusive cpusets do not
necessarily match those of performance domains, the lists of different root
domains can contain duplicate elements.

Example 1.
    Let us consider a platform with 12 CPUs, split in 3 performance domains
    (pd0, pd4 and pd8), organized as follows::

                  CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
                  PDs:   |--pd0--|--pd4--|---pd8---|
                  RDs:   |----rd1----|-----rd2-----|

    Now, consider that userspace decided to split the system with two
    exclusive cpusets, hence creating two independent root domains, each
    containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
    above figure. Since pd4 intersects with both rd1 and rd2, it will be
    present in the linked list '->pd' attached to each of them:

       * rd1->pd: pd0 -> pd4
       * rd2->pd: pd4 -> pd8

    Please note that the scheduler will create two duplicate list nodes for
    pd4 (one for each list). However, both just hold a pointer to the same
    shared data structure of the EM framework.
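
The list construction of Example 1 can be modelled with plain sets. The Python
sketch below is only an illustration of the intersection rule; it uses Python
lists where the scheduler uses linked lists of nodes pointing to struct
em_perf_domain, and the function name is made up for this example.

```python
# Performance domains and root domains from Example 1, as sets of CPU ids.
perf_domains = {
    "pd0": set(range(0, 4)),
    "pd4": set(range(4, 8)),
    "pd8": set(range(8, 12)),
}
root_domains = {"rd1": set(range(0, 6)), "rd2": set(range(6, 12))}

def build_pd_lists(root_domains, perf_domains):
    """For each root domain, list the performance domains intersecting its span."""
    return {
        rd: [pd for pd, cpus in perf_domains.items() if cpus & span]
        for rd, span in root_domains.items()
    }

pd_lists = build_pd_lists(root_domains, perf_domains)
```

pd4 shows up in both lists because its CPUs straddle the boundary between the
two exclusive cpusets.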

Since the access to these lists can happen concurrently with hotplug and other
things, they are protected by RCU, like the rest of topology structures
manipulated by the scheduler.

EAS also maintains a static key (sched_energy_present) which is enabled when at
least one root domain meets all conditions for EAS to start. Those conditions
are summarized in Section 6.


4. Energy-Aware task placement
------------------------------

EAS overrides the CFS task wake-up balancing code. It uses the EM of the
platform and the PELT signals to choose an energy-efficient target CPU during
wake-up balance. When EAS is enabled, select_task_rq_fair() calls
find_energy_efficient_cpu() to do the placement decision. This function looks
for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
each performance domain since it is the one which will allow us to keep the
frequency the lowest. Then, the function checks if placing the task there could
save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
in its previous activation.
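
The first step, picking the CPU with the highest spare capacity in each
performance domain, can be sketched in Python as follows. The platform layout
and the numbers are hypothetical, and the helper is a simplification: it only
models the "capacity - utilization" comparison, not the other checks done by
find_energy_efficient_cpu().

```python
def max_spare_capacity_cpu(cpus, capacity, util):
    """Return the CPU with the largest (capacity - utilization) among cpus."""
    return max(cpus, key=lambda cpu: capacity[cpu] - util[cpu])

# Hypothetical platform: two little CPUs (0, 1) and two big CPUs (2, 3).
capacity = {0: 512, 1: 512, 2: 1024, 3: 1024}
util = {0: 400, 1: 100, 2: 600, 3: 500}
perf_domains = [(0, 1), (2, 3)]

# One candidate per performance domain.
candidates = [max_spare_capacity_cpu(pd, capacity, util) for pd in perf_domains]
```

Only these per-domain candidates, plus prev_cpu, then need an energy estimate.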

find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
energy consumed by the system if the waking task was migrated. compute_energy()
looks at the current utilization landscape of the CPUs and adjusts it to
'simulate' the task migration. The EM framework provides the em_pd_energy() API
which computes the expected energy consumption of each performance domain for
the given utilization landscape.

An example of energy-optimized task placement decision is detailed below.

Example 2.
    Let us consider a (fake) platform with 2 independent performance domains
    composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
    are big.

    The scheduler must decide where to place a task P whose util_avg = 200
    and prev_cpu = 0.

    The current utilization landscape of the CPUs is depicted on the graph
    below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
    Each performance domain has three Operating Performance Points (OPPs).
    The CPU capacity and power cost associated with each OPP is listed in
    the Energy Model table. The util_avg of P is shown on the figures
    below as 'PP'::

     CPU util.
      1024                 - - - - - - -              Energy Model
                                               +-----------+-------------+
                                               |  Little   |     Big     |
       768                 =============       +-----+-----+------+------+
                                               | Cap | Pwr | Cap  | Pwr  |
                                               +-----+-----+------+------+
       512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
                             ##     ##         | 341 | 150 | 768  | 800  |
       341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
             PP              ##     ##         +-----+-----+------+------+
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3

      Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##


    find_energy_efficient_cpu() will first look for the CPUs with the
    maximum spare capacity in the two performance domains. In this example,
    CPU1 and CPU3. Then it will estimate the energy of the system if P was
    placed on either of them, and check if that would save some energy
    compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
    (which is coherent with the behaviour of the schedutil CPUFreq
    governor, see Section 6. for more details on this topic).

    **Case 1. P is migrated to CPU1**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 200 / 341 * 150 = 88
                                             * CPU1: 300 / 341 * 150 = 131
                                             * CPU2: 600 / 768 * 800 = 625
       512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                             ##     ##          => total_energy = 1364
       341  ===========      ##     ##
                    PP       ##     ##
       170  -## - - PP-      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    **Case 2. P is migrated to CPU3**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 200 / 341 * 150 = 88
                                             * CPU1: 100 / 341 * 150 = 43
                                    PP       * CPU2: 600 / 768 * 800 = 625
       512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
                             ##     ##          => total_energy = 1485
       341  ===========      ##     ##
                             ##     ##
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    **Case 3. P stays on prev_cpu / CPU 0**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 400 / 512 * 300 = 234
                                             * CPU1: 100 / 512 * 300 = 58
                                             * CPU2: 600 / 768 * 800 = 625
       512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                             ##     ##          => total_energy = 1437
       341  -PP - - - -      ##     ##
             PP              ##     ##
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    From these calculations, Case 1 has the lowest total energy. So CPU1
    is the best candidate from an energy-efficiency standpoint.
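
The arithmetic of Example 2 can be replayed with the Python sketch below. It
is a simplified model written for this document, not the kernel's
compute_energy() or em_pd_energy(): it assumes each performance domain runs at
the lowest OPP whose capacity covers its busiest CPU, and it works in floating
point, so totals differ from the figures above by a unit or two of rounding.

```python
# Energy Model from Example 2: per-domain list of (capacity, power) OPPs.
EM = {
    "little": {"cpus": (0, 1), "opps": [(170, 50), (341, 150), (512, 300)]},
    "big": {"cpus": (2, 3), "opps": [(512, 400), (768, 800), (1024, 1700)]},
}
# Utilization landscape with P (util_avg = 200) accounted on CPU0.
BASE_UTIL = {0: 400, 1: 100, 2: 600, 3: 500}
P_UTIL, PREV_CPU = 200, 0

def domain_energy(opps, utils):
    """Pick the lowest OPP covering the busiest CPU, then sum util / cap * power."""
    cap, power = next((c, p) for c, p in opps if c >= max(utils))
    return sum(u / cap * power for u in utils)

def total_energy(dst_cpu):
    util = dict(BASE_UTIL)
    util[PREV_CPU] -= P_UTIL  # remove P from the CPU it last ran on
    util[dst_cpu] += P_UTIL   # and simulate it on the destination
    return sum(domain_energy(pd["opps"], [util[c] for c in pd["cpus"]])
               for pd in EM.values())

# Cases 1, 2 and 3: P on CPU1, on CPU3, or left on CPU0.
energy = {cpu: total_energy(cpu) for cpu in (1, 3, 0)}
best_cpu = min(energy, key=energy.get)
```

The ordering of the three cases is the same as in the example: CPU1 is
cheapest, then CPU0, then CPU3.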

Big CPUs are generally more power hungry than the little ones and are thus used
mainly when a task doesn't fit the littles. However, little CPUs aren't always
necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
of the little CPUs can be less energy-efficient than the lowest OPPs of the
bigs, for example. So, if the little CPUs happen to have enough utilization at
a specific point in time, a small task waking up at that moment could be better
off executing on the big side in order to save energy, even though it would fit
on the little side.

And even in the case where all OPPs of the big CPUs are less energy-efficient
than those of the little, using the big CPUs for a small task might still, under
specific conditions, save energy. Indeed, placing a task on a little CPU can
result in raising the OPP of the entire performance domain, and that will
increase the cost of the tasks already running there. If the waking task is
placed on a big CPU, its own execution cost might be higher than if it was
running on a little, but it won't impact the other tasks of the little CPUs
which will keep running at a lower OPP. So, when considering the total energy
consumed by CPUs, the extra cost of running that one task on a big core can be
smaller than the cost of raising the OPP on the little CPUs for all the other
tasks.
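
The second situation can be reproduced with the Energy Model table of
Example 2 and a made-up utilization landscape. The Python sketch below is
illustrative only; the utilization values are assumptions chosen so that
adding the task to a little CPU pushes the little domain to its next OPP.

```python
LITTLE_OPPS = [(170, 50), (341, 150), (512, 300)]   # (capacity, power)
BIG_OPPS = [(512, 400), (768, 800), (1024, 1700)]

def domain_energy(opps, utils):
    """Lowest OPP covering the busiest CPU, then sum util / cap * power."""
    cap, power = next((c, p) for c, p in opps if c >= max(utils))
    return sum(u / cap * power for u in utils)

# Hypothetical utilizations: both littles sit just below the lowest OPP.
little, big, task = [160, 160], [300, 300], 50

before = domain_energy(LITTLE_OPPS, little) + domain_energy(BIG_OPPS, big)
# Task on a little CPU: the whole little domain moves to the 341 OPP.
on_little = (domain_energy(LITTLE_OPPS, [little[0] + task, little[1]])
             + domain_energy(BIG_OPPS, big))
# Task on a big CPU: the big domain stays at its current OPP.
on_big = (domain_energy(LITTLE_OPPS, little)
          + domain_energy(BIG_OPPS, [big[0] + task, big[1]]))
```

With these numbers the big placement adds roughly 39 units of energy while the
little placement adds roughly 69, even though every little OPP has a better
power/capacity ratio than the lowest big OPP.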

The examples above would be nearly impossible to get right in a generic way, and
for all platforms, without knowing the cost of running at different OPPs on all
CPUs of the system. Thanks to its EM-based design, EAS should cope with them
correctly without too much trouble. However, in order to ensure a minimal
impact on throughput for high-utilization scenarios, EAS also implements another
mechanism called 'over-utilization'.


5. Over-utilization
-------------------

From a general standpoint, the use-cases where EAS can help the most are those
involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
being run, they will require all of the available CPU capacity, and there isn't
much that can be done by the scheduler to save energy without severely harming
throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
'over-utilized' as soon as they are used at more than 80% of their compute
capacity. As long as no CPUs are over-utilized in a root domain, load balancing
is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
the most energy efficient CPUs of the system more than the others if that can be
done without harming throughput. So, the load-balancer is disabled to prevent
it from breaking the energy-efficient task placement found by EAS. It is safe to
do so when the system isn't overutilized since being below the 80% tipping point
implies that:

    a. there is some idle time on all CPUs, so the utilization signals used by
       EAS are likely to accurately represent the 'size' of the various tasks
       in the system;
    b. all tasks should already be provided with enough CPU capacity,
       regardless of their nice values;
    c. since there is spare capacity all tasks must be blocking/sleeping
       regularly and balancing at wake-up is sufficient.

As soon as one CPU goes above the 80% tipping point, at least one of the three
assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
is raised for the entire root domain, EAS is disabled, and the load-balancer is
re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
wake-up and load balance under CPU-bound conditions. This better respects the
nice values of tasks.

Since the notion of overutilization largely relies on detecting whether or not
there is some idle time in the system, the CPU capacity 'stolen' by higher
(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
such, the detection of overutilization accounts for the capacity used not only
by CFS tasks, but also by the other scheduling classes and IRQ.
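
The tipping point can be expressed as a small predicate. The Python sketch
below is a simplified model with made-up function names; it assumes the
utilization passed in already includes the contribution of the other
scheduling classes and IRQ, and it uses integer arithmetic for the 80% ratio.

```python
def cpu_overutilized(util, capacity):
    """True when utilization is above 80% of the CPU capacity."""
    return util * 5 > capacity * 4

def rd_overutilized(utils, capacities):
    """The flag covers the whole root domain: one busy CPU is enough."""
    return any(cpu_overutilized(u, c) for u, c in zip(utils, capacities))

capacities = [512, 512, 1024, 1024]
# No CPU above 80%: EAS placement stays in charge of wake-ups.
eas_active = not rd_overutilized([400, 100, 600, 500], capacities)
# CPU0 at 420 / 512 (about 82%): the root domain falls back to load balancing.
fallback = rd_overutilized([420, 100, 600, 500], capacities)
```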


6. Dependencies and requirements for EAS
----------------------------------------

Energy Aware Scheduling depends on the CPUs of the system having specific
hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.


6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
domains are built.

See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
flag to be set in the sched_domain hierarchy.

Please note that EAS is not fundamentally incompatible with SMP, but no
significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.


6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
order to make EAS start. To do so, please refer to documentation of the
independent EM framework in Documentation/power/energy-model.rst.

Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.


6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The task wake-up path is very latency-sensitive. When the EM of a platform is
too complex (too many CPUs, too many performance domains, too many performance
states, ...), the cost of using it in the wake-up path can become prohibitive.
The energy-aware wake-up algorithm has a complexity of:

        C = Nd * (Nc + Ns)

with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).

A complexity check is performed at the root domain level, when scheduling
domains are built. EAS will not start on a root domain if its C happens to be
higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
time of writing).
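
The check can be written down directly from the formula. The Python sketch
below only restates the arithmetic above; the function names are made up for
this example and the platform sizes are hypothetical.

```python
EM_MAX_COMPLEXITY = 2048  # threshold quoted above

def em_complexity(nr_pd, nr_cpus, nr_opps):
    """C = Nd * (Nc + Ns) for one root domain."""
    return nr_pd * (nr_cpus + nr_opps)

def eas_can_start(nr_pd, nr_cpus, nr_opps):
    """EAS is refused only when C is higher than the threshold."""
    return em_complexity(nr_pd, nr_cpus, nr_opps) <= EM_MAX_COMPLEXITY

# Two performance domains, 8 CPUs, 4 OPPs per domain (Ns = 8).
small = em_complexity(2, 8, 8)
# A hypothetical large system: 16 domains, 128 CPUs, 64 OPPs in total.
large = em_complexity(16, 128, 64)
```

The small platform has C = 32 and passes easily, while the large one reaches
C = 3072 and would need to be split into smaller root domains.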

If you really want to use EAS but the complexity of your platform's Energy
Model is too high to be used with a single root domain, you're left with only
two possible options:

    1. split your system into separate, smaller, root domains using exclusive
       cpusets and enable EAS locally on each of them. This option has the
       benefit of working out of the box but the drawback of preventing load
       balance between root domains, which can result in an unbalanced system
       overall;
    2. submit patches to reduce the complexity of the EAS wake-up algorithm,
       hence enabling it to cope with larger EMs in reasonable time.


6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^

EAS tries to predict at which OPP the CPUs will be running in the near future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
of CPUs follow their utilization.

Although it is very difficult to provide hard guarantees regarding the accuracy
of this assumption in practice (because the hardware might not do what it is
told to do, for example), schedutil as opposed to other CPUFreq governors at
least _requests_ frequencies calculated using the utilization signals.
Consequently, the only sane governor to use together with EAS is schedutil,
because it is the only one providing some degree of consistency between
frequency requests and energy predictions.

Using EAS with any other governor than schedutil is not recommended.
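
What "OPPs follow utilization" means can be illustrated with the Python sketch
below. It is a simplified model, not the governor's code: the function names,
the frequency table and the 1.25 headroom factor are assumptions made for this
example.

```python
def requested_freq(util, max_capacity, max_freq, headroom=1.25):
    """Frequency request proportional to utilization, capped at max_freq."""
    return min(max_freq, int(headroom * max_freq * util / max_capacity))

def pick_opp(freqs, target):
    """Lowest available frequency that satisfies the request."""
    return next((f for f in sorted(freqs) if f >= target), max(freqs))

# Hypothetical frequency table in kHz for one performance domain.
freqs = [500000, 1000000, 1500000, 2000000]
# A CPU at half of its capacity asks for 1.25 * 2000000 / 2 kHz.
target = requested_freq(512, 1024, 2000000)
opp = pick_opp(freqs, target)
```

Because the requested frequency is a function of utilization alone, the OPP
that a given utilization landscape leads to can be predicted, which is what
the energy estimates rely on.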


6.5 - Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to make accurate predictions across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
be obtained using the architecture-defined arch_scale_{cpu,freq}_capacity()
callbacks.

Using EAS on a platform that doesn't implement these two callbacks is not
supported.
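
The effect of the two scaling factors can be illustrated as follows. This
Python sketch is a deliberately simplified model of the idea, not the PELT
implementation: it assumes both factors are expressed in the 1024 range and
applies them to a plain busy-time fraction.

```python
SCHED_CAPACITY_SCALE = 1024

def invariant_util(busy_fraction, freq_scale, cpu_scale):
    """Scale the raw busy fraction by the frequency and CPU capacity factors."""
    return int(busy_fraction * freq_scale * cpu_scale / SCHED_CAPACITY_SCALE)

# A little CPU (capacity 512) fully busy at half of its maximum frequency...
on_little = invariant_util(1.0, freq_scale=512, cpu_scale=512)
# ...does the same amount of work as a big CPU (capacity 1024) that is
# 25% busy at its maximum frequency.
on_big = invariant_util(0.25, freq_scale=1024, cpu_scale=1024)
```

Both cases yield the same utilization value, which is what allows EAS to
compare tasks and CPUs across performance domains and OPPs.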


6.6 - Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent
CPUs, which can actually be counter-productive for both performance and energy.

EAS on SMT is not supported.