=======================
Energy Aware Scheduling
=======================

1. Introduction
---------------

Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
the impact of its decisions on the energy consumed by CPUs. EAS relies on an
Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
with a minimal impact on throughput. This document provides an introduction to
how EAS works, describes the main design decisions behind it, and details what
is needed to get it to run.

Before going any further, please note that at the time of writing::

   /!\ EAS does not support platforms with symmetric CPU topologies /!\

EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
because this is where the potential for saving energy through scheduling is
the highest.

The actual EM used by EAS is _not_ maintained by the scheduler, but by a
dedicated framework. For details about this framework and what it provides,
please refer to its documentation (see Documentation/power/energy-model.rst).


2. Background and Terminology
-----------------------------

To make it clear from the start:

 - energy = [joule] (resource like a battery on powered devices)
 - power = energy/time = [joule/second] = [watt]

The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize::

        performance [inst/s]
        --------------------
            power [W]

which is equivalent to minimizing::

        energy [J]
        -----------
        instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance.

The idea behind introducing an EM is to allow the scheduler to evaluate the
implications of its decisions rather than blindly applying energy-saving
techniques that may have positive effects only on some platforms. At the same
time, the EM must be as simple as possible to minimize the scheduler latency
impact.

In short, EAS changes the way CFS tasks are assigned to CPUs.
When it is time for the scheduler to decide where a task should run (during
wake-up), the EM is used to break the tie between several good CPU candidates
and pick the one that is predicted to yield the best energy consumption without
harming the system's throughput. The predictions made by EAS rely on specific
elements of knowledge about the platform's topology, which include the
'capacity' of CPUs, and their respective energy costs.


3. Topology information
-----------------------

EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
differentiate CPUs with different computing throughput. The 'capacity' of a CPU
represents the amount of work it can absorb when running at its highest
frequency compared to the most capable CPU of the system. Capacity values are
normalized in a 1024 range, and are comparable with the utilization signals of
tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
to capacity and utilization values, EAS is able to estimate how big/busy a
task/CPU is, and to take this into consideration when evaluating performance vs
energy trade-offs. The capacity of CPUs is provided via arch-specific code
through the arch_scale_cpu_capacity() callback.

The rest of platform knowledge used by EAS is directly read from the Energy
Model (EM) framework.
The EM of a platform is composed of a power cost table per 'performance domain'
in the system (see Documentation/power/energy-model.rst for further details
about performance domains).

The scheduler manages references to the EM objects in the topology code when the
scheduling domains are built, or re-built. For each root domain (rd), the
scheduler maintains a singly linked list of all performance domains intersecting
the current rd->span. Each node in the list contains a pointer to a struct
em_perf_domain as provided by the EM framework.

The lists are attached to the root domains in order to cope with exclusive
cpuset configurations. Since the boundaries of exclusive cpusets do not
necessarily match those of performance domains, the lists of different root
domains can contain duplicate elements.

Example 1.
    Let us consider a platform with 12 CPUs, split in 3 performance domains
    (pd0, pd4 and pd8), organized as follows::

        CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
        PDs:   |--pd0--|--pd4--|---pd8---|
        RDs:   |----rd1----|-----rd2-----|

    Now, consider that userspace decided to split the system with two
    exclusive cpusets, hence creating two independent root domains, each
    containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
    above figure. Since pd4 intersects with both rd1 and rd2, it will be
    present in the linked list '->pd' attached to each of them:

       * rd1->pd: pd0 -> pd4
       * rd2->pd: pd4 -> pd8

    Please note that the scheduler will create two duplicate list nodes for
    pd4 (one for each list). However, both just hold a pointer to the same
    shared data structure of the EM framework.

Since the access to these lists can happen concurrently with hotplug and other
things, they are protected by RCU, like the rest of topology structures
manipulated by the scheduler.

EAS also maintains a static key (sched_energy_present) which is enabled when at
least one root domain meets all conditions for EAS to start. Those conditions
are summarized in Section 6.


4. Energy-Aware task placement
------------------------------

EAS overrides the CFS task wake-up balancing code. It uses the EM of the
platform and the PELT signals to choose an energy-efficient target CPU during
wake-up balance. When EAS is enabled, select_task_rq_fair() calls
find_energy_efficient_cpu() to do the placement decision. This function looks
for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
each performance domain since it is the one which will allow us to keep the
frequency the lowest. Then, the function checks if placing the task there could
save energy compared to leaving it on prev_cpu, i.e.
the CPU where the task ran in its previous activation.

find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
energy consumed by the system if the waking task was migrated. compute_energy()
looks at the current utilization landscape of the CPUs and adjusts it to
'simulate' the task migration. The EM framework provides the em_pd_energy() API
which computes the expected energy consumption of each performance domain for
the given utilization landscape.

An example of energy-optimized task placement decision is detailed below.

Example 2.
    Let us consider a (fake) platform with 2 independent performance domains
    composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
    are big.

    The scheduler must decide where to place a task P whose util_avg = 200
    and prev_cpu = 0.

    The current utilization landscape of the CPUs is depicted on the graph
    below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
    Each performance domain has three Operating Performance Points (OPPs).
    The CPU capacity and power cost associated with each OPP is listed in
    the Energy Model table. The util_avg of P is shown on the figures
    below as 'PP'::

         CPU util.
          1024                 - - - - - - -              Energy Model
                                                   +-----------+-------------+
                                                   |  Little   |     Big     |
           768                 =============       +-----+-----+------+------+
                                                   | Cap | Pwr | Cap  | Pwr  |
                                                   +-----+-----+------+------+
           512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
                                 ##     ##         | 341 | 150 | 768  | 800  |
           341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
                 PP              ##     ##         +-----+-----+------+------+
           170  -## - - - -      ##     ##
                 ##     ##       ##     ##
               ------------    -------------
                CPU0   CPU1     CPU2   CPU3

          Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##


    find_energy_efficient_cpu() will first look for the CPUs with the
    maximum spare capacity in the two performance domains. In this example,
    CPU1 and CPU3. Then it will estimate the energy of the system if P was
    placed on either of them, and check if that would save some energy
    compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
    (which is coherent with the behaviour of the schedutil CPUFreq
    governor, see Section 6.4 for more details on this topic).

    **Case 1. P is migrated to CPU1**::

          1024                 - - - - - - -

                                                Energy calculation:
           768                 =============     * CPU0: 200 / 341 * 150 = 88
                                                 * CPU1: 300 / 341 * 150 = 131
                                                 * CPU2: 600 / 768 * 800 = 625
           512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                                 ##     ##          => total_energy = 1364
           341  ===========      ##     ##
                        PP       ##     ##
           170  -## - - PP-      ##     ##
                 ##     ##       ##     ##
               ------------    -------------
                CPU0   CPU1     CPU2   CPU3


    **Case 2. P is migrated to CPU3**::

          1024                 - - - - - - -

                                                Energy calculation:
           768                 =============     * CPU0: 200 / 341 * 150 = 88
                                                 * CPU1: 100 / 341 * 150 = 43
                                        PP       * CPU2: 600 / 768 * 800 = 625
           512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
                                 ##     ##          => total_energy = 1485
           341  ===========      ##     ##
                                 ##     ##
           170  -## - - - -      ##     ##
                 ##     ##       ##     ##
               ------------    -------------
                CPU0   CPU1     CPU2   CPU3


    **Case 3. P stays on prev_cpu / CPU 0**::

          1024                 - - - - - - -

                                                Energy calculation:
           768                 =============     * CPU0: 400 / 512 * 300 = 234
                                                 * CPU1: 100 / 512 * 300 = 58
                                                 * CPU2: 600 / 768 * 800 = 625
           512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                                 ##     ##          => total_energy = 1437
           341  -PP - - - -      ##     ##
                 PP              ##     ##
           170  -## - - - -      ##     ##
                 ##     ##       ##     ##
               ------------    -------------
                CPU0   CPU1     CPU2   CPU3


    From these calculations, Case 1 has the lowest total energy. So CPU1
    is the best candidate from an energy-efficiency standpoint.

Big CPUs are generally more power hungry than the little ones and are thus used
mainly when a task doesn't fit the littles. However, little CPUs aren't always
necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
of the little CPUs can be less energy-efficient than the lowest OPPs of the
bigs, for example. So, if the little CPUs happen to have enough utilization at
a specific point in time, a small task waking up at that moment could be better
off executing on the big side in order to save energy, even though it would fit
on the little side.

And even in the case where all OPPs of the big CPUs are less energy-efficient
than those of the little, using the big CPUs for a small task might still, under
specific conditions, save energy.
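
The totals of Example 2 can be reproduced with a short C sketch. It is only an
illustration of the arithmetic, not the kernel implementation: the struct and
function names are hypothetical, the OPP choice follows the simplification used
in the example (the lowest OPP whose capacity covers the busiest CPU of the
domain), and the energy of a domain is computed once from the summed utilization
using integer division.

```c
#include <stddef.h>

/* One Operating Performance Point: CPU capacity and power cost. */
struct opp {
    unsigned long cap;
    unsigned long pwr;
};

/* Hypothetical, simplified stand-in for a performance domain. */
struct perf_domain {
    const struct opp *opps;   /* sorted by increasing capacity */
    size_t nr_opps;
};

/* Energy Model table of Example 2. */
static const struct opp little_opps[] = { {170, 50}, {341, 150}, {512, 300} };
static const struct opp big_opps[] = { {512, 400}, {768, 800}, {1024, 1700} };

static const struct perf_domain little_pd = { little_opps, 3 };
static const struct perf_domain big_pd = { big_opps, 3 };

/*
 * Estimate the energy of one domain: select the lowest OPP that covers the
 * busiest CPU, then scale its power cost by the summed utilization.
 */
static unsigned long pd_energy(const struct perf_domain *pd,
                               const unsigned long *util, size_t nr_cpus)
{
    unsigned long max_util = 0, sum_util = 0;
    size_t i, idx;

    for (i = 0; i < nr_cpus; i++) {
        sum_util += util[i];
        if (util[i] > max_util)
            max_util = util[i];
    }

    /* Fall back to the highest OPP if no OPP covers max_util. */
    idx = pd->nr_opps - 1;
    for (i = 0; i < pd->nr_opps; i++) {
        if (pd->opps[i].cap >= max_util) {
            idx = i;
            break;
        }
    }

    return pd->opps[idx].pwr * sum_util / pd->opps[idx].cap;
}

/* util[0..1] are the little CPUs, util[2..3] the big CPUs. */
static unsigned long system_energy(const unsigned long util[4])
{
    return pd_energy(&little_pd, &util[0], 2) +
           pd_energy(&big_pd, &util[2], 2);
}
```

Called with the utilizations {200, 300, 600, 500} (Case 1), {200, 100, 600, 700}
(Case 2) and {400, 100, 600, 500} (Case 3), system_energy() returns 1364, 1485
and 1437. Case 3 costs more than Case 1 only because CPU0's utilization of 400
pushes the whole little domain to its 512-capacity OPP.
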
Indeed, placing a task on a little CPU can result in raising the OPP of the
entire performance domain, and that will increase the cost of the tasks already
running there. If the waking task is placed on a big CPU, its own execution cost
might be higher than if it was running on a little, but it won't impact the
other tasks of the little CPUs which will keep running at a lower OPP. So, when
considering the total energy consumed by CPUs, the extra cost of running that
one task on a big core can be smaller than the cost of raising the OPP on the
little CPUs for all the other tasks.

The examples above would be nearly impossible to get right in a generic way, and
for all platforms, without knowing the cost of running at different OPPs on all
CPUs of the system. Thanks to its EM-based design, EAS should cope with them
correctly without too much trouble. However, in order to ensure a minimal
impact on throughput for high-utilization scenarios, EAS also implements another
mechanism called 'over-utilization'.


5. Over-utilization
-------------------

From a general standpoint, the use-cases where EAS can help the most are those
involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
being run, they will require all of the available CPU capacity, and there isn't
much that can be done by the scheduler to save energy without severely harming
throughput.
In order to avoid hurting performance with EAS, CPUs are flagged as
'over-utilized' as soon as they are used at more than 80% of their compute
capacity. As long as no CPUs are over-utilized in a root domain, load balancing
is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
the most energy efficient CPUs of the system more than the others if that can be
done without harming throughput. So, the load-balancer is disabled to prevent
it from breaking the energy-efficient task placement found by EAS. It is safe to
do so when the system isn't overutilized since being below the 80% tipping point
implies that:

    a. there is some idle time on all CPUs, so the utilization signals used by
       EAS are likely to accurately represent the 'size' of the various tasks
       in the system;
    b. all tasks should already be provided with enough CPU capacity,
       regardless of their nice values;
    c. since there is spare capacity all tasks must be blocking/sleeping
       regularly and balancing at wake-up is sufficient.

As soon as one CPU goes above the 80% tipping point, at least one of the three
assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
is raised for the entire root domain, EAS is disabled, and the load-balancer is
re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
wake-up and load balance under CPU-bound conditions. This better respects the
nice values of tasks.
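
The 80% tipping point can be sketched in C as follows. This is a minimal
illustration under stated assumptions: the function names are hypothetical, the
utilization passed in is assumed to be the total utilization of the CPU, and the
threshold is written as an integer comparison to avoid a division.

```c
#include <stdbool.h>
#include <stddef.h>

/* A CPU is over-utilized when util exceeds 80% of its capacity (4/5). */
static bool cpu_overutilized(unsigned long util, unsigned long capacity)
{
    return util * 5 > capacity * 4;
}

/*
 * The flag applies to the entire root domain: one over-utilized CPU is
 * enough to raise it.
 */
static bool rd_overutilized(const unsigned long *util,
                            const unsigned long *capacity, size_t nr_cpus)
{
    size_t i;

    for (i = 0; i < nr_cpus; i++) {
        if (cpu_overutilized(util[i], capacity[i]))
            return true;
    }
    return false;
}
```

On the platform of Example 2 (capacities 512, 512, 1024 and 1024), the
utilizations 400, 100, 600 and 500 leave the root domain below the tipping
point. Raising CPU0 to 450 crosses it, since 450 is more than 80% of 512.
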

Since the notion of overutilization largely relies on detecting whether or not
there is some idle time in the system, the CPU capacity 'stolen' by higher
(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
such, the detection of overutilization accounts for the capacity used not only
by CFS tasks, but also by the other scheduling classes and IRQ.


6. Dependencies and requirements for EAS
----------------------------------------

Energy Aware Scheduling depends on the CPUs of the system having specific
hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.


6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
domains are built.

See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
flag to be set in the sched_domain hierarchy.

Please note that EAS is not fundamentally incompatible with SMP, but no
significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.


6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
order to make EAS start. To do so, please refer to documentation of the
independent EM framework in Documentation/power/energy-model.rst.

Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.


6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The task wake-up path is very latency-sensitive. When the EM of a platform is
too complex (too many CPUs, too many performance domains, too many performance
states, ...), the cost of using it in the wake-up path can become prohibitive.
The energy-aware wake-up algorithm has a complexity of::

        C = Nd * (Nc + Ns)

with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).

A complexity check is performed at the root domain level, when scheduling
domains are built. EAS will not start on a root domain if its C happens to be
higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
time of writing).
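
The check can be sketched in a few lines of C. The helper names below are
hypothetical and the threshold is the value quoted above; the sketch only
evaluates the formula and does not reproduce how the kernel counts domains,
CPUs and OPPs.

```c
#include <stdbool.h>

/* Threshold quoted in this document (value at the time of writing). */
#define EM_MAX_COMPLEXITY 2048

/* C = Nd * (Nc + Ns) */
static unsigned long em_complexity(unsigned long nr_pd, unsigned long nr_cpus,
                                   unsigned long nr_opps)
{
    return nr_pd * (nr_cpus + nr_opps);
}

/* EAS does not start on a root domain whose C exceeds the threshold. */
static bool em_too_complex(unsigned long nr_pd, unsigned long nr_cpus,
                           unsigned long nr_opps)
{
    return em_complexity(nr_pd, nr_cpus, nr_opps) > EM_MAX_COMPLEXITY;
}
```

For the platform of Example 2 (2 domains, 4 CPUs, 3 OPPs per domain, so
Ns = 6), C = 2 * (4 + 6) = 20, far below the threshold. A hypothetical system
with 16 domains, 64 CPUs and 8 OPPs per domain (Ns = 128) gives
C = 16 * (64 + 128) = 3072 and would be rejected.
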

If you really want to use EAS but the complexity of your platform's Energy
Model is too high to be used with a single root domain, you're left with only
two possible options:

    1. split your system into separate, smaller, root domains using exclusive
       cpusets and enable EAS locally on each of them. This option has the
       benefit of working out of the box but the drawback of preventing load
       balance between root domains, which can result in an unbalanced system
       overall;
    2. submit patches to reduce the complexity of the EAS wake-up algorithm,
       hence enabling it to cope with larger EMs in reasonable time.


6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^

EAS tries to predict at which OPP the CPUs will be running in the close future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
of CPUs follow their utilization.

Although it is very difficult to provide hard guarantees regarding the accuracy
of this assumption in practice (because the hardware might not do what it is
told to do, for example), schedutil as opposed to other CPUFreq governors at
least _requests_ frequencies calculated using the utilization signals.
Consequently, the only sane governor to use together with EAS is schedutil,
because it is the only one providing some degree of consistency between
frequency requests and energy predictions.
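
What "requesting frequencies calculated using the utilization signals" can look
like is sketched below. The function name, the kHz unit and the 25% headroom
are assumptions of this sketch, chosen so that a CPU reaches its maximum
frequency at the 80% tipping point of Section 5. They are not taken from this
document, and the exact formula used by schedutil should be checked in the
CPUFreq documentation.

```c
/*
 * Hypothetical helper: frequency (in kHz) requested for a CPU, proportional
 * to its utilization, with 25% headroom, clamped to the maximum frequency.
 */
static unsigned long long requested_freq(unsigned long long max_freq,
                                         unsigned long long util,
                                         unsigned long long max_cap)
{
    unsigned long long freq = (max_freq + max_freq / 4) * util / max_cap;

    return freq < max_freq ? freq : max_freq;
}
```

With a maximum frequency of 2000000 kHz and a capacity of 1024, a utilization
of 512 yields a request of 1250000 kHz, and a utilization of 1024 is clamped to
2000000 kHz.
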

Using EAS with any other governor than schedutil is not recommended.


6.5 - Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to make accurate predictions across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
be obtained using the architecture-defined arch_scale_{cpu,freq}_capacity()
callbacks.

Using EAS on a platform that doesn't implement these two callbacks is not
supported.


6.6 - Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent
CPUs, which can actually be counter-productive for both performance and energy.

EAS on SMT is not supported.