=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

    (*) Introduction
        - Goals and Objectives

    (*) Theory of Operation
        - Idle Injection
        - Calibration

    (*) Performance Analysis
        - Effectiveness and Limitations
        - Power vs Performance
        - Scalability
        - Calibration
        - Comparison with Alternative Techniques

    (*) Usage and Interfaces
        - Generic Thermal Layer (sysfs)
        - Kernel APIs (TBD)

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload.
With the
development of the intel_powerclamp driver, a method of synchronizing
idle injection across all online CPU threads was introduced. The goal
is to achieve forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, and thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state residency. The intel_powerclamp driver is conceived as
such a control system, where the target set point is a user-selected
idle ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.

Injection is controlled by high priority kernel threads, spawned for
each online CPU.
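
The feedback quantity described above can be sketched in a few lines.
This is only an illustration, not the driver's code: it assumes the
residency counters advance at the same rate as the TSC, that counter
snapshots are passed in as plain integers, and that the error is taken
as actual minus target. The function names are hypothetical.

```python
# Addresses of the package C-state residency MSRs listed above.
MSR_PKG_C2_RESIDENCY = 0x60D
MSR_PKG_C3_RESIDENCY = 0x3F8
MSR_PKG_C6_RESIDENCY = 0x3F9
MSR_PKG_C7_RESIDENCY = 0x3FA

PKG_CSTATE_MSRS = (
    MSR_PKG_C2_RESIDENCY,
    MSR_PKG_C3_RESIDENCY,
    MSR_PKG_C6_RESIDENCY,
    MSR_PKG_C7_RESIDENCY,
)


def pkg_cstate_ratio(before, after, tsc_before, tsc_after):
    """Return the percentage of a sampling window spent in package C-states.

    `before` and `after` map MSR address -> counter snapshot.
    Assumption: the residency counters tick at the TSC rate, so the
    ratio of the two deltas is the residency fraction.
    """
    tsc_delta = tsc_after - tsc_before
    if tsc_delta <= 0:
        return 0
    residency_delta = sum(after[msr] - before[msr] for msr in PKG_CSTATE_MSRS)
    return 100 * residency_delta // tsc_delta


def control_error(actual_ratio, target_ratio):
    """Error term of the control loop (sign convention is an assumption)."""
    return actual_ratio - target_ratio
```

A negative error under this convention means the package idled less
than requested, so more compensation would be needed.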

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so that accumulated errors are prevented and a jittery
effect is avoided. Threads are also bound to the CPU such that they
cannot be migrated, unless the CPU is taken offline. In this case,
threads belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability across HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ scheduler tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).
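
The jiffies alignment that the diagram below labels
roundup(jiffies, interval) can be illustrated with a small sketch. It
mirrors the arithmetic of the kernel's roundup() macro; the helper
name next_injection_start and the sample jiffies values are
hypothetical.

```python
def roundup(value, multiple):
    """Round value up to the next multiple, like the kernel's roundup()."""
    return ((value + multiple - 1) // multiple) * multiple


def next_injection_start(jiffies, interval):
    """Jiffy at which a per-CPU thread would begin its next idle period.

    Every thread derives the start from the shared jiffies counter, so
    threads that wake at slightly different times still agree on it.
    """
    return roundup(jiffies, interval)


# Three threads sample jiffies at slightly different moments within the
# same interval and still compute the same start time.
starts = {next_injection_start(j, 10) for j in (1001, 1003, 1007)}
```

Because the target is a pure function of jiffies and the interval, no
per-thread state can drift, which is why errors do not accumulate.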

::

  CPU0
                    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                 duration
  CPU1
                    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                ^
                                |
                                |
                                roundup(jiffies, interval)

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP, taking into account the possibility of a
CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system whose behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback.
For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.



Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

    a) steady state error compensation
       This is to offset the error occurring when the system can
       enter idle without extra wakeups (such as external interrupts).

    b) dynamic error compensation
       When an excessive amount of wakeups occurs during idle, an
       additional idle ratio can be added to quiet interrupts, by
       slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0   0          0      0
  1   1          0      0
  2   1          1      0
  3   3          1      0
  4   3          1      0
  5   3          1      0
  6   3          1      0
  7   3          1      0
  8   3          1      0
  ...
  30  3          2      0
  31  3          2      0
  32  3          1      0
  33  3          2      0
  34  3          1      0
  35  3          2      0
  36  3          1      0
  37  3          2      0
  38  3          1      0
  39  3          2      0
  40  3          3      0
  41  3          1      0
  42  3          2      0
  43  3          1      0
  44  3          1      0
  45  3          2      0
  46  3          3      0
  47  3          0      0
  48  3          2      0
  49  3          3      0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when the confidence levels of
all adjacent ratios have reached a satisfactory level. A confidence
level is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.
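
The confidence rule above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the driver's
implementation: the threshold of 3 is inferred from the sample output
(where confidence tops out at 3), "adjacent" is taken to mean the
ratio itself plus its two neighbours, and averaging the three steady
values is one plausible way to combine them. The names parse_calib and
steady_compensation are hypothetical.

```python
CONFIDENCE_OK = 3  # assumption: highest confidence seen in the sample output


def parse_calib(text):
    """Parse powerclamp_calib-style text into {pct: (confidence, steady, dynamic)}.

    Only rows made of exactly four integers are kept; the header lines
    and the "..." elision are skipped.
    """
    calib = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            pct, confidence, steady, dynamic = map(int, fields)
            calib[pct] = (confidence, steady, dynamic)
    return calib


def steady_compensation(calib, ratio, confidence_ok=CONFIDENCE_OK):
    """Return steady state compensation for `ratio`, or 0 if not trusted.

    Compensation is applied only when the ratio and both neighbours
    have reached the confidence threshold (assumed interpretation).
    """
    entries = [calib.get(r) for r in (ratio - 1, ratio, ratio + 1)]
    if any(e is None or e[0] < confidence_ok for e in entries):
        return 0
    # Assumption: combine the three steady values by integer average.
    return sum(e[1] for e in entries) // len(entries)
```

With the rows for 39, 40 and 41 from the table above, this sketch
would average the steady values 2, 3 and 1.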

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, we have a simple algorithm that doubles the injection
ratio. A possible enhancement might be to throttle the offending IRQ,
such as delaying EOI for level triggered interrupts. But it is a
challenge to be non-intrusive to the scheduler or the IRQ core code.


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As mentioned
earlier, since interrupts are allowed during forced idle time,
excessive interrupts could reduce effectiveness. The extreme case
would be running ping -f to generate a flood of network interrupts
without much CPU acknowledgement. In this case, little can be done
from the idle injection threads.
In most
normal cases, such as copying a large file with scp, applications can
be throttled by the powerclamp driver, since slowing down the CPU also
slows down network protocol processing, which in turn reduces
interrupts.

When the controlling CPU changes control parameters at runtime, it
may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync,
and the system is thus not able to enter package C-states at the
expected ratio. But this effect is minor, in that in most cases the
target ratio is updated much less frequently than idle is injected.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
These observations are the reason the calibration code is needed.

On the IVB 8P system, compared to an offline CPU, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device.
Currently, it’s not bound to any thermal zones::

    jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
    cur_state:0
    max_state:50
    type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.

Example usage:

- To inject 25% idle time::

    $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. In that
case, top will not show the idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the same common code
path is taken as for the idle task.

In this example, 24.1% idle is shown.
This helps the system admin or
user determine the cause of slowdown, when the powerclamp driver is in
action::

  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an UltraBook user can compile the kernel under
a certain temperature (below most active trip points).
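
A PID based userspace controller of the kind mentioned above might
look roughly like the sketch below. It is an illustration only: the
gains, the class and function names, and the lookup of the cooling
device by its type file are assumptions, and the temperature source is
left to the caller. The only interface taken from this document is
writing an idle percentage between 0 and max_state to cur_state.

```python
import glob
import os


def find_powerclamp(base="/sys/class/thermal"):
    """Return the cooling device directory whose type is intel_powerclamp.

    The device number varies between systems, so the type file is
    checked instead of hard-coding a path. Returns None if not found.
    """
    pattern = os.path.join(base, "cooling_device*", "type")
    for type_path in sorted(glob.glob(pattern)):
        with open(type_path) as f:
            if f.read().strip() == "intel_powerclamp":
                return os.path.dirname(type_path)
    return None


def set_idle_percent(device_dir, percent):
    """Write the requested idle percentage to cur_state (needs root on sysfs)."""
    with open(os.path.join(device_dir, "cur_state"), "w") as f:
        f.write(str(percent))


class PidController:
    """Textbook PID loop mapping temperature error to an idle percentage."""

    def __init__(self, kp, ki, kd, setpoint, max_state=50):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.max_state = max_state
        self.integral = 0.0
        self.prev_error = None

    def update(self, temperature, dt):
        """Return the idle percentage to request, clamped to [0, max_state]."""
        error = temperature - self.setpoint
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0, min(self.max_state, int(round(output))))
```

A caller would sample the temperature periodically, pass it to
update(), and write the result with set_idle_percent(). The sketch has
no integral anti-windup, which a real controller would likely want.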