=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

	(*) Introduction
	    - Goals and Objectives

	(*) Theory of Operation
	    - Idle Injection
	    - Calibration

	(*) Performance Analysis
	    - Effectiveness and Limitations
	    - Power vs Performance
	    - Scalability
	    - Calibration
	    - Comparison with Alternative Techniques

	(*) Usage and Interfaces
	    - Generic Thermal Layer (sysfs)
	    - Kernel APIs (TBD)

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. With the
development of the intel_powerclamp driver, a method of synchronizing
idle injection across all online CPU threads was introduced. The goal
is to achieve forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.

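The error term described above can be sketched in a few lines of C. This
is only an illustration, not the driver's code: the helper names are
hypothetical, and it assumes the package residency counters advance at
the same rate as the time-stamp counter, so that the ratio of the two
deltas over a sampling window gives the residency percentage.

.. code-block:: c

   #include <stdint.h>

   /*
    * Hypothetical helper: percentage of a sampling window spent in
    * package C-states. pkg_delta is the change in the summed residency
    * counters and tsc_delta the change in the time-stamp counter over
    * the same window (assumed to tick at the same rate).
    */
   static unsigned int pkg_cstate_pct(uint64_t pkg_delta, uint64_t tsc_delta)
   {
       if (tsc_delta == 0)
           return 0;       /* empty window: nothing measured */
       if (pkg_delta >= tsc_delta)
           return 100;     /* clamp any counter skew */
       return (unsigned int)(pkg_delta * 100 / tsc_delta);
   }

   /* Control error: actual residency ratio minus target idle ratio. */
   static int idle_ratio_error(unsigned int actual_pct, unsigned int target_pct)
   {
       return (int)actual_pct - (int)target_pct;
   }

A negative error means the package spent less time idle than requested,
so more compensation would be needed on the next window.
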
Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so accumulated errors can be prevented to avoid a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The diagram below shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ scheduler tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

::

  CPU0
		    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
	  _________|            |________|            |_______
				 duration
  CPU1
		    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
	  _________|            |________|            |_______
				^
				|
				|
				roundup(jiffies, interval)

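The alignment shown in the diagram can be illustrated with a small
sketch. The function names below are hypothetical; the point is only
that rounding a shared tick counter up to a multiple of the interval
gives every CPU the same boundary, regardless of when each thread
happens to run.

.. code-block:: c

   /* Round x up to the next multiple of step (step must be > 0). */
   static unsigned long round_up_to(unsigned long x, unsigned long step)
   {
       return ((x + step - 1) / step) * step;
   }

   /*
    * Hypothetical: tick at which a per-CPU thread starts its next idle
    * period. The result depends only on the shared tick counter and
    * the interval, so threads that wake at slightly different ticks
    * still agree on the same boundary.
    */
   static unsigned long next_injection_start(unsigned long now_jiffies,
                                             unsigned long interval)
   {
       return round_up_to(now_jiffies, interval);
   }

With an interval of 10 ticks, a thread that computes its target at tick
1003 and another at tick 1007 both obtain 1010, so their idle periods
start together.
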
Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP, taking into account the possibility of a
CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and
adjusts the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.

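As a rough illustration of such a user-side controller, the sketch
below computes one PID step and clamps the result to the range accepted
by the cooling device. The structure, function name, and gains are all
hypothetical; a real controller would also need tuning and some form of
integral windup protection, which is omitted here.

.. code-block:: c

   /* Hypothetical controller state; gains are chosen by the user. */
   struct pid_state {
       double kp, ki, kd;  /* proportional, integral, derivative gains */
       double integral;    /* accumulated error */
       double prev_err;    /* error from the previous sample */
   };

   /*
    * One controller step. Returns the idle percentage to request,
    * clamped to [0, max_pct]. dt_s is the sampling period in seconds
    * and must be greater than zero.
    */
   static int pid_idle_pct(struct pid_state *s, double temp_c,
                           double target_c, double dt_s, int max_pct)
   {
       double err = temp_c - target_c;     /* positive when too hot */
       double deriv = (err - s->prev_err) / dt_s;
       double out;

       s->integral += err * dt_s;
       s->prev_err = err;

       out = s->kp * err + s->ki * s->integral + s->kd * deriv;
       if (out < 0.0)
           out = 0.0;
       if (out > max_pct)
           out = max_pct;
       return (int)(out + 0.5);            /* round to nearest percent */
   }

The returned value would then be written to the cooling device as the
requested idle percentage on each sampling period.
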
Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

	a) steady state error compensation
	This is to offset the error occurring when the system can
	enter idle without extra wakeups (such as external interrupts).

	b) dynamic error compensation
	When an excessive amount of wakeups occurs during idle, an
	additional idle ratio can be added to quiet interrupts, by
	slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, we have a simple algorithm to double the injection ratio.
A possible enhancement might be to throttle the offending IRQ, such
as delaying EOI for level triggered interrupts. But it is a challenge
to be non-intrusive to the scheduler or the IRQ core code.

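The two compensation parts can be combined roughly as follows. This is
a simplified sketch under stated assumptions, not the driver's
implementation: the names are hypothetical, the confidence threshold of
3 is taken from the sample output above, and the table is assumed to
hold one entry per percentage from 0 to 49.

.. code-block:: c

   #define MAX_IDLE_PCT  50    /* cap stated in this document */
   #define CONFIDENCE_OK 3     /* assumed threshold, see sample output */

   /* Hypothetical per-ratio calibration record. */
   struct calib_entry {
       unsigned int confidence;
       unsigned int steady_comp;
   };

   /*
    * Ratio to inject for a requested target. tbl must have
    * MAX_IDLE_PCT entries, indexed by percentage.
    */
   static unsigned int effective_ratio(const struct calib_entry *tbl,
                                       unsigned int target,
                                       int excessive_wakeups)
   {
       unsigned int ratio = target;

       /* Steady state: trusted only if the neighbours are calibrated. */
       if (target >= 1 && target + 1 < MAX_IDLE_PCT &&
           tbl[target - 1].confidence >= CONFIDENCE_OK &&
           tbl[target].confidence >= CONFIDENCE_OK &&
           tbl[target + 1].confidence >= CONFIDENCE_OK)
           ratio += tbl[target].steady_comp;

       /* Dynamic: doubling the ratio adds the target once more. */
       if (excessive_wakeups)
           ratio += target;

       return ratio > MAX_IDLE_PCT ? MAX_IDLE_PCT : ratio;
   }

For a target of 25 with a steady compensation of 2, the sketch yields
27 in the quiet case and hits the cap of 50 when excessive wakeups are
detected.
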
CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The idle injection ratio is capped at 50 percent. As mentioned
earlier, since interrupts are allowed during forced idle time,
excessive interrupts could result in less effectiveness. The extreme
case would be running ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such
as copying a large file with scp, applications can be throttled by
the powerclamp driver, since slowing down the CPU also slows down
network protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling
CPU, it may take an additional period for the rest of the CPUs to
catch up with the changes. During this time, idle injection is out of
sync, and the system cannot enter package C-states at the expected
ratio. But this effect is minor, because in most cases the target
ratio is updated much less frequently than the idle injection
frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same amount of
target idle ratio. The compensation also increases as the idle ratio
gets larger. These observations motivate the calibration code.

On the IVB 8P system, compared to an offline CPU, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1
and max_state will start the idle injection. Reading cur_state returns
the actual and current idle percentage. This may not be the same value
set by the user, because the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.

Example usage:

- To inject 25% idle time::

	$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

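The same write can be issued from a program. The sketch below is a
minimal userspace example, assuming the caller already knows the path
of the cooling device's cur_state file and its max_state value; the
function name is hypothetical, and the device number differs from
system to system.

.. code-block:: c

   #include <stdio.h>

   /*
    * Write an idle percentage to a cooling device's cur_state file.
    * Returns 0 on success, -1 if the value is out of range or the
    * write fails (for example, due to missing permissions).
    */
   static int set_idle_pct(const char *cur_state_path, int pct, int max_state)
   {
       FILE *f;
       int ok;

       if (pct < 0 || pct > max_state)
           return -1;      /* reject values the device would not accept */

       f = fopen(cur_state_path, "w");
       if (!f)
           return -1;

       ok = fprintf(f, "%d\n", pct) > 0;
       if (fclose(f) != 0) /* buffered write errors surface on close */
           ok = 0;
       return ok ? 0 : -1;
   }

Passing 0 stops idle injection, and any value from 1 to max_state
requests that idle percentage, as described above.
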
If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. In that
case, top will not show the idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, because the same code path is
taken as for the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of slowdown when the powerclamp driver is
active::


  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an Ultrabook user can compile the kernel under
a certain temperature (below most active trip points).