.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`

========================
CPU Idle Time Management
========================

:Copyright: |copy| 2018 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


Concepts
========

Modern processors are generally able to enter states in which the execution of
a program is suspended and instructions belonging to it are not fetched from
memory or executed.  Those states are the *idle* states of the processor.

Since part of the processor hardware is not used in idle states, entering them
generally allows power drawn by the processor to be reduced and, in consequence,
it is an opportunity to save energy.

CPU idle time management is an energy-efficiency feature concerned with using
the idle states of processors for this purpose.

Logical CPUs
------------

CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
is the part of the kernel responsible for the distribution of computational
work in the system).  In its view, CPUs are *logical* units.  That is, they need
not be separate physical entities and may just be interfaces appearing to
software as individual single-core processors.  In other words, a CPU is an
entity which appears to be fetching instructions that belong to one sequence
(program) from memory and executing them, but it need not work this way
physically.  Generally, three different cases can be considered here.

First, if the whole processor can only follow one sequence of instructions (one
program) at a time, it is a CPU.  In that case, if the hardware is asked to
enter an idle state, that applies to the processor as a whole.

Second, if the processor is multi-core, each core in it is able to follow at
least one program at a time.  The cores need not be entirely independent of each
other (for example, they may share caches), but still most of the time they
work physically in parallel with each other, so if each of them executes only
one program, those programs run mostly independently of each other at the same
time.  The entire cores are CPUs in that case and if the hardware is asked to
enter an idle state, that applies to the core that asked for it in the first
place, but it also may apply to a larger unit (say a "package" or a "cluster")
that the core belongs to (in fact, it may apply to an entire hierarchy of larger
units containing the core).  Namely, if all of the cores in the larger unit
except for one have been put into idle states at the "core level" and the
remaining core asks the processor to enter an idle state, that may trigger it
to put the whole larger unit into an idle state which also will affect the
other cores in that unit.

Finally, each core in a multi-core processor may be able to follow more than one
program in the same time frame (that is, each core may be able to fetch
instructions from multiple locations in memory and execute them in the same time
frame, but not necessarily entirely in parallel with each other).  In that case
the cores present themselves to software as "bundles" each consisting of
multiple individual single-core "processors", referred to as *hardware threads*
(or hyper-threads specifically on Intel hardware), that each can follow one
sequence of instructions.  Then, the hardware threads are CPUs from the CPU idle
time management perspective and if the processor is asked to enter an idle state
by one of them, the hardware thread (or CPU) that asked for it is stopped, but
nothing more happens, unless all of the other hardware threads within the same
core also have asked the processor to enter an idle state.  In that situation,
the core may be put into an idle state individually or a larger unit containing
it may be put into an idle state as a whole (if the other cores within the
larger unit are in idle states already).

Idle CPUs
---------

Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
*idle* by the Linux kernel when there are no tasks to run on them except for the
special "idle" task.

Tasks are the CPU scheduler's representation of work.  Each task consists of a
sequence of instructions to execute, or code, data to be manipulated while
running that code, and some context information that needs to be loaded into the
processor every time the task's code is run by a CPU.  The CPU scheduler
distributes work by assigning tasks to run to the CPUs present in the system.

Tasks can be in various states.  In particular, they are *runnable* if there are
no specific conditions preventing their code from being run by a CPU as long as
there is a CPU available for that (for example, they are not waiting for any
events to occur or similar).  When a task becomes runnable, the CPU scheduler
assigns it to one of the available CPUs to run and if there are no more runnable
tasks assigned to it, the CPU will load the given task's context and run its
code (from the instruction following the last one executed so far, possibly by
another CPU).  [If there are multiple runnable tasks assigned to one CPU
simultaneously, they will be subject to prioritization and time sharing in order
to allow them to make some progress over time.]

The special "idle" task becomes runnable if there are no other runnable tasks
assigned to the given CPU and the CPU is then regarded as idle.  In other words,
in Linux idle CPUs run the code of the "idle" task called *the idle loop*.  That
code may cause the processor to be put into one of its idle states, if they are
supported, in order to save energy, but if the processor does not support any
idle states, or there is not enough time to spend in an idle state before the
next wakeup event, or there are strict latency constraints preventing any of the
available idle states from being used, the CPU will simply execute more or less
useless instructions in a loop until it is assigned a new task to run.


.. _idle-loop:

The Idle Loop
=============

The idle loop code takes two major steps in every iteration of it.  First, it
calls into a code module referred to as the *governor* that belongs to the CPU
idle time management subsystem called ``CPUIdle`` to select an idle state for
the CPU to ask the hardware to enter.  Second, it invokes another code module
from the ``CPUIdle`` subsystem, called the *driver*, to actually ask the
processor hardware to enter the idle state selected by the governor.

The role of the governor is to find an idle state most suitable for the
conditions at hand.  For this purpose, idle states that the hardware can be
asked to enter by logical CPUs are represented in an abstract way independent of
the platform or the processor architecture and organized in a one-dimensional
(linear) array.  That array has to be prepared and supplied by the ``CPUIdle``
driver matching the platform the kernel is running on at the initialization
time.  This allows ``CPUIdle`` governors to be independent of the underlying
hardware and to work with any platforms that the Linux kernel can run on.

Each idle state present in that array is characterized by two parameters to be
taken into account by the governor, the *target residency* and the (worst-case)
*exit latency*.  The target residency is the minimum time the hardware must
spend in the given state, including the time needed to enter it (which may be
substantial), in order to save more energy than it would save by entering one of
the shallower idle states instead.  [The "depth" of an idle state roughly
corresponds to the power drawn by the processor in that state.]  The exit
latency, in turn, is the maximum time it will take a CPU asking the processor
hardware to enter an idle state to start executing the first instruction after a
wakeup from that state.  Note that in general the exit latency also must cover
the time needed to enter the given state in case the wakeup occurs when the
hardware is entering it and it must be entered completely to be exited in an
ordered manner.

There are two types of information that can influence the governor's decisions.
First of all, the governor knows the time until the closest timer event.  That
time is known exactly, because the kernel programs timers and it knows exactly
when they will trigger, and it is the maximum time the hardware that the given
CPU depends on can spend in an idle state, including the time necessary to enter
and exit it.  However, the CPU may be woken up by a non-timer event at any time
(in particular, before the closest timer triggers) and it generally is not known
when that may happen.  The governor can only see how much time the CPU actually
was idle after it has been woken up (that time will be referred to as the *idle
duration* from now on) and it can use that information somehow along with the
time until the closest timer to estimate the idle duration in the future.  How
the governor uses that information depends on what algorithm is implemented by
it and that is the primary reason for having more than one governor in the
``CPUIdle`` subsystem.

There are four ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_,
``ladder`` and ``haltpoll``.  Which of them is used by default depends on the
configuration of the kernel and in particular on whether or not the scheduler
tick can be `stopped by the idle loop <idle-cpus-and-tick_>`_.  Available
governors can be read from the :file:`available_governors` file, and the
governor can be changed at runtime.  The name of the ``CPUIdle`` governor
currently used by the kernel can be read from the :file:`current_governor_ro` or
:file:`current_governor` file under :file:`/sys/devices/system/cpu/cpuidle/`
in ``sysfs``.

Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
platform the kernel is running on, but there are platforms with more than one
matching driver.  For example, there are two drivers that can work with the
majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
hardcoded idle states information and the other able to read that information
from the system's ACPI tables, respectively.  Still, even in those cases, the
driver chosen at the system initialization time cannot be replaced later, so the
decision on which one of them to use has to be made early (on Intel platforms
the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
reason or if it does not recognize the processor).  The name of the ``CPUIdle``
driver currently used by the kernel can be read from the :file:`current_driver`
file under :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.

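As a minimal sketch of how the ``sysfs`` files named above can be inspected from
user space, the Python snippet below reads whichever of them exist.  The helper
name ``read_cpuidle_info`` is made up for illustration, and which of the files
are present is assumed to depend on the kernel version and configuration, so
missing ones are skipped rather than treated as errors.

```python
from pathlib import Path

# Directory named in the text above.
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpuidle")

# File names taken from the text above; not all of them need to exist.
ATTRIBUTES = (
    "available_governors",
    "current_governor",
    "current_governor_ro",
    "current_driver",
)


def read_cpuidle_info(base=CPUIDLE_DIR):
    """Return a dict with the contents of the cpuidle files found under base."""
    info = {}
    for name in ATTRIBUTES:
        try:
            info[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # File absent or unreadable on this system: skip it.
            continue
    return info


if __name__ == "__main__":
    for key, value in read_cpuidle_info().items():
        print(f"{key}: {value}")
```

Passing a different ``base`` directory makes the helper usable against a copy of
the ``sysfs`` tree, which is convenient for testing.
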

.. _idle-cpus-and-tick:

Idle CPUs and The Scheduler Tick
================================

The scheduler tick is a timer that triggers periodically in order to implement
the time sharing strategy of the CPU scheduler.  Of course, if there are
multiple runnable tasks assigned to one CPU at the same time, the only way to
allow them to make reasonable progress in a given time frame is to make them
share the available CPU time.  Namely, in rough approximation, each task is
given a slice of the CPU time to run its code, subject to the scheduling class,
prioritization and so on and when that time slice is used up, the CPU should be
switched over to running (the code of) another task.  The currently running task
may not want to give the CPU away voluntarily, however, and the scheduler tick
is there to make the switch happen regardless.  That is not the only role of the
tick, but it is the primary reason for using it.

The scheduler tick is problematic from the CPU idle time management perspective,
because it triggers periodically and relatively often (depending on the kernel
configuration, the length of the tick period is between 1 ms and 10 ms).
Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
for them to ask the hardware to enter idle states with target residencies above
the tick period length.  Moreover, in that case the idle duration of any CPU
will never exceed the tick period length and the energy used for entering and
exiting idle states due to the tick wakeups on idle CPUs will be wasted.

Fortunately, it is not really necessary to allow the tick to trigger on idle
CPUs, because (by definition) they have no tasks to run except for the special
"idle" one.  In other words, from the CPU scheduler perspective, the only user
of the CPU time on them is the idle loop.  Since the time of an idle CPU need
not be shared between multiple runnable tasks, the primary reason for using the
tick goes away if the given CPU is idle.  Consequently, it is possible to stop
the scheduler tick entirely on idle CPUs in principle, even though that may not
always be worth the effort.

Whether or not it makes sense to stop the scheduler tick in the idle loop
depends on what is expected by the governor.  First, if there is another
(non-tick) timer due to trigger within the tick range, stopping the tick clearly
would be a waste of time, even though the timer hardware may not need to be
reprogrammed in that case.  Second, if the governor is expecting a non-timer
wakeup within the tick range, stopping the tick is not necessary and it may even
be harmful.  Namely, in that case the governor will select an idle state with
the target residency within the time until the expected wakeup, so that state is
going to be relatively shallow.  The governor really cannot select a deep idle
state then, as that would contradict its own expectation of a wakeup in short
order.  Now, if the wakeup really occurs shortly, stopping the tick would be a
waste of time and in this case the timer hardware would need to be reprogrammed,
which is expensive.  On the other hand, if the tick is stopped and the wakeup
does not occur any time soon, the hardware may spend an indefinite amount of
time in the shallow idle state selected by the governor, which will be a waste
of energy.  Hence, if the governor is expecting a wakeup of any kind within the
tick range, it is better to allow the tick to trigger.  Otherwise, however, the
governor will select a relatively deep idle state, so the tick should be stopped
so that it does not wake up the CPU too early.

In any case, the governor knows what it is expecting and the decision on whether
or not to stop the scheduler tick belongs to it.  Still, if the tick has been
stopped already (in one of the previous iterations of the loop), it is better
to leave it as is and the governor needs to take that into account.

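The reasoning in the two paragraphs above can be condensed into a small decision
helper.  The following Python sketch is only an illustration of that reasoning,
not the kernel's code: the function name ``want_tick_stopped`` and its
parameters are hypothetical, and "within the tick range" is assumed to mean
"sooner than one tick period from now".

```python
def want_tick_stopped(expected_wakeup_us, tick_period_us, tick_stopped=False):
    """Decide whether the scheduler tick should be stopped (or stay stopped).

    expected_wakeup_us: time until the next wakeup of any kind (timer or
        non-timer) that the governor expects, in microseconds.
    tick_period_us: length of the tick period, in microseconds.
    tick_stopped: True if an earlier iteration already stopped the tick.
    """
    if tick_stopped:
        # The tick has been stopped already: better to leave it as is.
        return True
    # A wakeup expected within the tick range means a shallow state will be
    # selected, so let the tick trigger.  Otherwise stop it so that it does
    # not wake up the CPU too early.
    return expected_wakeup_us >= tick_period_us


if __name__ == "__main__":
    print(want_tick_stopped(500, 4000))      # wakeup expected soon
    print(want_tick_stopped(20000, 4000))    # long idle period expected
```
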
The kernel can be configured to disable stopping the scheduler tick in the idle
loop altogether.  That can be done through the build-time configuration of it
(by unsetting the ``CONFIG_NO_HZ_IDLE`` configuration option) or by passing
``nohz=off`` to it in the command line.  In both cases, as the stopping of the
scheduler tick is disabled, the governor's decisions regarding it are simply
ignored by the idle loop code and the tick is never stopped.

The systems that run kernels configured to allow the scheduler tick to be
stopped on idle CPUs are referred to as *tickless* systems and they are
generally regarded as more energy-efficient than the systems running kernels in
which the tick cannot be stopped.  If the given system is tickless, it will use
the ``menu`` governor by default and if it is not tickless, the default
``CPUIdle`` governor on it will be ``ladder``.


.. _menu-gov:

The ``menu`` Governor
=====================

The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
It is quite complex, but the basic principle of its design is straightforward.
Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
the CPU will ask the processor hardware to enter), it attempts to predict the
idle duration and uses the predicted value for idle state selection.

It first obtains the time until the closest timer event with the assumption
that the scheduler tick will be stopped.  That time, referred to as the *sleep
length* in what follows, is the upper bound on the time before the next CPU
wakeup.  It is used to determine the sleep length range, which in turn is needed
to get the sleep length correction factor.

The ``menu`` governor maintains two arrays of sleep length correction factors.
One of them is used when tasks previously running on the given CPU are waiting
for some I/O operations to complete and the other one is used when that is not
the case.  Each array contains several correction factor values that correspond
to different sleep length ranges organized so that each range represented in the
array is approximately 10 times wider than the previous one.

The correction factor for the given sleep length range (determined before
selecting the idle state for the CPU) is updated after the CPU has been woken
up and the closer the sleep length is to the observed idle duration, the closer
to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
The sleep length is multiplied by the correction factor for the range that it
falls into to obtain the first approximation of the predicted idle duration.

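To make the bookkeeping described above more concrete, here is a rough Python
sketch of two correction factor arrays indexed by sleep length range.  It is not
the kernel's implementation: the range boundaries in ``RANGE_BOUNDS_US``, the
initial factor of 1.0, the moving-average update rule and its ``weight`` are all
assumptions chosen for illustration.  Only the overall shape (two arrays, ranges
growing by about a factor of 10, factors between 0 and 1, prediction by
multiplication) comes from the text.

```python
# Hypothetical range boundaries in microseconds, each range about 10 times
# wider than the previous one.
RANGE_BOUNDS_US = (50, 500, 5_000, 50_000, 500_000)
NUM_RANGES = len(RANGE_BOUNDS_US) + 1


def range_index(sleep_length_us):
    """Return the index of the sleep length range that the value falls into."""
    for i, bound in enumerate(RANGE_BOUNDS_US):
        if sleep_length_us < bound:
            return i
    return len(RANGE_BOUNDS_US)


class CorrectionFactors:
    def __init__(self):
        # Row 0: no tasks waiting for I/O; row 1: some tasks waiting for I/O.
        self.factors = [[1.0] * NUM_RANGES for _ in range(2)]

    def predict(self, sleep_length_us, io_waiters):
        """First approximation of the idle duration: sleep length * factor."""
        row = 1 if io_waiters else 0
        return sleep_length_us * self.factors[row][range_index(sleep_length_us)]

    def update(self, sleep_length_us, observed_idle_us, io_waiters, weight=0.125):
        """Move the factor toward observed/sleep length (assumed update rule)."""
        if sleep_length_us <= 0:
            return
        row = 1 if io_waiters else 0
        col = range_index(sleep_length_us)
        ratio = min(observed_idle_us / sleep_length_us, 1.0)
        old = self.factors[row][col]
        self.factors[row][col] = old + weight * (ratio - old)
```

With this sketch, a CPU that keeps waking up long before its sleep length
expires sees the factor for that range drift toward the observed ratio, so later
predictions for the same range shrink accordingly.
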
Next, the governor uses a simple pattern recognition algorithm to refine its
idle duration prediction.  Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the average
and variance of them.  If the variance is small (smaller than 400 square
microseconds) or it is small relative to the average (the average is greater
than 6 times the standard deviation), the average is regarded as the "typical
interval" value.  Otherwise, the longest of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
interval" is determined or too many data points are disregarded, in which case
the "typical interval" is assumed to equal "infinity" (the maximum unsigned
integer value).  The "typical interval" computed this way is compared with the
sleep length multiplied by the correction factor and the minimum of the two is
taken as the predicted idle duration.

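The pattern recognition step above can be sketched in a few lines of Python.
This is an illustration under stated assumptions rather than the governor's
actual code: the text does not say how many data points are "too many" to
discard, so ``max_discarded=2`` is a guess, "infinity" is assumed to be the
maximum 32-bit unsigned value, and the threshold of 400 is taken to be in the
squared unit of the samples.

```python
INFINITY = 2**32 - 1  # assumption: "maximum unsigned integer value" is 32-bit


def typical_interval(intervals, max_discarded=2):
    """Return the "typical interval" of the saved idle durations or INFINITY."""
    samples = sorted(intervals)
    discarded = 0
    while samples:
        n = len(samples)
        avg = sum(samples) / n
        variance = sum((x - avg) ** 2 for x in samples) / n
        # Small variance in absolute terms, or avg > 6 * stddev, which is
        # the same as avg^2 > 36 * variance.
        if variance < 400 or avg * avg > 36 * variance:
            return avg
        if discarded >= max_discarded:
            break
        samples.pop()  # discard the longest remaining value
        discarded += 1
    return INFINITY


def predicted_idle_duration(intervals, corrected_sleep_length):
    """Minimum of the typical interval and the corrected sleep length."""
    return min(typical_interval(intervals), corrected_sleep_length)
```

For example, seven samples of 100 and one outlier of 10000 yield a typical
interval of 100 once the outlier has been discarded, while eight wildly
different samples end up as "infinity", in which case the corrected sleep length
alone determines the prediction.
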
Then, the governor computes an extra latency limit to help "interactive"
workloads.  It uses the observation that if the exit latency of the selected
idle state is comparable with the predicted idle duration, the total time spent
in that state probably will be very short and the amount of energy to save by
entering it will be relatively small, so likely it is better to avoid the
overhead related to entering that state and exiting it.  Thus selecting a
shallower state is likely to be a better option then.  The first approximation
of the extra latency limit is the predicted idle duration itself which
additionally is divided by a value depending on the number of tasks that
previously ran on the given CPU and are now waiting for I/O operations to
complete.  The result of that division is compared with the latency limit coming
from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
framework and the minimum of the two is taken as the limit for the idle states'
exit latency.

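The latency limit computation above amounts to one division and one minimum.
The Python sketch below shows that arithmetic; the text only says that the
divisor depends on the number of tasks waiting for I/O, so the concrete form
``1 + 10 * io_waiters`` is an assumption made for the sake of the example, as
are the function and parameter names.

```python
def exit_latency_limit(predicted_idle_us, io_waiters, pm_qos_limit_us):
    """Return the limit for the exit latency of idle states, in microseconds.

    predicted_idle_us: predicted idle duration.
    io_waiters: number of tasks that ran on this CPU and now wait for I/O.
    pm_qos_limit_us: latency limit coming from the PM QoS framework.
    """
    # Assumed divisor: grows with the number of I/O waiters, 1 when none.
    divisor = 1 + 10 * io_waiters
    interactivity_limit = predicted_idle_us / divisor
    return min(interactivity_limit, pm_qos_limit_us)
```

With these assumed numbers, a predicted idle duration of 1100 us and one task
waiting for I/O give a limit of 100 us, unless PM QoS demands something tighter.
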
Now, the governor is ready to walk the list of idle states and choose one of
them.  For this purpose, it compares the target residency of each state with
the predicted idle duration and the exit latency of it with the computed latency
limit.  It selects the state with the target residency closest to the predicted
idle duration, but still below it, and exit latency that does not exceed the
limit.

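A walk of this kind might look as follows in Python.  The sketch assumes that
the states are ordered from the shallowest to the deepest and that both
parameters grow with depth, so the walk can stop at the first state that does
not fit; it also assumes that the shallowest state is the fallback and that a
target residency equal to the prediction is acceptable.  The ``IdleState`` type
and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IdleState:
    name: str
    target_residency_us: int
    exit_latency_us: int


def select_state(states, predicted_idle_us, latency_limit_us):
    """Return the index of the deepest state that fits both constraints."""
    chosen = 0  # assumed fallback: the shallowest state
    for i, state in enumerate(states):
        if state.target_residency_us > predicted_idle_us:
            break  # this state and deeper ones need a longer idle period
        if state.exit_latency_us > latency_limit_us:
            break  # this state and deeper ones wake up too slowly
        chosen = i
    return chosen


# Made-up example table, shallowest state first.
EXAMPLE_STATES = [
    IdleState("POLL", 0, 0),
    IdleState("C1", 2, 2),
    IdleState("C1E", 20, 10),
    IdleState("C6", 600, 133),
]

if __name__ == "__main__":
    index = select_state(EXAMPLE_STATES, predicted_idle_us=500, latency_limit_us=1000)
    print(EXAMPLE_STATES[index].name)
```
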
In the final step the governor may still need to refine the idle state selection
if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
happens if the idle duration predicted by it is less than the tick period and
the tick has not been stopped already (in a previous iteration of the idle
loop).  Then, the sleep length used in the previous computations may not reflect
the real time until the closest timer event and if it really is greater than
that time, the governor may need to select a shallower state with a suitable
target residency.


.. _teo-gov:

The Timer Events Oriented (TEO) Governor
========================================

The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
for tickless systems.  It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions.  However, it applies a different approach to that problem.

First, it does not use sleep length correction factors, but instead it attempts
to correlate the observed idle duration values with the available idle states
and use that information to pick the idle state that is most likely to
"match" the upcoming CPU idle interval.  Second, it does not take the tasks
that were running on the given CPU in the past and are waiting on some I/O
operations to complete now into account at all (there is no guarantee that they
will run on the same CPU when they become runnable again) and the pattern
detection code in it avoids taking timer wakeups into account.  It also only
uses idle duration values less than the current time till the closest timer
(with the scheduler tick excluded) for that purpose.

Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
the *sleep length*, which is the time until the closest timer event with the
assumption that the scheduler tick will be stopped (that also is the upper bound
on the time until the next CPU wakeup).  That value is then used to preselect an
idle state on the basis of three metrics maintained for each idle state provided
by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.

The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
state will "match" the observed (post-wakeup) idle duration if it "matches" the
sleep length.  They both are subject to decay (after a CPU wakeup) every time
the target residency of the idle state corresponding to them is less than or
equal to the sleep length and the target residency of the next idle state is
greater than the sleep length (that is, when the idle state corresponding to
them "matches" the sleep length).  The ``hits`` metric is increased if the
former condition is satisfied and the target residency of the given idle state
is less than or equal to the observed idle duration and the target residency of
the next idle state is greater than the observed idle duration at the same time
(that is, it is increased when the given idle state "matches" both the sleep
length and the observed idle duration).  In turn, the ``misses`` metric is
increased when the given idle state "matches" the sleep length only and the
observed idle duration is too short for its target residency.

The ``early_hits`` metric measures the likelihood that a given idle state will
"match" the observed (post-wakeup) idle duration if it does not "match" the
sleep length.  It is subject to decay on every CPU wakeup and it is increased
when the idle state corresponding to it "matches" the observed (post-wakeup)
idle duration and the target residency of the next idle state is less than or
equal to the sleep length (i.e. the idle state "matching" the sleep length is
deeper than the given one).

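The update rules for the three metrics can be illustrated with the following
Python sketch.  It follows the conditions spelled out above, but the decay
amount (one eighth of the value, ``DECAY_SHIFT``) and the increment (``PULSE``)
are assumptions, as are all of the names.  The list of target residencies is
assumed to be sorted in ascending order and to start with 0, so that every
duration "matches" exactly one state.

```python
from dataclasses import dataclass

DECAY_SHIFT = 3   # assumption: each decay removes 1/8 of the value
PULSE = 1024      # assumption: amount added on each increase


@dataclass
class Metrics:
    hits: int = 0
    misses: int = 0
    early_hits: int = 0


def matching_state(target_residencies, duration):
    """Index of the deepest state whose target residency is <= duration."""
    index = 0
    for i, residency in enumerate(target_residencies):
        if residency <= duration:
            index = i
    return index


def decay(value):
    return value - (value >> DECAY_SHIFT)


def update_metrics(metrics, target_residencies, sleep_length, idle_duration):
    """Update the per-state metrics after a CPU wakeup."""
    sleep_idx = matching_state(target_residencies, sleep_length)
    idle_idx = matching_state(target_residencies, idle_duration)
    for i, m in enumerate(metrics):
        # early_hits decays on every wakeup.
        m.early_hits = decay(m.early_hits)
        if i == sleep_idx:
            # hits and misses decay only for the state matching the sleep length.
            m.hits = decay(m.hits)
            m.misses = decay(m.misses)
            if idle_idx == sleep_idx:
                m.hits += PULSE      # matches both sleep length and idle duration
            elif idle_duration < target_residencies[i]:
                m.misses += PULSE    # idle duration too short for this state
        elif i == idle_idx and i < sleep_idx:
            # Matches the idle duration while a deeper state matches the
            # sleep length.
            m.early_hits += PULSE
```
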
391*4882a593SmuzhiyunThe governor walks the list of idle states provided by the ``CPUIdle`` driver
392*4882a593Smuzhiyunand finds the last (deepest) one with the target residency less than or equal
393*4882a593Smuzhiyunto the sleep length.  Then, the ``hits`` and ``misses`` metrics of that idle
394*4882a593Smuzhiyunstate are compared with each other and it is preselected if the ``hits`` one is
395*4882a593Smuzhiyungreater (which means that that idle state is likely to "match" the observed idle
396*4882a593Smuzhiyunduration after CPU wakeup).  If the ``misses`` one is greater, the governor
397*4882a593Smuzhiyunpreselects the shallower idle state with the maximum ``early_hits`` metric
398*4882a593Smuzhiyun(or if there are multiple shallower idle states with equal ``early_hits``
399*4882a593Smuzhiyunmetric which also is the maximum, the shallowest of them will be preselected).
400*4882a593Smuzhiyun[If there is a wakeup latency constraint coming from the `PM QoS framework
401*4882a593Smuzhiyun<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
402*4882a593Smuzhiyuntarget residency within the sleep length, the deepest idle state with the exit
403*4882a593Smuzhiyunlatency within the constraint is preselected without consulting the ``hits``,
404*4882a593Smuzhiyun``misses`` and ``early_hits`` metrics.]
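The preselection step described above can be modelled in a few lines of
Python.  This is only an illustrative sketch and not the kernel implementation:
the ``IdleState`` class and the ``preselect()`` function are hypothetical
names, and the handling of a tie between ``hits`` and ``misses`` (which the
text above does not cover) is an assumption.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class IdleState:
       """Hypothetical stand-in for one idle state and its metrics."""
       target_residency_us: int
       exit_latency_us: int
       hits: int = 0
       misses: int = 0
       early_hits: int = 0


   def preselect(states, sleep_length_us, latency_limit_us=None):
       """Return the index of the preselected state (0 is the shallowest)."""
       candidate = 0
       for i, state in enumerate(states):
           if state.target_residency_us > sleep_length_us:
               break
           over_limit = (latency_limit_us is not None
                         and state.exit_latency_us > latency_limit_us)
           if over_limit:
               # The latency constraint was hit first: skip the metrics.
               return candidate
           candidate = i

       chosen = states[candidate]
       if chosen.misses > chosen.hits and candidate > 0:
           # Shallower state with the maximum early_hits; max() returns
           # the first maximum, i.e. the shallowest one on a tie.
           return max(range(candidate), key=lambda i: states[i].early_hits)
       # Assumption: a tie between hits and misses keeps the candidate.
       return candidate

The list passed to ``preselect()`` is expected to be ordered from the
shallowest idle state to the deepest one, like the list provided by the
``CPUIdle`` driver.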

Next, the governor takes several of the most recently observed idle duration
values into consideration and if at least half of them are greater than or
equal to the target residency of the preselected idle state, that idle state
becomes the final candidate to ask for.  Otherwise, the average of the most
recent idle duration values below the target residency of the preselected idle
state is computed and the governor walks the idle states shallower than the
preselected one and finds the deepest of them with the target residency within
that average.  That idle state is then taken as the final candidate to ask for.
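The refinement based on recent idle durations can be sketched in the same
spirit.  The ``refine()`` function below is a hypothetical model, not kernel
code; it takes the target residencies of the idle states (shallowest first) as
a plain list, and falling back to the shallowest state when no shallower state
fits the average is an assumption of this sketch.

.. code-block:: python

   def refine(residencies_us, preselected, recent_us):
       """Return the index of the final candidate idle state."""
       target = residencies_us[preselected]
       long_enough = sum(1 for d in recent_us if d >= target)
       if not recent_us or 2 * long_enough >= len(recent_us):
           # At least half of the recent values reach the target residency.
           return preselected

       short = [d for d in recent_us if d < target]
       average = sum(short) / len(short)
       # Walk the shallower states from the deepest one down.
       for i in range(preselected - 1, -1, -1):
           if residencies_us[i] <= average:
               return i
       return 0  # assumption: nothing fits, use the shallowest state

For example, with target residencies of 1, 100 and 1000 microseconds and the
deepest state preselected, recent idle durations of 50, 150, 200 and 2000
microseconds give an average of about 133 microseconds for the short ones, so
the state with the 100-microsecond target residency becomes the candidate.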

Still, at this point the governor may need to refine the idle state selection if
it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
generally happens if the target residency of the idle state selected so far is
less than the tick period and the tick has not been stopped already (in a
previous iteration of the idle loop).  Then, like in the ``menu`` governor
`case <menu-gov_>`_, the sleep length used in the previous computations may not
reflect the real time until the closest timer event and if it really is greater
than that time, a shallower state with a suitable target residency may need to
be selected.


.. _idle-states-representation:

Representation of Idle States
=============================

For CPU idle time management purposes, all of the physical idle states
supported by the processor have to be represented as a one-dimensional array of
|struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
the processor hardware to enter an idle state of certain properties.  If there
is a hierarchy of units in the processor, one |struct cpuidle_state| object can
cover a combination of idle states supported by the units at different levels of
the hierarchy.  In that case, the `target residency and exit latency parameters
of it <idle-loop_>`_ must reflect the properties of the idle state at the
deepest level (i.e. the idle state of the unit containing all of the other
units).

For example, take a processor with two cores in a larger unit referred to as
a "module" and suppose that asking the hardware to enter a specific idle state
(say "X") at the "core" level by one core will trigger the module to try to
enter a specific idle state of its own (say "MX") if the other core is in idle
state "X" already.  In other words, asking for idle state "X" at the "core"
level gives the hardware a license to go as deep as to idle state "MX" at the
"module" level, but there is no guarantee that this is going to happen (the core
asking for idle state "X" may just end up in that state by itself instead).
Then, the target residency of the |struct cpuidle_state| object representing
idle state "X" must reflect the minimum time to spend in idle state "MX" of
the module (including the time needed to enter it), because that is the minimum
time the CPU needs to be idle to save any energy in case the hardware enters
that state.  Analogously, the exit latency parameter of that object must cover
the exit time of idle state "MX" of the module (and usually its entry time too),
because that is the maximum delay between a wakeup signal and the time the CPU
will start to execute the first new instruction (assuming that both cores in the
module will always be ready to execute instructions as soon as the module
becomes operational as a whole).

There are processors without direct coordination between different levels of the
hierarchy of units inside them, however.  In those cases asking for an idle
state at the "core" level does not automatically affect the "module" level, for
example, in any way and the ``CPUIdle`` driver is responsible for the entire
handling of the hierarchy.  Then, the definition of the idle state objects is
entirely up to the driver, but still the physical properties of the idle state
that the processor hardware finally goes into must always follow the parameters
used by the governor for idle state selection (for instance, the actual exit
latency of that idle state must not exceed the exit latency parameter of the
idle state object selected by the governor).

In addition to the target residency and exit latency idle state parameters
discussed above, the objects representing idle states each contain a few other
parameters describing the idle state and a pointer to the function to run in
order to ask the hardware to enter that state.  Also, for each
|struct cpuidle_state| object, there is a corresponding
:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing usage
statistics of the given idle state.  That information is exposed by the kernel
via ``sysfs``.

For each CPU in the system, there is a :file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
directory in ``sysfs``, where the number ``<N>`` is assigned to the given
CPU at the initialization time.  That directory contains a set of subdirectories
called :file:`state0`, :file:`state1` and so on, up to the number of idle state
objects defined for the given CPU minus one.  Each of these directories
corresponds to one idle state object and the larger the number in its name, the
deeper the (effective) idle state represented by it.  Each of them contains
a number of files (attributes) representing the properties of the idle state
object corresponding to it, as follows:

``above``
	Total number of times this idle state was asked for, but the
	observed idle duration was certainly too short to match its target
	residency.

``below``
	Total number of times this idle state was asked for, but certainly
	a deeper idle state would have been a better match for the observed idle
	duration.

``desc``
	Description of the idle state.

``disable``
	Whether or not this idle state is disabled.

``default_status``
	The default status of this state, "enabled" or "disabled".

``latency``
	Exit latency of the idle state in microseconds.

``name``
	Name of the idle state.

``power``
	Power drawn by hardware in this idle state in milliwatts (if specified,
	0 otherwise).

``residency``
	Target residency of the idle state in microseconds.

``time``
	Total time spent in this idle state by the given CPU (as measured by the
	kernel) in microseconds.

``usage``
	Total number of times the hardware has been asked by the given CPU to
	enter this idle state.

``rejected``
	Total number of times a request to enter this idle state on the given
	CPU was rejected.

The :file:`desc` and :file:`name` files both contain strings.  The difference
between them is that the name is expected to be more concise, while the
description may be longer and it may contain white space or special characters.
The :file:`default_status` file contains one of the two strings given in its
description and the other files listed above contain integer numbers.
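As an illustration, the attributes listed above can be collected for one CPU
with a short Python sketch like the one below.  The ``read_idle_states()``
helper and its ``sysfs_root`` parameter are hypothetical and only exist to make
the example easy to try out on a copy of the directory tree; the sketch
converts values to integers where that is possible and keeps the text
attributes as strings.

.. code-block:: python

   from pathlib import Path

   STRING_ATTRS = {"name", "desc", "default_status"}


   def _parse(attr, text):
       """Return an int for numeric attributes, the stripped text otherwise."""
       text = text.strip()
       if attr in STRING_ATTRS:
           return text
       try:
           return int(text)
       except ValueError:
           return text


   def read_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
       """Return one dict of attributes per stateN directory, by index."""
       base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
       dirs = [d for d in base.glob("state*") if d.name[5:].isdigit()]
       states = []
       for state_dir in sorted(dirs, key=lambda d: int(d.name[5:])):
           attrs = {}
           for entry in state_dir.iterdir():
               if entry.is_file():
                   attrs[entry.name] = _parse(entry.name, entry.read_text())
           states.append(attrs)
       return states

The directories are sorted by the number in their names, so that
:file:`state10` comes after :file:`state2`, and the result is an empty list if
the given CPU has no :file:`cpuidle` directory.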

The :file:`disable` attribute is the only writeable one.  If it contains 1, the
given idle state is disabled for this particular CPU, which means that the
governor will never select it for this particular CPU and the ``CPUIdle``
driver will never ask the hardware to enter it for that CPU as a result.
However, disabling an idle state for one CPU does not prevent it from being
asked for by the other CPUs, so it must be disabled for all of them in order to
never be asked for by any of them.  [Note that, due to the way the ``ladder``
governor is implemented, disabling an idle state prevents that governor from
selecting any idle states deeper than the disabled one too.]

If the :file:`disable` attribute contains 0, the given idle state is enabled for
this particular CPU, but it still may be disabled for some or all of the other
CPUs in the system at the same time.  Writing 1 to it causes the idle state to
be disabled for this particular CPU and writing 0 to it allows the governor to
take it into consideration for the given CPU and the driver to ask for it,
unless that state was disabled globally in the driver (in which case it cannot
be used at all).
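Because an idle state has to be disabled for every CPU separately, doing that
by hand is tedious.  A minimal sketch of a helper that writes to the
:file:`disable` attribute of one state for all CPUs might look as follows.  The
``set_state_disabled()`` function and the ``sysfs_root`` parameter are
hypothetical, and writing to the real ``sysfs`` files normally requires root
privileges.

.. code-block:: python

   from pathlib import Path

   SYSFS_CPU = "/sys/devices/system/cpu"


   def set_state_disabled(index, disabled, sysfs_root=SYSFS_CPU):
       """Write 1 or 0 to stateN/disable for every CPU that has it."""
       pattern = f"cpu[0-9]*/cpuidle/state{index}/disable"
       written = []
       for attr in sorted(Path(sysfs_root).glob(pattern)):
           attr.write_text("1" if disabled else "0")
           written.append(attr)
       return written

The function returns the list of files it has written to, which makes it easy
to check that every CPU of interest was covered.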

The :file:`power` attribute is not defined very well, especially for idle state
objects representing combinations of idle states at different levels of the
hierarchy of units in the processor, and it generally is hard to obtain idle
state power numbers for complex hardware, so :file:`power` often contains 0 (not
available) and if it contains a nonzero number, that number may not be very
accurate and it should not be relied on for anything meaningful.

The number in the :file:`time` file generally may be greater than the total time
really spent by the given CPU in the given idle state, because it is measured by
the kernel and it does not account for the cases in which the hardware refused
to enter this idle state and entered a shallower one instead (or did not enter
any idle state at all).  The kernel can only measure the time span between
asking the hardware to enter an idle state and the subsequent wakeup of the CPU
and it cannot say what really happened in the meantime at the hardware level.
Moreover, if the idle state object in question represents a combination of idle
states at different levels of the hierarchy of units in the processor,
the kernel can never say how deep the hardware went down the hierarchy in any
particular case.  For these reasons, the only reliable way to find out how
much time has been spent by the hardware in different idle states supported by
it is to use idle state residency counters in the hardware, if available.

Generally, an interrupt received when trying to enter an idle state causes the
idle state entry request to be rejected, in which case the ``CPUIdle`` driver
may return an error code to indicate that this was the case. The :file:`usage`
and :file:`rejected` files report the number of times the given idle state
was entered successfully or rejected, respectively.

.. _cpu-pm-qos:

Power Management Quality of Service for CPUs
============================================

The power management quality of service (PM QoS) framework in the Linux kernel
allows kernel code and user space processes to set constraints on various
energy-efficiency features of the kernel to prevent performance from dropping
below a required level.

CPU idle time management can be affected by PM QoS in two ways, through the
global CPU latency limit and through the resume latency constraints for
individual CPUs.  Kernel code (e.g. device drivers) can set both of them with
the help of special internal interfaces provided by the PM QoS framework.  User
space can modify the former by opening the :file:`cpu_dma_latency` special
device file under :file:`/dev/` and writing a binary value (interpreted as a
signed 32-bit integer) to it.  In turn, the resume latency constraint for a CPU
can be modified from user space by writing a string (representing a signed
32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
``<N>`` is allocated at the system initialization time.  Negative values
will be rejected in both cases and, also in both cases, the written integer
number will be interpreted as a requested PM QoS constraint in microseconds.

The requested value is not automatically applied as a new constraint, however,
as it may be less restrictive (greater in this particular case) than another
constraint previously requested by someone else.  For this reason, the PM QoS
framework maintains a list of requests that have been made so far for the
global CPU latency limit and for each individual CPU, aggregates them and
applies the effective (minimum in this particular case) value as the new
constraint.

In fact, opening the :file:`cpu_dma_latency` special device file causes a new
PM QoS request to be created and added to a global priority list of CPU latency
limit requests and the file descriptor coming from the "open" operation
represents that request.  If that file descriptor is then used for writing, the
number written to it will be associated with the PM QoS request represented by
it as a new requested limit value.  Next, the priority list mechanism will be
used to determine the new effective value of the entire list of requests and
that effective value will be set as a new CPU latency limit.  Thus requesting a
new limit value will only change the real limit if the effective "list" value is
affected by it, which is the case if it is the minimum of the requested values
in the list.

The process holding a file descriptor obtained by opening the
:file:`cpu_dma_latency` special device file controls the PM QoS request
associated with that file descriptor, but it controls this particular PM QoS
request only.

Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
file descriptor obtained while opening it, causes the PM QoS request associated
with that file descriptor to be removed from the global priority list of CPU
latency limit requests and destroyed.  If that happens, the priority list
mechanism will be used again, to determine the new effective value for the whole
list and that value will become the new limit.
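Since the request only exists while the file descriptor is open, a user space
program has to keep the descriptor open for as long as it needs the limit.  The
following Python sketch wraps that life cycle in a context manager.  The
``cpu_latency_limit()`` name and the ``device`` parameter are hypothetical, and
opening the real device file typically requires sufficient privileges.

.. code-block:: python

   import os
   import struct
   from contextlib import contextmanager


   @contextmanager
   def cpu_latency_limit(limit_us, device="/dev/cpu_dma_latency"):
       """Hold a CPU latency limit request while the block is running."""
       if limit_us < 0:
           raise ValueError("negative latency limits are rejected")
       fd = os.open(device, os.O_WRONLY)  # creates the request
       try:
           # Binary signed 32-bit integer in native byte order.
           os.write(fd, struct.pack("=i", limit_us))
           yield
       finally:
           os.close(fd)  # removes the request from the list

A latency-sensitive section of a program could then be written as
``with cpu_latency_limit(20): ...`` and the request would be dropped on leaving
the block, even if an exception is raised inside it.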

In turn, for each CPU there is one resume latency PM QoS request associated with
the :file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
this single PM QoS request to be updated regardless of which user space
process does that.  In other words, this PM QoS request is shared by the entire
user space, so access to the file associated with it needs to be arbitrated
to avoid confusion.  [Arguably, the only legitimate use of this mechanism in
practice is to pin a process to the CPU in question and let it use the
``sysfs`` interface to control the resume latency constraint for it.]  It is
still only a request, however.  It is an entry in a priority list used to
determine the effective value to be set as the resume latency constraint for the
CPU in question every time the list of requests is updated one way or another
(there may be other requests coming from kernel code in that list).

CPU idle time governors are expected to regard the minimum of the global
(effective) CPU latency limit and the effective resume latency constraint for
the given CPU as the upper limit for the exit latency of the idle states that
they are allowed to select for that CPU.  They should never select any idle
states with exit latency beyond that limit.
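The way the two kinds of constraints combine can be illustrated with a small
model.  The function names below are hypothetical, the requests are plain lists
of values in microseconds and an empty list is treated as "no constraint",
which is an assumption of this sketch rather than a description of the default
values used by the kernel.

.. code-block:: python

   import math


   def aggregate(requests_us):
       """Effective value of a list of requests: the minimum one."""
       return min(requests_us, default=math.inf)


   def exit_latency_limit(global_requests_us, cpu_requests_us):
       """Upper limit for the exit latency on one CPU."""
       return min(aggregate(global_requests_us),
                  aggregate(cpu_requests_us))


   def selectable_states(exit_latencies_us, limit_us):
       """Indices of the idle states a governor may still select."""
       return [i for i, latency in enumerate(exit_latencies_us)
               if latency <= limit_us]

With global requests of 100 and 20 microseconds and a per-CPU request of 50
microseconds, the limit is 20 microseconds, so out of idle states with exit
latencies of 0, 2, 30 and 200 microseconds only the first two remain
selectable.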


Idle States Control Via Kernel Command Line
===========================================

In addition to the ``sysfs`` interface allowing individual idle states to be
`disabled for individual CPUs <idle-states-representation_>`_, there are kernel
command line parameters affecting CPU idle time management.

The ``cpuidle.off=1`` kernel command line option can be used to disable the
CPU idle time management entirely.  It does not prevent the idle loop from
running on idle CPUs, but it prevents the CPU idle time governors and drivers
from being invoked.  If it is added to the kernel command line, the idle loop
will ask the hardware to enter idle states on idle CPUs via the CPU architecture
support code that is expected to provide a default mechanism for this purpose.
That default mechanism usually is the least common denominator for all of the
processors implementing the architecture (i.e. CPU instruction set) in question,
however, so it is rather crude and not very energy-efficient.  For this reason,
it is not recommended for production use.

The ``cpuidle.governor=`` kernel command line switch allows the ``CPUIdle``
governor to be specified.  It has to be appended with a string matching
the name of an available governor (e.g. ``cpuidle.governor=menu``) and that
governor will be used instead of the default one.  It is possible to force
the ``menu`` governor to be used on the systems that use the ``ladder`` governor
by default this way, for example.

The other kernel command line parameters controlling CPU idle time management
described below are only relevant for the *x86* architecture and references
to ``intel_idle`` affect Intel processors only.

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  The first two of them disable the ``acpi_idle`` and
``intel_idle`` drivers altogether, which effectively causes the entire
``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the
architecture support code to deal with idle CPUs.  How it does that depends on
which of the two parameters is added to the kernel command line.  In the
``idle=halt`` case, the architecture support code will use the ``HLT``
instruction of the CPUs (which, as a rule, suspends the execution of the program
and causes the hardware to attempt to enter the shallowest available idle state)
for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
more or less "lightweight" sequence of instructions in a tight loop.  [Note
that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
CPUs from saving almost any energy at all may not be the only effect of it.
For example, on Intel hardware it effectively prevents CPUs from using
P-states (see |cpufreq|) that require any number of CPUs in a package to be
idle, so it very well may hurt the performance of single-thread computations as
well as energy-efficiency.  Thus using it for performance reasons may not be a
good idea at all.]

The ``idle=nomwait`` option prevents the use of the ``MWAIT`` instruction of
the CPU to enter idle states. When this option is used, the ``acpi_idle``
driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
running Intel processors, this option disables the ``intel_idle`` driver
and forces the use of the ``acpi_idle`` driver instead. Note that in either
case, the ``acpi_idle`` driver will function only if all the information needed
by it is in the system's ACPI tables.

In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
drivers that can be passed to them via the kernel command line.  Specifically,
the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters,
where ``<n>`` is an idle state index also used in the name of the given
state's directory in ``sysfs`` (see
`Representation of Idle States <idle-states-representation_>`_), cause the
``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
idle states deeper than idle state ``<n>``.  In that case, they will never ask
for any of those idle states or expose them to the governor.  [The behavior of
the two drivers is different for ``<n>`` equal to ``0``.  Adding
``intel_idle.max_cstate=0`` to the kernel command line disables the
``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas
``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
parameter when it is loaded.]
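To check which of the options discussed in this section were passed to the
running kernel, its command line can be inspected.  The sketch below is a
simplified illustration with a hypothetical function name; it splits the
command line on white space only and does not try to reproduce the exact
parsing rules of the kernel (such as quoting).

.. code-block:: python

   CPUIDLE_PARAMS = (
       "cpuidle.off",
       "cpuidle.governor",
       "idle",
       "intel_idle.max_cstate",
       "processor.max_cstate",
   )


   def cpuidle_cmdline_options(cmdline):
       """Return the idle-related 'name=value' options as a dict."""
       found = {}
       for token in cmdline.split():
           name, sep, value = token.partition("=")
           if sep and name in CPUIDLE_PARAMS:
               found[name] = value
       return found

On a running system the string to pass to the function can be read from
:file:`/proc/cmdline`.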