.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with :doc:`cpuidle` if you
have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.
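
As a rough illustration of what a hint carries, the sketch below splits a hint
value into two 4-bit fields.  The field layout (bits 7:4 for the target C-state
field, bits 3:0 for the sub-state field) is an assumption taken from the
Software Developer's Manual [1]_, not from this document, and the function name
``decode_mwait_hint`` is made up for the example.

```python
def decode_mwait_hint(hint):
    """Split an MWAIT hint (the EAX argument) into its two 4-bit fields.

    Assumption (from the SDM, not from this document): bits 7:4 hold the
    target C-state field and bits 3:0 hold the sub-state field.
    """
    cstate_field = (hint >> 4) & 0xF
    substate_field = hint & 0xF
    return cstate_field, substate_field
```

How the two fields map onto named idle states is processor model specific, so
the result should be read as raw field values only.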

``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).  Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]
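
The selection rule above can be modelled in a few lines.  The sketch below is
only a toy model: it assumes the ``_CST`` packages have already been parsed
into Python objects, and the ``CstEntry`` type, its field names and
``pick_cst()`` are hypothetical names that do not correspond to anything in the
driver.

```python
from typing import NamedTuple, Optional, Sequence


class CstEntry(NamedTuple):
    """One idle state description from a hypothetical, pre-parsed _CST package."""
    is_ffh: bool          # True if the state is of the FFH type (entered via MWAIT)
    mwait_hint: int
    exit_latency_us: int


def pick_cst(per_cpu_cst: Sequence[Sequence[CstEntry]]) -> Optional[Sequence[CstEntry]]:
    """Return the first package that is non-empty and FFH-only, else None."""
    for package in per_cpu_cst:
        if package and all(entry.is_ffh for entry in package):
            return package
    return None
```

The package returned by ``pick_cst()`` plays the role of the preliminary list
of idle states; a ``None`` result models the case in which no suitable ``_CST``
object has been found.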

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).  Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in :doc:`cpuidle`).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).
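
To see which idle states ended up disabled on a given CPU, the ``name`` and
``disable`` attributes can be read back from ``sysfs``.  The sketch below
assumes the usual ``/sys/devices/system/cpu/cpu<N>/cpuidle/state<i>/`` layout;
the function name and the ``sysfs_root`` parameter (present only so that the
code can be tried against a copy of the tree) are made up for the example.

```python
from pathlib import Path


def list_idle_states(cpu=0, sysfs_root="/sys"):
    """Return sorted (index, name, disabled) tuples for one CPU's idle states."""
    cpuidle_dir = Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}" / "cpuidle"
    states = []
    for state_dir in cpuidle_dir.glob("state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        # A nonzero value means the state is currently disabled for this CPU.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((index, name, disabled))
    return sorted(states)
```

Calling ``list_idle_states(0)`` on a running system returns one tuple per
``state<i>`` directory of CPU 0, or an empty list if ``CPUIdle`` is not in use.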

If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.
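
The naming and target residency rules for ACPI-derived idle states can be
written down as a small model.  The function and dictionary keys below are
illustrative names only; the arithmetic follows the description above.

```python
def acpi_state_entry(index, cst_type, exit_latency_us):
    """Model the name and target residency of an ACPI-derived idle state.

    index is the position in the final list (1 or greater, 0 is polling);
    cst_type is the ACPI C-state type (1, 2 or 3).
    """
    if index < 1:
        raise ValueError("index 0 is reserved for the polling state")
    # C1-type states reuse the exit latency; C2 and C3 use 3 times that value.
    multiplier = 1 if cst_type == 1 else 3
    return {
        "name": f"C{index}_ACPI",
        "exit_latency_us": exit_latency_us,
        "target_residency_us": exit_latency_us * multiplier,
    }
```

For instance, a C2-type state at index 2 with an exit latency of 50 us would be
named "C2_ACPI" and get a target residency of 150 us under this model.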


.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).
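
The substate check mentioned above can be illustrated as follows.  The sketch
assumes, based on the Software Developer's Manual [1]_ rather than on this
document, that the ``EDX`` output of ``CPUID`` leaf 5 packs one 4-bit substate
count per C-state (C0 through C7, lowest nibble first).  Both function names
are hypothetical and only the "total is 0" condition is modelled.

```python
def mwait_substates_per_cstate(cpuid5_edx):
    """Split the 32-bit EDX value of CPUID leaf 5 into eight 4-bit counts.

    Assumption (from the SDM): nibble N holds the number of MWAIT
    sub-states supported for C-state N.
    """
    return [(cpuid5_edx >> (4 * n)) & 0xF for n in range(8)]


def mwait_enumeration_ok(cpuid5_edx):
    """Model only the example check from the text: some sub-state must exist."""
    return sum(mwait_substates_per_cstate(cpuid5_edx)) > 0
```

The real driver checks more than this one condition; the model is meant to
show the shape of the data, not to replace reading the driver source.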

Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).  That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.
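
A quick way to check for these options from user space is to scan the kernel
command line.  The helper below is a minimal sketch; its name is made up, and
it looks only for the three options listed above.

```python
MWAIT_BLOCKING_OPTIONS = {"idle=poll", "idle=halt", "idle=nomwait"}


def mwait_blocking_options(cmdline):
    """Return the options in a kernel command line that rule out MWAIT."""
    return [word for word in cmdline.split() if word in MWAIT_BLOCKING_OPTIONS]
```

On a running system the argument would typically be the contents of
``/proc/cmdline``; a non-empty result means that ``intel_idle`` cannot have
been initialized successfully.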

Apart from that there are four module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.  In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
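
The effect of ``max_cstate`` on the list of idle states can be modelled as a
simple truncation.  The function below is a toy model with a made-up name; it
takes a list whose first element stands for the polling state and returns what
would be supplied to the ``CPUIdle`` core.

```python
def apply_max_cstate(states, max_cstate):
    """Model the effect of max_cstate on the list handed to the CPUIdle core.

    states[0] is the polling state; the rest are regular idle states.
    Returns None to model the initialization failure for max_cstate == 0.
    """
    if max_cstate == 0:
        return None
    # Keep the polling state plus at most max_cstate regular idle states.
    return states[:max_cstate + 1]
```

On the kernel command line the parameter is passed with the module name as a
prefix, for example ``intel_idle.max_cstate=1``.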

The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (they both are unset by default and
``use_acpi`` has no effect if ``no_acpi`` is set).

The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in :doc:`cpuidle`).

For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).
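
The bitmask interpretation can be checked with a short helper.  The function
name and the ``max_state_index`` parameter are illustrative; the decoding
follows the rule stated above.

```python
def states_disabled_by_default(states_off, max_state_index):
    """Return the idle state indices selected by the states_off bitmask.

    Bit positions beyond max_state_index are ignored.
    """
    return [i for i in range(max_state_index + 1) if states_off & (1 << i)]
```

With a maximum idle state index of 4, a ``states_off`` value of 3 yields
``[0, 1]`` and a value of 8 yields ``[3]``, matching the examples in the text.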

The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.
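
Enabling or disabling an idle state from user space comes down to writing 0
or 1 to its ``disable`` attribute.  The sketch below assumes the usual
``/sys/devices/system/cpu/cpu<N>/cpuidle/state<i>/disable`` path; the function
name and the ``sysfs_root`` parameter are made up for the example, and writing
to the real attribute requires sufficient privileges.

```python
from pathlib import Path


def set_idle_state_disabled(cpu, state, disabled, sysfs_root="/sys"):
    """Write the ``disable`` attribute of one idle state of one CPU."""
    attr = (Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}"
            / "cpuidle" / f"state{state}" / "disable")
    # "0" enables the state for this CPU, "1" disables it.
    attr.write_text("1" if disabled else "0")
    return attr
```

For example, ``set_idle_state_disabled(0, 3, False)`` would re-enable idle
state 3 on CPU 0 after it has been disabled by default through ``states_off``.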


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.  For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).

As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).
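
One way to apply a PM QoS restriction from user space is the
``/dev/cpu_dma_latency`` special file covered in :doc:`cpuidle`.  The sketch
below assumes the interface described there (a signed 32-bit latency limit in
microseconds is written to the file and stays in effect while the file is kept
open); the function name and the ``path`` parameter are made up for the
example.

```python
import os
import struct


def hold_cpu_latency_limit(limit_us, path="/dev/cpu_dma_latency"):
    """Open the PM QoS device file and request a CPU latency limit.

    The request stays in effect only while the returned file descriptor
    is open, so the caller must keep it and os.close() it when done.
    """
    fd = os.open(path, os.O_WRONLY)
    # The limit is passed as a signed 32-bit integer in native byte order.
    os.write(fd, struct.pack("=i", limit_us))
    return fd
```

Which limit is low enough to leave only core-level idle states available
depends on the exit latencies reported for the platform at hand, so the value
has to be chosen by inspecting the ``latency`` attributes of the idle states
in ``sysfs``.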


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications