.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology).  As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state.  Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).  In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.  To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into.  Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis.  The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling.  It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware.  They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.  That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used.  Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs.  That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as struct cpufreq_policy objects.  For consistency,
struct cpufreq_policy is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
every CPU in the system, including CPUs that are currently offline.  If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same struct cpufreq_policy object.

``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.  If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take a note of all of the already registered CPUs during the registration of the
scaling driver.  In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU.  [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation.  Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.  That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs).  That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel command line or configuration, but it may be changed
later via ``sysfs``).  First, a pointer to the new policy object is passed to
the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it.  Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective).  They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.  The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline.  The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all.  In that case, it only is
necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy.  The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
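
As a rough illustration of that layout, the shell function below prints every
``policyX`` directory together with the ``cpuY`` directories whose ``cpufreq``
symbolic links resolve to it.  It is only a sketch: the ``list_policy_cpus``
name and its optional argument (an alternative location of the CPU directory,
for trying it out away from ``sysfs``) are made up for this example and are
not part of any kernel interface::

	# Hypothetical helper, not a kernel-provided tool.
	list_policy_cpus() {
	    cpu_root="${1:-/sys/devices/system/cpu}"
	    for policy in "$cpu_root"/cpufreq/policy*; do
	        [ -d "$policy" ] || continue
	        policy_path=$(cd "$policy" && pwd -P)
	        cpus=""
	        for link in "$cpu_root"/cpu[0-9]*/cpufreq; do
	            [ -L "$link" ] || continue
	            # Resolve the cpuY/cpufreq link to the directory it points to.
	            target=$(cd "$link" 2>/dev/null && pwd -P) || continue
	            if [ "$target" = "$policy_path" ]; then
	                cpu_dir=${link%/cpufreq}
	                cpus="$cpus ${cpu_dir##*/}"
	            fi
	        done
	        echo "${policy##*/}:$cpus"
	    done
	}

Called without arguments, it walks :file:`/sys/devices/system/cpu/` and prints
one line per policy of the form ``policy0: cpu0 cpu1``.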

Some of those attributes are generic.  They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy.  Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
	List of online CPUs belonging to this policy (i.e. sharing the hardware
	performance scaling interface represented by the ``policyX`` policy
	object).

``bios_limit``
	If the platform firmware (BIOS) tells the OS to apply an upper limit to
	CPU frequencies, that limit will be reported through this attribute (if
	present).

	The existence of the limit may be a result of some (often unintentional)
	BIOS settings, restrictions coming from a service processor or other
	BIOS/HW-based mechanisms.

	This does not cover ACPI thermal limitations which can be discovered
	through a generic thermal driver.

	This attribute is not present if the scaling driver in use does not
	support it.

``cpuinfo_cur_freq``
	Current frequency of the CPUs belonging to this policy as obtained from
	the hardware (in kHz).

	This is expected to be the frequency the hardware actually runs at.
	If that frequency cannot be determined, this attribute should not
	be present.

``cpuinfo_max_freq``
	Maximum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_min_freq``
	Minimum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_transition_latency``
	The time it takes to switch the CPUs belonging to this policy from one
	P-state to another, in nanoseconds.

	If unknown or if known to be so high that the scaling driver does not
	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
	will be returned by reads from this attribute.

``related_cpus``
	List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
	List of ``CPUFreq`` scaling governors present in the kernel that can
	be attached to this policy or (if the |intel_pstate| scaling driver is
	in use) list of scaling algorithms provided by the driver that can be
	applied to this policy.

	[Note that some governors are modular and it may be necessary to load a
	kernel module for the governor held by it to become available and be
	listed by this attribute.]

``scaling_cur_freq``
	Current frequency of all of the CPUs belonging to this policy (in kHz).

	In the majority of cases, this is the frequency of the last P-state
	requested by the scaling driver from the hardware using the scaling
	interface provided by it, which may or may not reflect the frequency
	the CPU is actually running at (due to hardware design and other
	limitations).

	Some architectures (e.g. ``x86``) may attempt to provide information
	more precisely reflecting the current CPU frequency through this
	attribute, but that still may not be the exact current CPU frequency as
	seen by the hardware at the moment.

``scaling_driver``
	The scaling driver currently in use.

``scaling_governor``
	The scaling governor currently attached to this policy or (if the
	|intel_pstate| scaling driver is in use) the scaling algorithm
	provided by the driver that is currently applied to this policy.

	This attribute is read-write and writing to it will cause a new scaling
	governor to be attached to this policy or a new scaling algorithm
	provided by the scaling driver to be applied to it (in the
	|intel_pstate| case), as indicated by the string written to this
	attribute (which must be one of the names listed by the
	``scaling_available_governors`` attribute described above).

``scaling_max_freq``
	Maximum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing an
	integer to it will cause a new limit to be set (it must not be lower
	than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
	Minimum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing a
	non-negative integer to it will cause a new limit to be set (it must not
	be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
	This attribute is functional only if the `userspace`_ scaling governor
	is attached to the given policy.

	It returns the last frequency requested by the governor (in kHz) or can
	be written to in order to set a new frequency for the policy.
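
The constraints on the read-write attributes listed above can be checked in
user space before anything is written.  The two shell functions below are a
minimal sketch of that idea: ``set_governor`` and ``set_max_freq`` are
hypothetical helper names, and each takes the path of a policy directory as its
first argument.  Writing to the real attributes generally requires root
privileges::

	# Hypothetical helper: attach governor $2 to the policy in directory $1,
	# but only if scaling_available_governors lists it.
	set_governor() {
	    policy_dir=$1
	    wanted=$2
	    for gov in $(cat "$policy_dir/scaling_available_governors"); do
	        if [ "$gov" = "$wanted" ]; then
	            echo "$wanted" > "$policy_dir/scaling_governor"
	            return 0
	        fi
	    done
	    echo "governor '$wanted' is not listed for ${policy_dir##*/}" >&2
	    return 1
	}

	# Hypothetical helper: set scaling_max_freq to $2 (in kHz) for the policy
	# in directory $1, refusing values below the current scaling_min_freq.
	set_max_freq() {
	    policy_dir=$1
	    new_max=$2
	    cur_min=$(cat "$policy_dir/scaling_min_freq")
	    if [ "$new_max" -lt "$cur_min" ]; then
	        echo "$new_max kHz is below scaling_min_freq ($cur_min kHz)" >&2
	        return 1
	    fi
	    echo "$new_max" > "$policy_dir/scaling_max_freq"
	}

For example, ``set_governor /sys/devices/system/cpu/cpufreq/policy0 schedutil``
switches the governor only if ``schedutil`` is listed as available for that
policy.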


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers.  As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them.  Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use.  If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
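
Because the location of the tunables depends on the scaling driver, scripts
may want to probe both places.  The following sketch does that; the
``governor_tunables_dir`` name and its optional third argument (an alternative
``cpufreq`` directory, for experimenting away from ``sysfs``) are assumptions
made for this example::

	# Hypothetical helper: print the directory holding the tunables of
	# governor $2 for policy $1 (e.g. "policy0"), checking the per-policy
	# location first and the global one second.
	governor_tunables_dir() {
	    policy=$1
	    governor=$2
	    root="${3:-/sys/devices/system/cpu/cpufreq}"
	    if [ -d "$root/$policy/$governor" ]; then
	        echo "$root/$policy/$governor"
	    elif [ -d "$root/$governor" ]; then
	        echo "$root/$governor"
	    else
	        return 1
	    fi
	}

The function returns a non-zero status if neither directory exists, for
example because the governor exposes no tunables or is not in use.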

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself.  Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.
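
A minimal sketch of how user space might use this governor is shown below.
The ``set_userspace_speed`` name is hypothetical; the function takes a policy
directory and a frequency in kHz, and it declines to write anything unless
``scaling_governor`` currently reads ``userspace``::

	# Hypothetical helper: request frequency $2 (in kHz) for the policy in
	# directory $1 through scaling_setspeed.
	set_userspace_speed() {
	    policy_dir=$1
	    freq_khz=$2
	    if [ "$(cat "$policy_dir/scaling_governor")" != "userspace" ]; then
	        echo "scaling_setspeed needs the userspace governor" >&2
	        return 1
	    fi
	    echo "$freq_khz" > "$policy_dir/scaling_setspeed"
	}
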

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.  It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU.  If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism).  Then, the new
CPU frequency to apply is computed in accordance with the formula

	f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
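
To get a feel for the numbers, the formula can be evaluated with integer shell
arithmetic, writing the factor of 1.25 as ``f_0`` plus a quarter of ``f_0``.
This is only a worked example: the ``schedutil_next_freq`` name is made up
here, and the function does not apply the policy limits::

	# Hypothetical helper: evaluate f = 1.25 * f_0 * util / max.
	# Arguments: f_0 in kHz, util, max.
	schedutil_next_freq() {
	    f0_khz=$1
	    util=$2
	    max=$3
	    echo $(( (f0_khz + f0_khz / 4) * util / max ))
	}

With ``f_0`` equal to 2000000 kHz and ``util`` at half of ``max``,
``schedutil_next_freq 2000000 512 1024`` prints ``1250000``.  The result
reaches ``f_0`` when ``util`` is at about 80% of ``max``, which leaves some
headroom before the CPU is fully utilized.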

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
	Minimum time (in microseconds) that has to pass between two consecutive
	runs of governor computations (default: 1000 times the scaling driver's
	transition latency).

	The purpose of this tunable is to reduce the scheduler context overhead
	of the governor which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle.  The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
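
The arithmetic behind that estimate can be sketched as follows.  Both function
names are hypothetical and the load is expressed in percent; the kernel
obtains the elapsed and idle times from its own accounting, which is not
modeled here::

	# Hypothetical helper: load of one CPU in percent, given the time
	# elapsed between two samples ($1) and the part of it spent idle ($2),
	# both in the same unit.
	cpu_load_percent() {
	    elapsed=$1
	    idle=$2
	    echo $(( 100 * (elapsed - idle) / elapsed ))
	}

	# Hypothetical helper: load of a policy, i.e. the greatest of the
	# per-CPU load values passed as arguments.
	policy_load_percent() {
	    max_load=0
	    for load in "$@"; do
	        if [ "$load" -gt "$max_load" ]; then
	            max_load=$load
	        fi
	    done
	    echo "$max_load"
	}

For instance, a CPU that was idle for 2500 out of 10000 microseconds has a load
of 75 (``cpu_load_percent 10000 2500``), and a policy whose CPUs report loads
of 20, 75 and 40 is treated as having a load of 75.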

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary.  As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular.  Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
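
As a rough illustration of that selection rule, the sketch below maps a load
percentage linearly onto the range between ``cpuinfo_min_freq`` and
``cpuinfo_max_freq`` and jumps to ``scaling_max_freq`` above the threshold.
The function name and all of the numbers are hypothetical, and the sketch
deliberately ignores any further clamping to the policy limits.

.. code-block:: python

   def ondemand_target(load_pct, cpuinfo_min_freq, cpuinfo_max_freq,
                       scaling_max_freq, up_threshold):
       """Frequency (in kHz) selected for a load given in percent."""
       if load_pct > up_threshold:
           # Above the speedup threshold: go straight to the policy maximum.
           return scaling_max_freq
       span = cpuinfo_max_freq - cpuinfo_min_freq
       return cpuinfo_min_freq + load_pct * span // 100


   # Hypothetical hardware range 400 MHz .. 2 GHz, policy limit 1.8 GHz,
   # speedup threshold 80%.
   for load in (0, 50, 90):
       print(load, ondemand_target(load, 400_000, 2_000_000, 1_800_000, 80))

With these assumed values, loads of 0, 50 and 90 percent give 400000, 1200000
and 1800000 kHz, respectively.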

This governor exposes the following tunables:

``sampling_rate``
	This is how often the governor's worker routine should run, in
	microseconds.

	Typically, it is set to values of the order of 10000 (10 ms).  Its
	default value is equal to the value of ``cpuinfo_transition_latency``
	for each policy this governor is attached to (but since the unit here
	is 1000 times greater, this means that the time represented by
	``sampling_rate`` is 1000 times greater than the transition latency by
	default).

	If this tunable is per-policy, the following shell command sets the time
	represented by it to be 750 times as high as the transition latency::

		# echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate

``up_threshold``
	If the estimated CPU load is above this value (in percent), the governor
	will set the frequency to the maximum value allowed for the policy.
	Otherwise, the selected frequency will be proportional to the estimated
	CPU load.

``ignore_nice_load``
	If set to 1 (default 0), it will cause the CPU load estimation code to
	treat the CPU time spent on executing tasks with "nice" levels greater
	than 0 as CPU idle time.

	This may be useful if there are tasks in the system that should not be
	taken into account when deciding what frequency to run the CPUs at.
	Then, to make that happen it is sufficient to increase the "nice" level
	of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

	This causes the next execution of the governor's worker routine (after
	setting the frequency to the allowed maximum) to be delayed, so the
	frequency stays at the maximum level for a longer time.

	Frequency fluctuations in some bursty workloads may be avoided this way
	at the cost of additional energy spent on maintaining the maximum CPU
	capacity.

``powersave_bias``
	Reduction factor to apply to the original frequency target of the
	governor (including the maximum value used when the ``up_threshold``
	value is exceeded by the estimated CPU load) or sensitivity threshold
	for the AMD frequency sensitivity powersave bias driver
	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
	inclusive.

	If the AMD frequency sensitivity powersave bias driver is not loaded,
	the effective frequency to apply is given by

		f * (1 - ``powersave_bias`` / 1000)

	where f is the governor's original frequency target.  The default value
	of this attribute is 0 in that case.

	If the AMD frequency sensitivity powersave bias driver is loaded, the
	value of this attribute is 400 by default and it is used in a different
	way.

	On Family 16h (and later) AMD processors there is a mechanism to get a
	measured workload sensitivity, between 0 and 100% inclusive, from the
	hardware.  That value can be used to estimate how the performance of the
	workload running on a CPU will change in response to frequency changes.

	The performance of a workload with the sensitivity of 0 (memory-bound or
	IO-bound) is not expected to increase at all as a result of increasing
	the CPU frequency, whereas workloads with the sensitivity of 100%
	(CPU-bound) are expected to perform much better if the CPU frequency is
	increased.

	If the workload sensitivity is less than the threshold represented by
	the ``powersave_bias`` value, the sensitivity powersave bias driver
	will cause the governor to select a frequency lower than its original
	target, so as to avoid over-provisioning workloads that will not benefit
	from running at higher CPU frequencies.
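
The arithmetic behind two of the tunables above, ``powersave_bias`` (without
the AMD frequency sensitivity driver) and ``sampling_down_factor``, can be
sketched as follows.  This is only an illustration of the formulas given in
this document; the function names and the sample values are hypothetical.

.. code-block:: python

   def biased_target(freq_khz, powersave_bias):
       """Apply f * (1 - powersave_bias / 1000) using integer arithmetic."""
       if not 0 <= powersave_bias <= 1000:
           raise ValueError("powersave_bias must be between 0 and 1000")
       return freq_khz * (1000 - powersave_bias) // 1000


   def next_sampling_delay_us(sampling_rate_us, sampling_down_factor,
                              load_pct, up_threshold):
       """Delay before the next worker run, stretched while the load is high."""
       if not 1 <= sampling_down_factor <= 100:
           raise ValueError("sampling_down_factor must be between 1 and 100")
       if load_pct > up_threshold:
           return sampling_rate_us * sampling_down_factor
       return sampling_rate_us


   print(biased_target(2_000_000, 100))               # 1800000
   print(next_sampling_delay_us(10_000, 5, 95, 80))   # 50000

A ``powersave_bias`` of 100 therefore lowers a 2 GHz target by 10%, and a
``sampling_down_factor`` of 5 keeps the maximum frequency in place for five
sampling periods once the threshold has been exceeded.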

``conservative``
----------------

This governor uses CPU load as a CPU frequency selection metric.

It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.

Namely, it avoids changing the frequency significantly over short time
intervals, which may not be suitable for systems with limited power supply
capacity (e.g. battery-powered).  To achieve that, it changes the frequency in
relatively small steps, one step at a time, up or down, depending on whether or
not a (configurable) threshold has been exceeded by the estimated CPU load.

This governor exposes the following tunables:

``freq_step``
	Frequency step in percent of the maximum frequency the governor is
	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
	100 (5 by default).

	This is how much the frequency is allowed to change in one go.  Setting
	it to 0 will cause the default frequency step (5 percent) to be used
	and setting it to 100 effectively causes the governor to periodically
	switch the frequency between the ``scaling_min_freq`` and
	``scaling_max_freq`` policy limits.

``down_threshold``
	Threshold value (in percent, 20 by default) used to determine the
	frequency change direction.

	If the estimated CPU load is greater than this value, the frequency will
	go up (by ``freq_step``).  If the load is less than this value (and the
	``sampling_down_factor`` mechanism is not in effect), the frequency will
	go down.  Otherwise, the frequency will not be changed.

``sampling_down_factor``
	Frequency decrease deferral factor, between 1 (default) and 10
	inclusive.

	It effectively causes the frequency to go down ``sampling_down_factor``
	times slower than it ramps up.
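
The effect of ``freq_step`` can be illustrated with a small sketch of a single
step.  It only models the step size and the clamping to the policy limits as
described above; the direction of the change is passed in by the caller, and
the function name and the frequencies used are hypothetical.

.. code-block:: python

   DEFAULT_FREQ_STEP = 5  # percent, used when freq_step is set to 0


   def step_frequency(current, go_up, freq_step,
                      scaling_min_freq, scaling_max_freq):
       """Move the frequency (in kHz) by one step, within the policy limits."""
       if not 0 <= freq_step <= 100:
           raise ValueError("freq_step must be between 0 and 100")
       if freq_step == 0:
           freq_step = DEFAULT_FREQ_STEP
       # The step is a percentage of the maximum allowed frequency.
       step = scaling_max_freq * freq_step // 100
       target = current + step if go_up else current - step
       return max(scaling_min_freq, min(target, scaling_max_freq))


   print(step_frequency(1_000_000, True, 5, 400_000, 2_000_000))    # 1100000
   print(step_frequency(400_000, True, 100, 400_000, 2_000_000))    # 2000000

With ``freq_step`` set to 100, every step lands on one of the two policy
limits, which is why that setting makes the governor toggle between them.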

``interactive``
---------------

The ``interactive`` CPUFreq governor is designed for latency-sensitive,
interactive workloads.  It sets the CPU speed depending on usage, similarly to
the `ondemand`_ and `conservative`_ governors, but with a different set of
configurable behaviors.

The tunable values for this governor are:

``above_hispeed_delay``
        When the speed is at or above ``hispeed_freq``, wait for this long
        before raising the speed in response to continued high load.  The
        format is a single delay value, optionally followed by pairs of CPU
        speeds and the delay to use at or above those speeds.  Colons can be
        used between the speeds and the associated delays for readability.
        For example::

           80000 1300000:200000 1500000:40000

        uses a delay of 80000 us until the CPU speed reaches 1.3 GHz, at which
        speed a delay of 200000 us is used until the speed reaches 1.5 GHz, at
        which speed (and above) a delay of 40000 us is used.  If speeds are
        specified, they must appear in ascending order.  The default is
        20000 us.

``boost``
        If non-zero, immediately boost the speed of all CPUs to at least
        ``hispeed_freq`` until zero is written to this attribute.  If zero,
        allow CPU speeds to drop below ``hispeed_freq`` according to load as
        usual.  The default is zero.

``boostpulse``
        On each write, immediately boost the speed of all CPUs to
        ``hispeed_freq`` for at least the period of time specified by
        ``boostpulse_duration``, after which speeds are allowed to drop below
        ``hispeed_freq`` according to load as usual.  It is a write-only file.

``boostpulse_duration``
        Length of time to hold the CPU speed at ``hispeed_freq`` on a write to
        ``boostpulse``, before allowing the speed to drop according to load as
        usual.  The default is 80000 us.

``go_hispeed_load``
        The CPU load at which to ramp to ``hispeed_freq``.  The default is 99%.

``hispeed_freq``
        An intermediate "high speed" at which to initially ramp when the CPU
        load hits the value specified in ``go_hispeed_load``.  If the load
        stays high for the amount of time specified in
        ``above_hispeed_delay``, then the speed may be bumped higher.  The
        default is the maximum speed allowed by the policy at governor
        initialization time.

``io_is_busy``
        If set, the governor accounts IO time as CPU busy time.

``min_sample_time``
        The minimum amount of time to spend at the current
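
The ``above_hispeed_delay`` format described above can be made concrete with a
small parser.  This is a sketch written for illustration only, not the
governor's own code; the function names are hypothetical and the lookup rule
simply follows the worded example given for that tunable.

.. code-block:: python

   def parse_above_hispeed_delay(text):
       """Split 'delay [speed:delay ...]' into a base delay and pairs."""
       # Colons are optional separators, so treat them like whitespace.
       tokens = [int(tok) for tok in text.replace(":", " ").split()]
       if len(tokens) % 2 == 0:
           raise ValueError("expected one delay followed by speed/delay pairs")
       base_delay, rest = tokens[0], tokens[1:]
       pairs = list(zip(rest[0::2], rest[1::2]))
       speeds = [speed for speed, _ in pairs]
       if any(a >= b for a, b in zip(speeds, speeds[1:])):
           raise ValueError("speeds must appear in ascending order")
       return base_delay, pairs


   def delay_for_speed(parsed, speed):
       """Delay (in microseconds) that applies at the given CPU speed."""
       delay, pairs = parsed
       for threshold, pair_delay in pairs:
           if speed >= threshold:
               delay = pair_delay
       return delay


   parsed = parse_above_hispeed_delay("80000 1300000:200000 1500000:40000")
   print(delay_for_speed(parsed, 1_000_000))  # 80000
   print(delay_for_speed(parsed, 1_400_000))  # 200000
   print(delay_for_speed(parsed, 1_500_000))  # 40000

The three lookups reproduce the behavior of the example string: the base delay
below 1.3 GHz, the longer delay between 1.3 GHz and 1.5 GHz, and the shortest
delay at 1.5 GHz and above.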

Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.

Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors.  The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.

The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits).  If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system.  It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled.  This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case).  It does not mean that boosting
is actually in use at the moment on any CPUs in the system.  It only means a
permission to use the frequency boost mechanism (which still may never be used
for other reasons).

If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.

The only values that can be written to this file are 0 and 1.
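
A user-space helper for this file might look like the sketch below.  The
helper names are hypothetical, the path is the one given above, and the
absence of the file is reported as ``None`` because the file is not present on
every system.

.. code-block:: python

   from pathlib import Path

   BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


   def boost_enabled(path=BOOST_PATH):
       """Return True for 1, False for 0, or None if the file is absent."""
       path = Path(path)
       if not path.exists():
           return None
       return path.read_text().strip() == "1"


   def set_boost(enabled, path=BOOST_PATH):
       """Write 1 or 0, the only values the file accepts."""
       Path(path).write_text("1" if enabled else "0")

Writing to the real file typically requires root privileges, and a value of 1
read back from it only means that boosting is permitted, not that it is in use.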

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.

For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases.  For example:

  1. Boosting means overclocking the processor, although under controlled
     conditions.  Generally, the processor's energy consumption increases
     as a result of increasing its frequency and voltage, even temporarily.
     That may not be desirable on systems that switch to power sources of
     limited capacity, such as batteries, so the ability to disable the boost
     mechanism while the system is running may help there (but that depends on
     the workload too).

  2. In some situations deterministic behavior is more important than
     performance or energy consumption (or both) and the ability to disable
     boosting while the system is running may be useful then.

  3. To examine the impact of the frequency boost mechanism itself, it is useful
     to be able to run tests with and without boosting, preferably without
     restarting the system in the meantime.

  4. Reproducible results are important when running benchmarks.  Since
     the boosting functionality depends on the load of the whole package,
     single-thread performance may vary because of it, which may sometimes
     lead to unreproducible results.  That can be avoided by disabling the
     frequency boost mechanism before running benchmarks sensitive to that
     issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one.  It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.

If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which indicates a more fine-grained control interface.  The actual
implementation, however, works on a system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.

That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless.  Thus it is always possible to use the
``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).

The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.


References
==========

.. [1] Jonathan Corbet, *Per-entity load tracking*,
       https://lwn.net/Articles/531853/