.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology). As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state. Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).
In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar. To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into. Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis. The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling. It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware. They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver. That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used. Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms. That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs. That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as struct cpufreq_policy objects. For consistency,
struct cpufreq_policy is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
every CPU in the system, including CPUs that are currently offline. If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same struct cpufreq_policy object.

``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.
If CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take a note of all of the already registered CPUs during the registration of the
scaling driver. In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU. [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core. In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation. Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.
That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs). That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel command line or configuration, but it may be changed
later via ``sysfs``). First, a pointer to the new policy object is passed to
the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective). They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection. The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline. The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all. In that case, it only is
necessary to restart the scaling governor so that it can take the new online CPU
into account. That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.
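
The policy pointer bookkeeping described in this section can be modelled
roughly as follows. This is a toy Python sketch, not kernel code: the
``Policy`` and ``Core`` classes, the ``sharing_groups`` argument (standing in
for the CPU mask that the driver's ``->init()`` callback reports) and the
``governor_restarts`` counter are all hypothetical names introduced for
illustration, and the re-initialization of "inactive" policies is not modelled.

```python
class Policy:
    """Toy stand-in for struct cpufreq_policy (hypothetical model)."""

    def __init__(self, related_cpus):
        self.related_cpus = frozenset(related_cpus)  # online and offline CPUs
        self.online_cpus = set()
        self.governor_restarts = 0  # counts ->stop() + ->start() pairs


class Core:
    """Toy stand-in for the per-CPU policy pointers kept by the core."""

    def __init__(self, sharing_groups):
        # Assumption: each group lists the CPUs sharing one P-state control
        # interface, as the driver's ->init() callback would report them.
        self.sharing_groups = [frozenset(g) for g in sharing_groups]
        self.policy_of = {}  # CPU number -> Policy

    def cpu_online(self, cpu):
        policy = self.policy_of.get(cpu)
        if policy is None:
            # No policy pointer yet: create the policy and populate the
            # pointers of every CPU in the mask, online or not.
            group = next(g for g in self.sharing_groups if cpu in g)
            policy = Policy(group)
            for sibling in group:
                self.policy_of[sibling] = policy
        elif policy.online_cpus:
            # Policy already active: only the governor is restarted.
            policy.governor_restarts += 1
        policy.online_cpus.add(cpu)
        return policy


core = Core([{0, 1}, {2, 3}])
first = core.cpu_online(0)
second = core.cpu_online(1)
other = core.cpu_online(2)
print(first is second, first is other, first.governor_restarts)
```

In this model CPUs 0 and 1 end up pointing to one shared policy object, CPU 2
gets a separate one, and bringing CPU 1 online triggers a single governor
restart instead of a new policy creation.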

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy. These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy. The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).

Some of those attributes are generic. They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy. Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
	List of online CPUs belonging to this policy (i.e. sharing the hardware
	performance scaling interface represented by the ``policyX`` policy
	object).

``bios_limit``
	If the platform firmware (BIOS) tells the OS to apply an upper limit to
	CPU frequencies, that limit will be reported through this attribute (if
	present).

	The existence of the limit may be a result of some (often unintentional)
	BIOS settings, restrictions coming from a service processor or other
	BIOS/HW-based mechanisms.

	This does not cover ACPI thermal limitations which can be discovered
	through a generic thermal driver.

	This attribute is not present if the scaling driver in use does not
	support it.

``cpuinfo_cur_freq``
	Current frequency of the CPUs belonging to this policy as obtained from
	the hardware (in kHz).

	This is expected to be the frequency the hardware actually runs at.
	If that frequency cannot be determined, this attribute should not
	be present.

``cpuinfo_max_freq``
	Maximum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_min_freq``
	Minimum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_transition_latency``
	The time it takes to switch the CPUs belonging to this policy from one
	P-state to another, in nanoseconds.

	If unknown or if known to be so high that the scaling driver does not
	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
	will be returned by reads from this attribute.

``related_cpus``
	List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
	List of ``CPUFreq`` scaling governors present in the kernel that can
	be attached to this policy or (if the |intel_pstate| scaling driver is
	in use) list of scaling algorithms provided by the driver that can be
	applied to this policy.

	[Note that some governors are modular and it may be necessary to load a
	kernel module for the governor held by it to become available and be
	listed by this attribute.]

``scaling_cur_freq``
	Current frequency of all of the CPUs belonging to this policy (in kHz).

	In the majority of cases, this is the frequency of the last P-state
	requested by the scaling driver from the hardware using the scaling
	interface provided by it, which may or may not reflect the frequency
	the CPU is actually running at (due to hardware design and other
	limitations).

	Some architectures (e.g. ``x86``) may attempt to provide information
	more precisely reflecting the current CPU frequency through this
	attribute, but that still may not be the exact current CPU frequency as
	seen by the hardware at the moment.

``scaling_driver``
	The scaling driver currently in use.

``scaling_governor``
	The scaling governor currently attached to this policy or (if the
	|intel_pstate| scaling driver is in use) the scaling algorithm
	provided by the driver that is currently applied to this policy.

	This attribute is read-write and writing to it will cause a new scaling
	governor to be attached to this policy or a new scaling algorithm
	provided by the scaling driver to be applied to it (in the
	|intel_pstate| case), as indicated by the string written to this
	attribute (which must be one of the names listed by the
	``scaling_available_governors`` attribute described above).

``scaling_max_freq``
	Maximum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing an
	integer to it will cause a new limit to be set (it must not be lower
	than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
	Minimum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing a
	non-negative integer to it will cause a new limit to be set (it must not
	be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
	This attribute is functional only if the `userspace`_ scaling governor
	is attached to the given policy.

	It returns the last frequency requested by the governor (in kHz) or can
	be written to in order to set a new frequency for the policy.


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers. As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them. Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use. If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.
In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself. Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.
It generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU. If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism). Then, the new
CPU frequency to apply is computed in accordance with the formula

	f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
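
To make the arithmetic concrete, the formula above can be evaluated with a
few lines of Python. This is only a sketch of the formula itself, not of the
governor: the function name ``schedutil_next_freq`` and its parameter names
are hypothetical, the use of integer arithmetic is an assumption made here for
illustration, and no clamping to the policy limits is performed.

```python
def schedutil_next_freq(f_0, util, max_util):
    """Evaluate f = 1.25 * f_0 * util / max with integer arithmetic.

    f_0 is a frequency in kHz; util and max_util are in the same
    (arbitrary) utilization units.
    """
    if max_util <= 0:
        raise ValueError("max_util must be positive")
    # 1.25 * f_0 is written as f_0 + f_0 / 4 to stay within integers.
    return (f_0 + f_0 // 4) * util // max_util


# Frequency-invariant case: f_0 is the maximum possible frequency (2 GHz).
print(schedutil_next_freq(2_000_000, 800, 1000))  # utilization at 80% of max
print(schedutil_next_freq(2_000_000, 400, 1000))  # utilization at 40% of max
```

Because 1.25 * 0.8 = 1, the computed frequency equals ``f_0`` when ``util``
reaches 80% of ``max``, and it exceeds ``f_0`` for higher utilization values.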

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
	Minimum time (in microseconds) that has to pass between two consecutive
	runs of governor computations (default: 1000 times the scaling driver's
	transition latency).

	The purpose of this tunable is to reduce the scheduler context overhead
	of the governor which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle. The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary. As a result, the scheduler context overhead from this
governor is minimum, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular. Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).
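
The load estimate described above can be sketched in Python as follows. This
is an illustration of the ratio only, under stated assumptions: the
``cpu_load`` and ``policy_load`` helpers and the ``(wall_time, idle_time)``
sample format (cumulative counters in microseconds) are hypothetical and do
not correspond to any kernel interface.

```python
def cpu_load(prev, cur):
    """Load of one CPU between two (wall_time, idle_time) samples.

    Both samples hold cumulative counters in the same time unit.
    The result is the non-idle fraction of the elapsed time, in [0, 1].
    """
    wall = cur[0] - prev[0]
    idle = cur[1] - prev[1]
    if wall <= 0:
        return 0.0  # no time elapsed, nothing to estimate
    return max(0.0, min(1.0, (wall - idle) / wall))


def policy_load(prev_samples, cur_samples):
    """Greatest per-CPU load among the CPUs sharing one policy."""
    return max(cpu_load(prev_samples[cpu], cur_samples[cpu])
               for cpu in cur_samples)


# Two CPUs sharing a policy, sampled 10 ms apart.
prev = {0: (0, 0), 1: (0, 0)}
cur = {0: (10_000, 7_500), 1: (10_000, 2_000)}
print(cpu_load(prev[0], cur[0]), policy_load(prev, cur))
```

Here CPU 0 was idle for 7.5 ms out of 10 ms and CPU 1 for only 2 ms, so the
busier CPU 1 determines the load estimate for the whole policy.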

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).

This governor exposes the following tunables:

``sampling_rate``
	This is how often the governor's worker routine should run, in
	microseconds.

	Typically, it is set to values of the order of 10000 (10 ms). Its
	default value is equal to the value of ``cpuinfo_transition_latency``
	for each policy this governor is attached to (but since the unit here
	is greater by 1000, this means that the time represented by
	``sampling_rate`` is 1000 times greater than the transition latency by
	default).

	If this tunable is per-policy, the following shell command sets the time
	represented by it to be 750 times as high as the transition latency::

		# echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate

``up_threshold``
	If the estimated CPU load is above this value (in percent), the governor
	will set the frequency to the maximum value allowed for the policy.
	Otherwise, the selected frequency will be proportional to the estimated
	CPU load.

``ignore_nice_load``
	If set to 1 (default 0), it will cause the CPU load estimation code to
	treat the CPU time spent on executing tasks with "nice" levels greater
	than 0 as CPU idle time.

	This may be useful if there are tasks in the system that should not be
	taken into account when deciding what frequency to run the CPUs at.
	Then, to make that happen it is sufficient to increase the "nice" level
	of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

	This causes the next execution of the governor's worker routine (after
	setting the frequency to the allowed maximum) to be delayed, so the
	frequency stays at the maximum level for a longer time.

	Frequency fluctuations in some bursty workloads may be avoided this way
	at the cost of additional energy spent on maintaining the maximum CPU
	capacity.

``powersave_bias``
	Reduction factor to apply to the original frequency target of the
	governor (including the maximum value used when the ``up_threshold``
	value is exceeded by the estimated CPU load) or sensitivity threshold
	for the AMD frequency sensitivity powersave bias driver
	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
	inclusive.

	If the AMD frequency sensitivity powersave bias driver is not loaded,
	the effective frequency to apply is given by

		f * (1 - ``powersave_bias`` / 1000)

	where f is the governor's original frequency target.  The default value
	of this attribute is 0 in that case.

	If the AMD frequency sensitivity powersave bias driver is loaded, the
	value of this attribute is 400 by default and it is used in a different
	way.

	On Family 16h (and later) AMD processors there is a mechanism to get a
	measured workload sensitivity, between 0 and 100% inclusive, from the
	hardware.  That value can be used to estimate how the performance of the
	workload running on a CPU will change in response to frequency changes.

	The performance of a workload with the sensitivity of 0 (memory-bound or
	IO-bound) is not expected to increase at all as a result of increasing
	the CPU frequency, whereas workloads with the sensitivity of 100%
	(CPU-bound) are expected to perform much better if the CPU frequency is
	increased.

	If the workload sensitivity is less than the threshold represented by
	the ``powersave_bias`` value, the sensitivity powersave bias driver
	will cause the governor to select a frequency lower than its original
	target, so as to avoid over-provisioning workloads that will not benefit
	from running at higher CPU frequencies.

``conservative``
----------------

This governor uses CPU load as a CPU frequency selection metric.

It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.

Namely, it avoids changing the frequency significantly over short time
intervals, which may not be suitable for systems with limited power supply
capacity (e.g. battery-powered).  To achieve that, it changes the frequency in
relatively small steps, one step at a time, up or down, depending on whether or
not a (configurable) threshold has been exceeded by the estimated CPU load.
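
The stepping behavior can be sketched roughly as follows.  This is a
simplified illustration and not the kernel's implementation: the function
name ``conservative_next_freq`` and all of its parameters are hypothetical, a
single threshold decides the direction (as in the paragraph above), the step
is taken as a percentage of the maximum allowed frequency, and any deferral
of frequency decreases is ignored.

```python
def conservative_next_freq(cur_khz, load_pct, threshold_pct,
                           min_khz, max_khz, freq_step_pct=5):
    """Return the next frequency, moving at most one step from cur_khz.

    All names are assumptions for this sketch; frequencies are in kHz
    and loads in percent.
    """
    # One step is a fixed share of the maximum allowed frequency.
    step_khz = max_khz * freq_step_pct // 100

    if load_pct > threshold_pct:
        target = cur_khz + step_khz      # load above threshold: one step up
    elif load_pct < threshold_pct:
        target = cur_khz - step_khz      # load below threshold: one step down
    else:
        target = cur_khz                 # exactly at threshold: no change

    # Never leave the range allowed by the policy limits.
    return max(min_khz, min(max_khz, target))


# With a 2 GHz maximum and a 5% step, each change is 100 MHz.
print(conservative_next_freq(1_000_000, 90, 20, 800_000, 2_000_000))
print(conservative_next_freq(1_000_000, 10, 20, 800_000, 2_000_000))
```

Because each invocation moves the frequency by one step only, a sudden load
spike takes several sampling periods to reach the maximum frequency, which is
the trade-off this governor makes in exchange for smoother power draw.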

This governor exposes the following tunables:

``freq_step``
	Frequency step in percent of the maximum frequency the governor is
	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
	100 (5 by default).

	This is how much the frequency is allowed to change in one go.  Setting
	it to 0 will cause the default frequency step (5 percent) to be used
	and setting it to 100 effectively causes the governor to periodically
	switch the frequency between the ``scaling_min_freq`` and
	``scaling_max_freq`` policy limits.

``down_threshold``
	Threshold value (in percent, 20 by default) used to determine the
	frequency change direction.

	If the estimated CPU load is greater than this value, the frequency will
	go up (by ``freq_step``).  If the load is less than this value (and the
	``sampling_down_factor`` mechanism is not in effect), the frequency will
	go down.  Otherwise, the frequency will not be changed.

``sampling_down_factor``
	Frequency decrease deferral factor, between 1 (default) and 10
	inclusive.

	It effectively causes the frequency to go down ``sampling_down_factor``
	times slower than it ramps up.

``interactive``
---------------

The ``interactive`` ``CPUFreq`` governor is designed for latency-sensitive,
interactive workloads.
This governor sets the CPU speed depending on
usage, similar to the ``ondemand`` and ``conservative`` governors, but with a
different set of configurable behaviors.

The tunable values for this governor are:

``above_hispeed_delay``
	When speed is at or above ``hispeed_freq``, wait for
	this long before raising speed in response to continued high load.
	The format is a single delay value, optionally followed by pairs of
	CPU speeds and the delay to use at or above those speeds.  Colons can
	be used between the speeds and associated delays for readability.  For
	example:

		80000 1300000:200000 1500000:40000

	uses delay 80000 us until CPU speed 1.3 GHz, at which speed delay
	200000 us is used until speed 1.5 GHz, at which speed (and above)
	delay 40000 us is used.  If speeds are specified, they must appear in
	ascending order.  Default is 20000 us.

``boost``
	If non-zero, immediately boost speed of all CPUs to at least
	``hispeed_freq`` until zero is written to this attribute.  If zero, allow
	CPU speeds to drop below ``hispeed_freq`` according to load as usual.
	Default is zero.

``boostpulse``
	On each write, immediately boost speed of all CPUs to
	``hispeed_freq`` for at least the period of time specified by
	``boostpulse_duration``, after which speeds are allowed to drop below
	``hispeed_freq`` according to load as usual.  It is a write-only file.

``boostpulse_duration``
	Length of time to hold CPU speed at ``hispeed_freq``
	on a write to ``boostpulse``, before allowing speed to drop according to
	load as usual.  Default is 80000 us.

``go_hispeed_load``
	The CPU load at which to ramp to ``hispeed_freq``.
	Default is 99%.

``hispeed_freq``
	An intermediate "high speed" at which to initially ramp
	when CPU load hits the value specified in ``go_hispeed_load``.  If load
	stays high for the amount of time specified in ``above_hispeed_delay``,
	then speed may be bumped higher.  Default is the maximum speed allowed
	by the policy at governor initialization time.

``io_is_busy``
	If set, the governor accounts IO time as CPU busy time.

``min_sample_time``
	The minimum amount of time to spend at the current

Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.

Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors.  The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.

The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits).  If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system.  It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled.  This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case).
It does not mean that boosting
is actually in use at the moment on any CPUs in the system.  It only means a
permission to use the frequency boost mechanism (which still may never be used
for other reasons).

If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.

The only values that can be written to this file are 0 and 1.

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.

For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases.  For example:

 1. Boosting means overclocking the processor, although under controlled
    conditions.  Generally, the processor's energy consumption increases
    as a result of increasing its frequency and voltage, even temporarily.
    That may not be desirable on systems that switch to power sources of
    limited capacity, such as batteries, so the ability to disable the boost
    mechanism while the system is running may help there (but that depends on
    the workload too).

 2. In some situations deterministic behavior is more important than
    performance or energy consumption (or both) and the ability to disable
    boosting while the system is running may be useful then.

 3. To examine the impact of the frequency boost mechanism itself, it is useful
    to be able to run tests with and without boosting, preferably without
    restarting the system in the meantime.

 4. Reproducible results are important when running benchmarks.  Since
    the boosting functionality depends on the load of the whole package,
    single-thread performance may vary because of it, which may lead to
    unreproducible results sometimes.  That can be avoided by disabling the
    frequency boost mechanism before running benchmarks sensitive to that
    issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one.  It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.

If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which indicates a more fine-grained control interface.  The actual
implementation, however, works on the system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.

That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless.  Thus it is always possible to use the
``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).

The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.


References
==========

.. [1] Jonathan Corbet, *Per-entity load tracking*,
       https://lwn.net/Articles/531853/