====================
PCI Power Management
====================

Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

An overview of concepts and the Linux kernel's interfaces related to PCI power
management. Based on previous work by Patrick Mochel <mochel@transmeta.com>
(and others).

This document only covers the aspects of power management specific to PCI
devices. For a general description of the kernel's interfaces related to device
power management refer to Documentation/driver-api/pm/devices.rst and
Documentation/power/runtime_pm.rst.

.. contents:

   1. Hardware and Platform Support for PCI Power Management
   2. PCI Subsystem and Device Power Management
   3. PCI Device Drivers and Power Management
   4. Resources


1. Hardware and Platform Support for PCI Power Management
=========================================================

1.1. Native and Platform-Based Power Management
-----------------------------------------------

In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive.
However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state). This may happen when there are some data for the device to handle or
as a result of an external event requiring the device to be active, which may
be signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS. In the first
approach, which is referred to as native PCI power management (native PCI PM)
in what follows, the device power state is changed as a result of writing a
specific value into one of its standard configuration registers. The second
approach requires the platform firmware to provide special methods that may be
used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active. After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state. However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals. In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management
--------------------------------

The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices. If a device supports the PCI PM
Spec, it has an 8-byte power management capability field in its PCI
configuration space.
This field is used to describe and control the standard
features related to the native PCI power management.

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3). The higher the number, the less power is drawn by the device or bus
in that state. However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification. The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it. The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them. It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec. In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0. The support for the D1 and D2 power states is optional.
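
The support rules above can be illustrated with a small userspace sketch. It
derives the set of D-states a device can be programmed into from the 16-bit
Power Management Capabilities (PMC) word of its PM capability. The two bit
masks are assumed to mirror PCI_PM_CAP_D1 and PCI_PM_CAP_D2 from the kernel's
pci_regs.h and should be checked against that header; all ``sketch_`` and
``SKETCH_`` names are hypothetical and are not kernel interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same bit positions as PCI_PM_CAP_D1/PCI_PM_CAP_D2 in pci_regs.h. */
#define SKETCH_PM_CAP_D1 0x0200u
#define SKETCH_PM_CAP_D2 0x0400u

/* Hypothetical bitmask of D-states that software can program. */
enum sketch_dstate_bit {
	SKETCH_D0    = 1 << 0,
	SKETCH_D1    = 1 << 1,
	SKETCH_D2    = 1 << 2,
	SKETCH_D3HOT = 1 << 3,
};

/*
 * Return the D-states a device can be programmed into.
 * has_pm_cap: whether the device implements the PCI PM Spec at all.
 * pmc:        the PMC word read from the PM capability (ignored otherwise).
 * D3cold is left out on purpose: it is entered by removing Vcc, not by
 * programming the device.
 */
unsigned int sketch_programmable_states(bool has_pm_cap, uint16_t pmc)
{
	unsigned int states = SKETCH_D0;	/* every device supports D0 */

	if (!has_pm_cap)
		return states;

	states |= SKETCH_D3HOT;			/* mandatory with the PCI PM Spec */
	if (pmc & SKETCH_PM_CAP_D1)
		states |= SKETCH_D1;		/* optional */
	if (pmc & SKETCH_PM_CAP_D2)
		states |= SKETCH_D2;		/* optional */
	return states;
}
```

In the kernel itself this information ends up in the d1_support and d2_support
fields of struct pci_dev described in Section 2.1.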

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold). While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled. This allows the device to be
programmatically put into D0. Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

+---------------+------------+
| Current State | New State  |
+===============+============+
| D0            | D1, D2, D3 |
+---------------+------------+
| D1            | D2, D3     |
+---------------+------------+
| D2            | D3         |
+---------------+------------+
| D1, D2, D3    | D0         |
+---------------+------------+

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored). In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.
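
The software-controlled transitions listed in the table above can be written
as a single predicate. The following is a minimal userspace sketch, not kernel
code: the enum and the function name are hypothetical, D3 stands for D3hot, and
the check deliberately ignores whether the device actually supports D1 or D2.

```c
#include <stdbool.h>

/* Hypothetical names; SK_D3HOT corresponds to "D3" in the table. */
enum sketch_pci_state { SK_D0 = 0, SK_D1 = 1, SK_D2 = 2, SK_D3HOT = 3 };

/*
 * Return true if the table lists "from -> to" as a possible transition.
 * D3cold is not modeled: entering or leaving it depends on Vcc, not on
 * programming the device.
 */
bool sketch_transition_allowed(enum sketch_pci_state from,
			       enum sketch_pci_state to)
{
	if (from == to)
		return false;	/* the table has no row for staying put */
	if (to == SK_D0)
		return true;	/* D1, D2, D3 -> D0 */
	return to > from;	/* otherwise only deeper states are listed */
}
```

For example, D2 to D1 is rejected: per the table, a device in D2 has to go
through D0 before it can be placed in D1.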

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in any power state (D0-D3), but they are not required to be capable
of generating PMEs from all supported power states. In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management
---------------------------------

The platform firmware support for the power management of PCI devices is
system-specific. However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state. These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS. The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses.
This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, that are not
associated with any particular devices, and device control methods, that have
to be defined separately for each device supposed to be handled with the help of
the platform. This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance. The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI). Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state. These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device.
In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx. Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods). If the current power state of the device is D3, it can
only be put into D0 this way.

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state. ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0. In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system. The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state. It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.
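
The way _SxD and _SxW bound the device state for a given system sleep state
can be sketched as follows. This is a simplified userspace model under stated
assumptions, not the kernel's implementation: the function name and its
parameters are hypothetical, both values are assumed to have been obtained
already by evaluating the control methods, and the handling of missing methods
or inconsistent firmware in the real code may differ.

```c
#include <stdbool.h>

#define SKETCH_ACPI_D3 3	/* deepest ACPI device power state */

/*
 * sxd:         value from _SxD, the highest-power (lowest number) state the
 *              device may be in while the system is in Sx.
 * sxw:         value from _SxW, the lowest-power (highest number) state from
 *              which the device can still wake up the system.
 * need_wakeup: whether the device is required to wake up the system.
 *
 * Returns the deepest usable D-state number, or -1 if the reported range is
 * empty or out of bounds.
 */
int sketch_acpi_target_state(int sxd, int sxw, bool need_wakeup)
{
	int d_min = sxd;
	int d_max = need_wakeup ? sxw : SKETCH_ACPI_D3;

	if (d_min < 0 || d_max > SKETCH_ACPI_D3 || d_min > d_max)
		return -1;
	return d_max;
}
```

With the state number chosen, the sequence described earlier in this section
applies: _DSW (or _PSW) first if wakeup is needed, then _ON for the required
power resources and _PSx for the device, and finally _OFF for resources that
are no longer needed.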

1.4. Wakeup Signaling
---------------------

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate. If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them. In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon. Every GPE is associated with one or more sources of potentially
interesting events. In particular, a GPE may be associated with a PCI device
capable of signaling wakeup. The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.
The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later). The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup). The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices.
Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports. For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex. Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified. [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]

In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers. The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents.
Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------

The PCI Subsystem participates in the power management of PCI devices in a
number of ways. First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks::

  const struct dev_pm_ops pci_dev_pm_ops = {
        .prepare = pci_pm_prepare,
        .complete = pci_pm_complete,
        .suspend = pci_pm_suspend,
        .resume = pci_pm_resume,
        .freeze = pci_pm_freeze,
        .thaw = pci_pm_thaw,
        .poweroff = pci_pm_poweroff,
        .restore = pci_pm_restore,
        .suspend_noirq = pci_pm_suspend_noirq,
        .resume_noirq = pci_pm_resume_noirq,
        .freeze_noirq = pci_pm_freeze_noirq,
        .thaw_noirq = pci_pm_thaw_noirq,
        .poweroff_noirq = pci_pm_poweroff_noirq,
        .restore_noirq = pci_pm_restore_noirq,
        .runtime_suspend = pci_pm_runtime_suspend,
        .runtime_resume = pci_pm_runtime_resume,
        .runtime_idle = pci_pm_runtime_idle,
  };

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers. They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on::

  struct pci_dev {
        ...
        pci_power_t     current_state;  /* Current operating state. */
        int             pm_cap;         /* PM capability offset in the
                                           configuration space */
        unsigned int    pme_support:5;  /* Bitmask of states from which PME#
                                           can be generated */
        unsigned int    pme_interrupt:1;/* Is native PCIe PME signaling used? */
        unsigned int    d1_support:1;   /* Low power state D1 is supported */
        unsigned int    d2_support:1;   /* Low power state D2 is supported */
        unsigned int    no_d1d2:1;      /* D1 and D2 are forbidden */
        unsigned int    wakeup_prepared:1;  /* Device prepared for wake up */
        unsigned int    d3hot_delay;    /* D3hot->D0 transition time in ms */
        ...
  };

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------

The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose. This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object. Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs. The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS. If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.
For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management
------------------------------------

The PCI subsystem plays a vital role in the runtime power management of PCI
devices. For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.rst.
Namely, it provides subsystem-level callbacks::

        pci_pm_runtime_suspend()
        pci_pm_runtime_resume()
        pci_pm_runtime_idle()

that are executed by the core runtime PM routines. It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job. For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action.
If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup. The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware. To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.

It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state. The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below).
However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers. Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations. First, it may be called at the request of the device's driver, for
example if there are some data for it to process. Second, it may be called
as a result of a wakeup signal from the device itself (this sometimes is
referred to as "remote wakeup"). Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
cases it is expected to suspend the device if that makes sense.
Usually,
however, the PCI subsystem doesn't know whether the device really can be
suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.

2.4. System-Wide Power Transitions
----------------------------------

There are a few different types of system-wide power transitions, described in
Documentation/driver-api/pm/devices.rst. Each of them requires devices to be
handled in a specific way and the PM core executes subsystem-level power
management callbacks for this purpose. They are executed in phases such that
each phase involves executing the same subsystem-level callback for every device
belonging to the given subsystem before the next phase begins. These phases
always run after tasks have been frozen.

2.4.1. System Suspend
^^^^^^^^^^^^^^^^^^^^^

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

        prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases::

        pci_pm_prepare()
        pci_pm_suspend()
        pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume(). Then, it executes the device
driver's pm->prepare() callback if defined (i.e.
if the driver's struct dev_pm_ops object is present and the prepare pointer in
that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned. Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine). Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running.
It first checks if the device's driver
implements legacy PCI suspend routines (Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that). Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success. Otherwise the device driver's pm->suspend_noirq() callback is
executed, if present, and its result is returned if it fails. Next, if the
device's standard configuration registers haven't been saved yet (one of the
device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
saves them, prepares the device to signal wakeup (if necessary) and puts it into
a low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state. Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states.
However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose). PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.

2.4.2. System Resume
^^^^^^^^^^^^^^^^^^^^

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

	resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases::

	pci_pm_resume_noirq()
	pci_pm_resume()
	pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary.
This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended). If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned. Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
only is necessary in the error path during a failing suspend). Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned. Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).
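
The order of the steps taken by pci_pm_resume() can be sketched as a small
userspace model. This is only an illustration of the description above, not the
kernel's code: struct model_dev, its fields and model_pci_pm_resume() are
hypothetical names, and returning 0 when a callback is absent is an assumption
made for the model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these types are the kernel's. */
struct model_dev {
	bool state_restored;	/* standard config registers restored? */
	bool has_legacy_pm;	/* driver uses legacy PCI PM callbacks */
	bool wakeup_blocked;	/* set once wakeup signaling is blocked */
	int (*legacy_resume)(struct model_dev *dev);
	int (*pm_resume)(struct model_dev *dev);	/* stands in for pm->resume() */
};

static int model_pci_pm_resume(struct model_dev *dev)
{
	/* Only needed in the error path of a failing suspend. */
	if (!dev->state_restored)
		dev->state_restored = true;

	/* Resume hardware quirks would be applied at this point. */

	/* Legacy drivers: run the legacy resume callback and return its result. */
	if (dev->has_legacy_pm)
		return dev->legacy_resume ? dev->legacy_resume(dev) : 0;

	/* Otherwise block wakeup signaling, then run the driver's callback. */
	dev->wakeup_blocked = true;
	return dev->pm_resume ? dev->pm_resume(dev) : 0;
}
```

In this model a legacy driver never reaches the wakeup-blocking step, which
mirrors the "Otherwise" in the description above.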

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation
^^^^^^^^^^^^^^^^^^^^^^^^^

System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium. The
image is created atomically and all devices are quiesced, or frozen, before that
happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system RAM
to be free) in the following three phases:

	prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks::

	pci_pm_prepare()
	pci_pm_freeze()
	pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.
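
The phase-to-callback correspondence described so far can be written down as
lookup tables. The following is a userspace sketch for reference only: struct
phase_map and lookup_callback() are hypothetical helpers, and the strings merely
repeat the phase and callback names listed above.

```c
#include <stddef.h>
#include <string.h>

/* One row per phase: phase name and the PCI bus type callback used for it. */
struct phase_map {
	const char *phase;
	const char *pci_callback;
};

static const struct phase_map suspend_phases[] = {
	{ "prepare",       "pci_pm_prepare" },
	{ "suspend",       "pci_pm_suspend" },
	{ "suspend_noirq", "pci_pm_suspend_noirq" },
};

static const struct phase_map resume_phases[] = {
	{ "resume_noirq", "pci_pm_resume_noirq" },
	{ "resume",       "pci_pm_resume" },
	{ "complete",     "pci_pm_complete" },
};

static const struct phase_map freeze_phases[] = {
	{ "prepare",      "pci_pm_prepare" },
	{ "freeze",       "pci_pm_freeze" },
	{ "freeze_noirq", "pci_pm_freeze_noirq" },
};

/* Return the callback name for a phase, or NULL if the phase is not listed. */
static const char *lookup_callback(const struct phase_map *map, size_t n,
				   const char *phase)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(map[i].phase, phase) == 0)
			return map[i].pci_callback;
	return NULL;
}
```

Note that the prepare row is identical in the suspend and freeze tables, which
is the point made in the paragraph above.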

The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
and it doesn't apply the suspend-related hardware quirks. It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the
device for signaling wakeup and put it into a low-power state. Still, it saves
the device's standard configuration registers if they haven't been saved by one
of the driver's callbacks.

Once the image has been created, it has to be saved. However, at this point all
devices are frozen and they cannot handle I/O, while their ability to handle
I/O is obviously necessary for the image saving. Thus they have to be brought
back to the fully functional state and this is done in the following phases:

	thaw_noirq, thaw, complete

using the following PCI bus type's callbacks::

	pci_pm_thaw_noirq()
	pci_pm_thaw()
	pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq().
It puts the device into the full-power state and restores its standard
configuration registers. It also executes the device driver's pm->thaw_noirq()
callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
driver's pm->thaw() callback instead of pm->resume(). It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can
enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in
three phases:

	prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend. The other
two phases are analogous to the suspend and suspend_noirq phases, respectively.
The PCI subsystem-level callbacks they correspond to::

	pci_pm_poweroff()
	pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
although they don't attempt to save the device's standard configuration
registers.

2.4.4. System Restore
^^^^^^^^^^^^^^^^^^^^^

System restore requires a hibernation image to be loaded into memory and the
pre-hibernation memory contents to be restored before the pre-hibernation system
activity can be resumed.

As described in Documentation/driver-api/pm/devices.rst, the hibernation image
is loaded into memory by a fresh instance of the kernel, called the boot kernel,
which in turn is loaded and run by a boot loader in the usual way. After the
boot kernel has loaded the image, it needs to replace its own code and data with
the code and data of the "hibernated" kernel stored within the image, called the
image kernel. For this purpose all devices are frozen just like before creating
the image during hibernation, in the

	prepare, freeze, freeze_noirq

phases described above. However, the devices affected by these phases are only
those having drivers in the boot kernel; other devices will still be in whatever
state the boot loader left them.

Should the restoration of the pre-hibernation memory contents fail, the boot
kernel would go through the "thawing" procedure described above, using the
thaw_noirq, thaw, and complete phases (that will only affect the devices having
drivers in the boot kernel), and then continue running normally.

If the pre-hibernation memory contents are restored successfully, which is the
usual situation, control is passed to the image kernel, which then becomes
responsible for bringing the system back to the working state. To achieve this,
it must restore the devices' pre-hibernation functionality, which is done much
like waking up from the memory sleep state, although it involves different
phases:

	restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases
described above, respectively, and correspond to the following PCI subsystem
callbacks::

	pci_pm_restore_noirq()
	pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
respectively, but they execute the device driver's pm->restore_noirq() and
pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system
resume.


3. PCI Device Drivers and Power Management
==========================================

3.1. Power Management Callbacks
-------------------------------

PCI device drivers participate in power management by providing callbacks to be
executed by the PCI subsystem's power management routines described above and by
controlling the runtime power management of their devices.

At the time of this writing there are two ways to define power management
callbacks for a PCI device driver, the recommended one, based on using a
dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and
the "legacy" one, in which the .suspend() and .resume() callbacks from struct
pci_driver are used. The legacy approach, however, doesn't allow one to define
runtime power management callbacks and is not really suitable for any new
drivers. Therefore it is not covered by this document (refer to the source code
to learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops object
containing pointers to power management (PM) callbacks that will be executed by
the PCI subsystem's PM routines in various circumstances. A pointer to the
driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
its struct pci_driver object. Once that has happened, the "legacy" PM callbacks
in struct pci_driver are ignored (even if they are not NULL).

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
defined (i.e.
the respective fields of struct dev_pm_ops are unset) the PCI
subsystem will handle the device in a simplified default manner. If they are
defined, though, they are expected to behave as described in the following
subsections.

3.1.1. prepare()
^^^^^^^^^^^^^^^^

The prepare() callback is executed during system suspend, during hibernation
(when a hibernation image is about to be created), during power-off after
saving a hibernation image and during system restore, when a hibernation image
has just been loaded into memory.

This callback is only necessary if the driver's device has children that in
general may be registered at any time. In that case the role of the prepare()
callback is to prevent new children of the device from being registered until
one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.

In addition to that the prepare() callback may carry out some operations
preparing the device to be suspended, although it should not allocate memory
(if additional memory is required to suspend the device, it has to be
preallocated earlier, for example in a suspend/hibernate notifier as described
in Documentation/driver-api/pm/notifiers.rst).

3.1.2. suspend()
^^^^^^^^^^^^^^^^

The suspend() callback is only executed during system suspend, after prepare()
callbacks have been executed for all devices in the system.
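
The "phase by phase" ordering that this relies on can be illustrated with a
small userspace model. It is a sketch under simplifying assumptions: enum phase,
struct log_entry and run_system_suspend() are hypothetical names, devices are
numbered 0..ndev-1, and the devices are visited sequentially although the PCI
subsystem may run the suspend phase asynchronously for independent devices.

```c
#include <stddef.h>

enum phase { PREPARE, SUSPEND, SUSPEND_NOIRQ, NPHASES };

struct log_entry {
	enum phase phase;	/* which phase was run */
	int dev;		/* for which (model) device */
};

/*
 * Run every phase for all devices before moving on to the next phase and
 * record the order of the calls.  Returns the number of entries written.
 */
static size_t run_system_suspend(int ndev, struct log_entry *log, size_t cap)
{
	size_t n = 0;

	for (int p = PREPARE; p < NPHASES; p++) {
		for (int d = 0; d < ndev; d++) {
			if (n == cap)
				return n;
			log[n].phase = (enum phase)p;
			log[n].dev = d;
			n++;
		}
	}
	return n;
}
```

With two devices the recorded order is prepare for both devices, then suspend
for both, then suspend_noirq for both, so no suspend() call can precede any
prepare() call.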

This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem. It is not required (in fact, it is not
even recommended) that a PCI driver's suspend() callback save the standard
configuration registers of the device, prepare it for waking up the system, or
put it into a low-power state. All of these operations can very well be taken
care of by the PCI subsystem, without the driver's participation.

However, in some rare cases it is convenient to carry out these operations in
a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard configuration
registers, to prepare it for system wakeup (if necessary), and to put it into a
low-power state, respectively. Moreover, if the driver calls pci_save_state(),
the PCI subsystem will not execute either pci_prepare_to_sleep(), or
pci_set_power_state() for its device, so the driver is then responsible for
handling the device as appropriate.

While the suspend() callback is being executed, the driver's interrupt handler
can be invoked to handle an interrupt from the device, so all suspend-related
operations relying on the driver's ability to handle interrupts should be
carried out in this callback.

3.1.3. suspend_noirq()
^^^^^^^^^^^^^^^^^^^^^^

The suspend_noirq() callback is only executed during system suspend, after
suspend() callbacks have been executed for all devices in the system and
after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver's
interrupt handler will not be invoked while suspend_noirq() is running. Thus
suspend_noirq() can carry out operations that would cause race conditions to
arise if they were performed in suspend().

3.1.4. freeze()
^^^^^^^^^^^^^^^

The freeze() callback is hibernation-specific and is executed in two situations,
during hibernation, after prepare() callbacks have been executed for all devices
in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory from persistent storage and the
prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback
described above. In fact, they only need to be different in the rare cases when
the driver takes the responsibility for putting the device into a low-power
state.

In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state. Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().
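
For a driver that does take over these operations, the difference between its
suspend() and freeze() callbacks can be sketched as follows. This is a userspace
model, not driver code: the mock_*() helpers are hypothetical stand-ins that
only record that the corresponding step (pci_save_state(),
pci_prepare_to_sleep(), pci_set_power_state()) would have been carried out.

```c
/* Bits recording which steps the model driver has carried out. */
enum {
	DID_SAVE_STATE       = 1 << 0,
	DID_PREPARE_TO_SLEEP = 1 << 1,
	DID_SET_POWER_STATE  = 1 << 2,
};

struct model_dev {
	unsigned int done;
};

/* Hypothetical stand-ins for the PCI helpers named in the text. */
static void mock_save_state(struct model_dev *d)       { d->done |= DID_SAVE_STATE; }
static void mock_prepare_to_sleep(struct model_dev *d) { d->done |= DID_PREPARE_TO_SLEEP; }
static void mock_set_power_state(struct model_dev *d)  { d->done |= DID_SET_POWER_STATE; }

/* suspend(): save registers, prepare for wakeup, enter a low-power state. */
static int model_suspend(struct model_dev *d)
{
	mock_save_state(d);
	mock_prepare_to_sleep(d);
	mock_set_power_state(d);
	return 0;
}

/* freeze(): save registers only; no wakeup setup, no low-power state. */
static int model_freeze(struct model_dev *d)
{
	mock_save_state(d);
	return 0;
}
```

Drivers that leave these steps to the PCI subsystem, which is the recommended
approach, do not need this distinction and can use one routine for both
callbacks.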

3.1.5. freeze_noirq()
^^^^^^^^^^^^^^^^^^^^^

The freeze_noirq() callback is hibernation-specific. It is executed during
hibernation, after prepare() and freeze() callbacks have been executed for all
devices in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory and after prepare() and
freeze() callbacks have been executed for all devices. It is always executed
after device interrupts have been disabled by the PM core.

The role of this callback is analogous to the role of the suspend_noirq()
callback described above and it very rarely is necessary to define
freeze_noirq().

The difference between freeze_noirq() and freeze() is analogous to the
difference between suspend_noirq() and suspend().

3.1.6. poweroff()
^^^^^^^^^^^^^^^^^

The poweroff() callback is hibernation-specific. It is executed when the system
is about to be powered off after saving a hibernation image to persistent
storage. prepare() callbacks are executed for all devices before poweroff() is
called.

The role of this callback is analogous to the role of the suspend() and freeze()
callbacks described above, although it does not need to save the contents of
the device's registers.
In particular, if the driver wants to put the device
into a low-power state itself instead of allowing the PCI subsystem to do that,
the poweroff() callback should use pci_prepare_to_sleep() and
pci_set_power_state() to prepare the device for system wakeup and to put it
into a low-power state, respectively, but it need not save the device's standard
configuration registers.

3.1.7. poweroff_noirq()
^^^^^^^^^^^^^^^^^^^^^^^

The poweroff_noirq() callback is hibernation-specific. It is executed after
poweroff() callbacks have been executed for all devices in the system.

The role of this callback is analogous to the role of the suspend_noirq() and
freeze_noirq() callbacks described above, but it does not need to save the
contents of the device's registers.

The difference between poweroff_noirq() and poweroff() is analogous to the
difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()
^^^^^^^^^^^^^^^^^^^^^

The resume_noirq() callback is only executed during system resume, after the
PM core has enabled the non-boot CPUs. The driver's interrupt handler will not
be invoked while resume_noirq() is running, so this callback can carry out
operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full-power
state in the resume_noirq phase of system resume and restores their standard
configuration registers, resume_noirq() is usually not necessary. In general
it should only be used for performing operations that would lead to race
conditions if carried out by resume().

3.1.9. resume()
^^^^^^^^^^^^^^^

The resume() callback is only executed during system resume, after
resume_noirq() callbacks have been executed for all devices in the system and
device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of the
device and bringing it back to the fully functional state. The device should be
able to process I/O in the usual way after resume() has returned.

3.1.10. thaw_noirq()
^^^^^^^^^^^^^^^^^^^^

The thaw_noirq() callback is hibernation-specific. It is executed after a
system image has been created and the non-boot CPUs have been enabled by the PM
core, in the thaw_noirq phase of hibernation. It also may be executed if the
loading of a hibernation image fails during system restore (it is then executed
after enabling the non-boot CPUs). The driver's interrupt handler will not be
invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq().
The difference between these two callbacks is that thaw_noirq() is executed
after freeze() and freeze_noirq(), so in general it does not need to modify the
contents of the device's registers.

3.1.11. thaw()
^^^^^^^^^^^^^^

The thaw() callback is hibernation-specific. It is executed after thaw_noirq()
callbacks have been executed for all devices in the system and after device
interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in the usual way after thaw() has returned.

3.1.12. restore_noirq()
^^^^^^^^^^^^^^^^^^^^^^^

The restore_noirq() callback is hibernation-specific. It is executed in the
restore_noirq phase of hibernation, when the boot kernel has passed control to
the image kernel and the non-boot CPUs have been enabled by the image kernel's
PM core.

This callback is analogous to resume_noirq() with the exception that it cannot
make any assumption on the previous state of the device, even if the BIOS (or
generally the platform firmware) is known to preserve that state over a
suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between
resume_noirq() and restore_noirq().

3.1.13. restore()
^^^^^^^^^^^^^^^^^

The restore() callback is hibernation-specific. It is executed after
restore_noirq() callbacks have been executed for all devices in the system and
after the PM core has enabled device drivers' interrupt handlers to be invoked.

This callback is analogous to resume(), just like restore_noirq() is analogous
to resume_noirq(). Consequently, the difference between restore_noirq() and
restore() is analogous to the difference between resume_noirq() and resume().

For the vast majority of PCI device drivers there is no difference between
resume() and restore().

3.1.14. complete()
^^^^^^^^^^^^^^^^^^

The complete() callback is executed in the following situations:

  - during system resume, after resume() callbacks have been executed for all
    devices,
  - during hibernation, before saving the system image, after thaw() callbacks
    have been executed for all devices,
  - during system restore, when the system is going back to its pre-hibernation
    state, after restore() callbacks have been executed for all devices.

It also may be executed if the loading of a hibernation image into memory fails
(in that case it is run after thaw() callbacks have been executed for all
devices that have drivers in the boot kernel).

This callback is entirely optional, although it may be necessary if the
prepare() callback performs operations that need to be reversed.

3.1.15. runtime_suspend()
^^^^^^^^^^^^^^^^^^^^^^^^^

The runtime_suspend() callback is specific to device runtime power management
(runtime PM). It is executed by the PM core's runtime PM framework when the
device is about to be suspended (i.e. quiesced and put into a low-power state)
at run time.

This callback is responsible for freezing the device and preparing it to be
put into a low-power state, but it must allow the PCI subsystem to perform all
of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()
^^^^^^^^^^^^^^^^^^^^^^^^

The runtime_resume() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework when the device is about to be resumed
(i.e. put into the full-power state and programmed to process I/O normally) at
run time.

This callback is responsible for restoring the normal functionality of the
device after it has been put into the full-power state by the PCI subsystem.
The device is expected to be able to process I/O in the usual way after
runtime_resume() has returned.

3.1.17. runtime_idle()
^^^^^^^^^^^^^^^^^^^^^^

The runtime_idle() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework whenever it may be desirable to suspend
the device according to the PM core's information. In particular, it is
automatically executed right after runtime_resume() has returned in case the
resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0, the
PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
cause the driver's runtime_suspend() callback to be executed.

3.1.18. Pointing Multiple Callback Pointers to One Routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it often is convenient to
point two or more members of struct dev_pm_ops to the same routine. There are
a few convenience macros that can be used for this purpose.

The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
members and one resume routine pointed to by the .resume(), .thaw(), and
.restore() members. The other function pointers in this struct dev_pm_ops are
unset.
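
The fan-out performed by SIMPLE_DEV_PM_OPS can be illustrated with a small
user-space model. This is only a sketch, not kernel code: struct
mock_dev_pm_ops, MOCK_SIMPLE_DEV_PM_OPS, foo_suspend() and foo_resume() are
hypothetical stand-ins introduced for illustration, and the real definitions
in include/linux/pm.h may differ in detail (for example, depending on the
kernel configuration).

```c
#include <stddef.h>

/* User-space stand-in for the kernel's struct device (assumption). */
struct device { const char *name; };

/* Reduced, hypothetical model of struct dev_pm_ops. */
struct mock_dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
	int (*runtime_suspend)(struct device *dev);
	int (*runtime_resume)(struct device *dev);
};

/*
 * Hypothetical macro mirroring the behavior described in the text: one
 * suspend routine for .suspend/.freeze/.poweroff, one resume routine for
 * .resume/.thaw/.restore.  Members not named here are left NULL.
 */
#define MOCK_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct mock_dev_pm_ops name = { \
		.suspend = suspend_fn, \
		.freeze = suspend_fn, \
		.poweroff = suspend_fn, \
		.resume = resume_fn, \
		.thaw = resume_fn, \
		.restore = resume_fn, \
	}

/* Hypothetical driver routines; a real driver would quiesce the device. */
static int foo_suspend(struct device *dev) { (void)dev; return 0; }
static int foo_resume(struct device *dev) { (void)dev; return 0; }

MOCK_SIMPLE_DEV_PM_OPS(foo_pm_ops, foo_suspend, foo_resume);
```

In this model the runtime PM pointers stay NULL, which matches the statement
above that the remaining function pointers are unset.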

The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
additionally sets the .runtime_resume() pointer to the same value as
.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to
the same value as .suspend() (and .freeze() and .poweroff()).

The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside of a declaration of struct
dev_pm_ops to indicate that one suspend routine is to be pointed to by the
.suspend(), .freeze(), and .poweroff() members and one resume routine is to
be pointed to by the .resume(), .thaw(), and .restore() members.

3.1.19. Driver Flags for Power Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The PM core allows device drivers to set flags that influence the handling of
power management for the devices by the core itself and by middle layer code
including the PCI bus type. The flags should be set once at the driver probe
time with the help of the dev_pm_set_driver_flags() function and they should not
be updated directly afterwards.

The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the
direct-complete mechanism allowing device suspend/resume callbacks to be skipped
if the device is in runtime suspend when the system suspend starts. That also
affects all of the ancestors of the device, so this flag should only be used if
absolutely necessary.

The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive
value from pci_pm_prepare() only if the ->prepare callback provided by the
driver of the device returns a positive value. That allows the driver to opt
out from using the direct-complete mechanism dynamically (whereas setting
DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out).

The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's
perspective the device can be safely left in runtime suspend during system
suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff()
to avoid resuming the device from runtime suspend unless there are PCI-specific
reasons for doing that. Also, it causes pci_pm_suspend_late/noirq() and
pci_pm_poweroff_late/noirq() to return early if the device remains in runtime
suspend during the "late" phase of the system-wide transition under way.
Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or
pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it
is going to be put into D0 going forward).

Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its
"noirq" and "early" resume callbacks to be skipped if the device can be left
in suspend after a system-wide transition into the working state.
This flag is taken into consideration by the PM core along with the
power.may_skip_resume status bit of the device which is set by
pci_pm_suspend_noirq() in certain situations. If the PM core determines that
the driver's "noirq" and "early" resume callbacks should be skipped, the
dev_pm_skip_resume() helper function will return "true" and that will cause
pci_pm_resume_noirq() and pci_pm_resume_early() to return upfront without
touching the device and executing the driver callbacks.

3.2. Device Runtime Power Management
------------------------------------

In addition to providing device power management callbacks PCI device drivers
are responsible for controlling the runtime power management (runtime PM) of
their devices.

The PCI device runtime PM is optional, but it is recommended that PCI device
drivers implement it at least in the cases where there is a reliable way of
verifying that the device is not used (like when the network cable is detached
from an Ethernet adapter or there are no devices attached to a USB controller).

To support the PCI runtime PM the driver first needs to implement the
runtime_suspend() and runtime_resume() callbacks.
It also may need to implement the runtime_idle() callback to prevent the device
from being suspended again every time right after the runtime_resume() callback
has returned (alternatively, the runtime_suspend() callback will have to check
if the device should really be suspended and return -EAGAIN if that is not the
case).

The runtime PM of PCI devices is enabled by default by the PCI core. PCI
device drivers do not need to enable it and should not attempt to do so.
However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid()
helper function. In addition to that, the runtime PM usage counter of
each PCI device is incremented by local_pci_probe() before executing the
probe callback provided by the device's driver.

If a PCI driver implements the runtime PM callbacks and intends to use the
runtime PM framework provided by the PM core and the PCI subsystem, it needs
to decrement the device's runtime PM usage counter in its probe callback
function. If it doesn't do that, the counter will always be different from
zero for the device and it will never be runtime-suspended. The simplest
way to do that is by calling pm_runtime_put_noidle(), but if the driver
wants to schedule an autosuspend right away, for example, it may call
pm_runtime_put_autosuspend() instead for this purpose. Generally, it
just needs to call a function that decrements the device's usage counter
from its probe routine to make runtime PM work for the device.

It is important to remember that the driver's runtime_suspend() callback
may be executed right after the usage counter has been decremented, because
user space may already have used sysfs to run the pm_runtime_allow() helper
function, which unblocks the runtime PM of the device, so the driver must
be prepared to cope with that.

The driver itself should not call pm_runtime_allow(), though. Instead, it
should let user space or some platform-specific code do that (user space can
do it via sysfs as stated above), but it must be prepared to handle the
runtime PM of the device correctly as soon as pm_runtime_allow() is called
(which may happen at any time, even before the driver is loaded).

When the driver's remove callback runs, it has to balance the decrementation
of the device's runtime PM usage counter at the probe time. For this reason,
if it has decremented the counter in its probe callback, it must run
pm_runtime_get_noresume() in its remove callback. [Since the core carries
out a runtime resume of the device and bumps up the device's usage counter
before running the driver's remove callback, the runtime PM of the device
is effectively disabled for the duration of the remove execution and all
runtime PM helper functions incrementing the device's usage counter are
then effectively equivalent to pm_runtime_get_noresume().]
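
The usage counter bookkeeping described above can be modeled in user space
as follows. This is a simplified sketch under stated assumptions, not the
kernel implementation: struct mock_pci_dev, every mock_*() function and the
foo_probe()/foo_remove() callbacks are hypothetical, the "blocked" flag is a
simplification of what pm_runtime_forbid() and pm_runtime_allow() do, and the
way the core drops its references at the end of mock_core_remove() is an
assumption made only to show that the counts balance.

```c
#include <stdbool.h>

/* Hypothetical user-space model of a PCI device's runtime PM state. */
struct mock_pci_dev {
	int usage_count;	/* runtime PM usage counter */
	bool blocked;		/* true while runtime PM is forbidden */
};

/* Models of the helpers named in the text (counter effect only). */
static void mock_put_noidle(struct mock_pci_dev *d) { d->usage_count--; }
static void mock_get_noresume(struct mock_pci_dev *d) { d->usage_count++; }

/* pci_pm_init() blocks runtime PM; user space may unblock it via sysfs. */
static void mock_pci_pm_init(struct mock_pci_dev *d) { d->blocked = true; }
static void mock_allow(struct mock_pci_dev *d) { d->blocked = false; }

/* The device may be runtime-suspended only if unblocked and unused. */
static bool mock_may_runtime_suspend(const struct mock_pci_dev *d)
{
	return !d->blocked && d->usage_count == 0;
}

/* Hypothetical driver callbacks: probe drops a reference, remove retakes it. */
static void foo_probe(struct mock_pci_dev *d) { mock_put_noidle(d); }
static void foo_remove(struct mock_pci_dev *d) { mock_get_noresume(d); }

/* Core side of probing: bump the counter, then call the driver. */
static void mock_core_probe(struct mock_pci_dev *d)
{
	d->usage_count++;	/* as local_pci_probe() does */
	foo_probe(d);
}

/* Core side of removal: hold a reference around the driver's remove. */
static void mock_core_remove(struct mock_pci_dev *d)
{
	d->usage_count++;	/* core resumes the device and takes a reference */
	foo_remove(d);
	d->usage_count -= 2;	/* assumed: drop remove-time and probe-time refs */
}
```

If foo_remove() omitted the mock_get_noresume() call in this model, the
counter would end up at -1 after removal, which is the imbalance the text
warns about.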

The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
subsequently request that they be suspended). These requests are represented
by work items put into the power management workqueue, pm_wq. Although there
are a few situations in which power management requests are automatically
queued by the PM core (for example, after processing a request to resume a
device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices. For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in
Documentation/power/runtime_pm.rst.

Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq. In the majority of cases this also is done by their
drivers that use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to
Documentation/power/runtime_pm.rst.


4. Resources
============

PCI Local Bus Specification, Rev. 3.0

PCI Bus Power Management Interface Specification, Rev. 1.2

Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b

PCI Express Base Specification, Rev. 2.0

Documentation/driver-api/pm/devices.rst

Documentation/power/runtime_pm.rst