.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.


What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. It provides three
basic functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the user.
  - Performs error recovery actions.

The AER driver attaches only to Root Ports that support the PCI Express
AER capability.


User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver has to be
compiled. The option CONFIG_PCIEAER supports this capability. It
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is output as a warning; any other error
is printed as an error. Users can therefore choose a log level that
filters out correctable error messages.

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes, which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.

Developer Guide
===============

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.
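
As a starting point, it may help to see how two fields of the error
output example in the User Guide map onto register layouts. The
following is a minimal user-space sketch, not kernel code. It assumes
the standard Requester ID layout (bus in bits 15:8, device in bits 7:3,
function in bits 2:0) and the bit positions of the AER Uncorrectable
Error Status register from the PCI Express specification. The function
names and the name strings in the table are hypothetical and chosen
only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One named bit of the AER Uncorrectable Error Status register. */
struct aer_bit_name {
	unsigned int bit;
	const char *name;
};

/* Bit positions per the PCI Express spec; the strings are illustrative. */
static const struct aer_bit_name uncor_bits[] = {
	{  4, "Data Link Protocol Error" },
	{  5, "Surprise Down Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
};

/* Split a 16-bit Requester ID into bus, device, and function numbers. */
void decode_requester_id(uint16_t id, unsigned int *bus,
			 unsigned int *dev, unsigned int *fn)
{
	*bus = (id >> 8) & 0xff;	/* bits 15:8 */
	*dev = (id >> 3) & 0x1f;	/* bits 7:3 */
	*fn = id & 0x07;		/* bits 2:0 */
}

/* Return the index of the lowest set bit in status, or -1 if none is set. */
int lowest_set_bit(uint32_t status)
{
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (status & ((uint32_t)1 << bit))
			return bit;
	return -1;
}

/* Look up the name of an uncorrectable error bit; NULL if not in the table. */
const char *uncor_bit_name(int bit)
{
	size_t i;

	for (i = 0; i < sizeof(uncor_bits) / sizeof(uncor_bits[0]); i++)
		if ((int)uncor_bits[i].bit == bit)
			return uncor_bits[i].name;
	return NULL;
}
```

For the values in the example, decode_requester_id(0x0500, ...) gives
bus 0x05, device 0, function 0, and lowest_set_bit(0x00100000) returns
20, which the table maps to "Unsupported Request", in line with the
[20] entry of the log.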

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device automatically sends an
error message to the PCIe Root Port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Logging the error information includes storing
the error reporting agent's requester ID in the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly.
If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.

Note that the errors described above are related to the PCI Express
hierarchy and links. They do not include any device-specific
errors, because device-specific errors are still sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They can also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See the
"helper functions" section.

Provide callbacks
-----------------

callback reset_link to reset pci express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

The "Non-correctable (non-fatal and fatal) errors" section provides
more detailed information on when reset_link is called.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with the hierarchy in question
when performing error recovery actions.

The data structure pci_driver has a pointer, err_handler, that points to
a pci_error_handlers structure, which consists of several callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express-specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link reset
at upstream is not required.
The AER driver calls error_detected(dev,
pci_channel_io_normal) on all drivers associated with the hierarchy in
question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover without a reset, considers the device unusable,
or needs a reset to recover. If it returns PCI_ERS_RESULT_CAN_RECOVER,
the AER driver calls mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset at upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function to reset the link via the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery
function uses it to reset the link. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.

helper functions
----------------
::

  int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected.
Note that devices
do not enable error reporting by default, so device drivers need
to call this function to enable it.

::

  int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

::

  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);

pci_aer_clear_nonfatal_status clears non-fatal errors in the uncorrectable
error status register.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  Devices bound to the driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please refer
  to the Developer Guide section for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports to which the service driver is attached.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the root
  port.

Q:
  What modifications will that driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clean up the uncorrectable status register. Please refer to the
  "helper functions" section.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.
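
Before going further, a test program may want to confirm that the
device file is actually present. The following is a minimal user-space
sketch under the assumption that /dev/aer_inject appears as a
character device node. The helper name is_char_device is hypothetical
and is not part of any kernel or aer-inject interface.

```c
#include <sys/stat.h>

/* Return 1 if path exists and is a character device, 0 otherwise. */
int is_char_device(const char *path)
{
	struct stat st;

	/* stat() fails if the node is missing or not accessible. */
	if (stat(path, &st) != 0)
		return 0;

	return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

Calling is_char_device("/dev/aer_inject") should return 1 once error
injection support is available. A result of 0 suggests that the option
is not enabled in the running kernel or that the module has not been
inserted yet.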

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation that
comes with its source code.