.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.


What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components providing a minimum defined
set of error reporting requirements. Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure providing more robust error reporting.

The PCI Express AER driver provides the infrastructure to support PCI
Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the user.
  - Performs error recovery actions.

The AER driver attaches only to Root Ports that support the PCI Express
AER capability.


User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver must be
compiled into the kernel. The option CONFIG_PCIEAER enables it. It
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. A correctable error is output as a warning; any other error is
printed as an error. Users can therefore choose a log level that
filters out correctable error messages.

An example is shown below::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0:    [20] Unsupported Request    (First)
  0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to the PCI Express
specifications for the other fields.
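
The id=0500 value in the example can be decoded by hand. The following is
a minimal userspace sketch, assuming the usual Requester ID layout of
bus in bits 15:8, device in bits 7:3 and function in bits 2:0. The
requester_id structure and both function names are hypothetical and are
not kernel interfaces::

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical helper type for illustration; not a kernel structure. */
  struct requester_id {
      uint8_t bus;
      uint8_t dev;
      uint8_t fn;
  };

  /* Split a 16-bit Requester ID into bus[15:8], device[7:3], function[2:0]. */
  static struct requester_id decode_requester_id(uint16_t id)
  {
      struct requester_id r;

      r.bus = (uint8_t)(id >> 8);
      r.dev = (uint8_t)((id >> 3) & 0x1f);
      r.fn = (uint8_t)(id & 0x7);
      return r;
  }

  /* Format the ID as "bb:dd.f"; buf should hold at least 8 bytes. */
  static void format_requester_id(uint16_t id, char *buf, size_t len)
  {
      struct requester_id r = decode_requester_id(id);

      snprintf(buf, len, "%02x:%02x.%x", r.bus, r.dev, r.fn);
  }

Under that assumption, 0x0500 corresponds to bus 05, device 00,
function 0.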

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes, which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
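
As an illustration of consuming these counters from user space, the
sketch below looks up one counter in the text read from such an
attribute. It assumes each line has the form "<error name> <count>", as
in the examples of the ABI document; check that document for the exact
attribute names and format on your kernel. The function name
aer_stat_lookup is hypothetical::

  #include <ctype.h>
  #include <stdlib.h>
  #include <string.h>

  /*
   * Look up one counter in the text of an AER statistics attribute.
   * Assumes each line is "<error name> <count>". Returns the count,
   * or -1 if no line starts with exactly that name.
   */
  static long aer_stat_lookup(const char *text, const char *name)
  {
      size_t name_len = strlen(name);
      const char *line = text;

      while (*line) {
          const char *end = strchr(line, '\n');
          size_t len = end ? (size_t)(end - line) : strlen(line);

          /* Match "<name> <digit>..." at the start of the line. */
          if (len > name_len + 1 &&
              strncmp(line, name, name_len) == 0 &&
              line[name_len] == ' ' &&
              isdigit((unsigned char)line[name_len + 1]))
              return strtol(line + name_len + 1, NULL, 10);

          if (!end)
              break;
          line = end + 1;
      }
      return -1;
  }

The caller is expected to read the attribute file into a
NUL-terminated buffer first; that step is left out to keep the sketch
short.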

Developer Guide
===============

AER-aware support requires a software driver to configure the AER
capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe Root Port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
capability structure. The logged error information includes the
error reporting agent's Requester ID, stored in the Error Source
Identification Registers, and the corresponding error bits set in the
Root Error Status Register. If AER error reporting is enabled in the
Root Error Command Register, the Root Port generates an interrupt when
an error is detected.
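
The register contents described above can be interpreted as in the
following userspace sketch. It assumes the Error Source Identification
register holds the correctable source ID in its low 16 bits and the
uncorrectable (fatal/non-fatal) source ID in its high 16 bits. The two
bit definitions are meant to mirror those in
include/uapi/linux/pci_regs.h and should be checked against your tree.
The structure and function names are hypothetical::

  #include <stdbool.h>
  #include <stdint.h>

  /* Root Error Status bits; assumed to match linux/pci_regs.h. */
  #define PCI_ERR_ROOT_COR_RCV    0x00000001u  /* ERR_COR received */
  #define PCI_ERR_ROOT_UNCOR_RCV  0x00000004u  /* ERR_FATAL/NONFATAL received */

  /* Hypothetical result type for illustration only. */
  struct root_error_source {
      bool cor_valid;     /* a correctable error was logged */
      uint16_t cor_id;    /* Requester ID of its reporting agent */
      bool uncor_valid;   /* an uncorrectable error was logged */
      uint16_t uncor_id;  /* Requester ID of its reporting agent */
  };

  /* Combine Root Error Status and Error Source Identification values. */
  static struct root_error_source
  decode_root_error(uint32_t root_status, uint32_t err_src)
  {
      struct root_error_source s;

      s.cor_valid = (root_status & PCI_ERR_ROOT_COR_RCV) != 0;
      s.cor_id = (uint16_t)(err_src & 0xffff);
      s.uncor_valid = (root_status & PCI_ERR_ROOT_UNCOR_RCV) != 0;
      s.uncor_id = (uint16_t)(err_src >> 16);
      return s;
  }

For instance, a status value with only the uncorrectable bit set and a
source value of 0x05000000 would identify Requester ID 0x0500 as the
reporting agent.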

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change the AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See the
helper functions section below.
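
To make the role of the mask and severity registers concrete, the
sketch below splits an Uncorrectable Error Status value into fatal and
non-fatal bits. It assumes the usual AER semantics: a bit set in the
mask register suppresses reporting of that error, and a bit set in the
severity register marks that error as fatal. The structure and function
names are hypothetical::

  #include <stdint.h>

  /* Hypothetical result type for illustration only. */
  struct uncor_split {
      uint32_t fatal;     /* unmasked bits whose severity bit is 1 */
      uint32_t nonfatal;  /* unmasked bits whose severity bit is 0 */
  };

  /* Classify the unmasked uncorrectable status bits by severity. */
  static struct uncor_split
  split_uncorrectable(uint32_t status, uint32_t mask, uint32_t severity)
  {
      struct uncor_split s;
      uint32_t reported = status & ~mask;

      s.fatal = reported & severity;
      s.nonfatal = reported & ~severity;
      return s;
  }

With the status/mask pair 00100000/00000000 from the log example in the
User Guide, bit 20 is unmasked, so it lands in the fatal or the
non-fatal set depending only on bit 20 of the severity register.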

Provide callbacks
-----------------

Callback reset_link to reset PCI Express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different requirements for resetting the PCI Express link, so all
upstream ports should provide their own reset_link functions.

The section "Non-correctable (non-fatal and fatal) errors" below
describes in more detail when reset_link is called.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.

The pci_driver structure has a pointer, err_handler, that points to a
pci_error_handlers structure, which consists of several callback
function pointers. The AER driver follows the rules defined in
Documentation/PCI/pci-error-recovery.rst except for the PCI Express
specific parts (e.g. reset_link). Please refer to that document for
detailed definitions of the callbacks.

The sections below specify when the error callback functions are called.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) on all drivers associated with the hierarchy in
question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. If the drivers report that they can recover,
the AER driver calls mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A link reset upstream is then
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
link reset function through the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery
function uses it to reset the link. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.
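
The way a driver's error_detected callback might map the channel state
to a result can be sketched as follows. This is a userspace mock, not
driver code: the two enums are stand-ins that reuse the kernel's names
with illustrative values, struct pci_dev is left opaque, and
demo_error_detected is a hypothetical name. The policy shown (recover
on a normal channel, request a reset on a frozen one, disconnect
otherwise) is one common choice, not a requirement::

  /* Userspace stand-ins for kernel types; values are illustrative. */
  typedef enum {
      pci_channel_io_normal = 1,
      pci_channel_io_frozen = 2,
      pci_channel_io_perm_failure = 3
  } pci_channel_state_t;

  typedef enum {
      PCI_ERS_RESULT_NONE = 1,
      PCI_ERS_RESULT_CAN_RECOVER,
      PCI_ERS_RESULT_NEED_RESET,
      PCI_ERS_RESULT_DISCONNECT,
      PCI_ERS_RESULT_RECOVERED
  } pci_ers_result_t;

  struct pci_dev;  /* opaque in this mock */

  /* Hypothetical callback: choose a result from the channel state. */
  static pci_ers_result_t
  demo_error_detected(struct pci_dev *dev, pci_channel_state_t state)
  {
      (void)dev;  /* a real driver would quiesce the device here */

      switch (state) {
      case pci_channel_io_normal:
          /* Non-fatal error: the link still works, try to continue. */
          return PCI_ERS_RESULT_CAN_RECOVER;
      case pci_channel_io_frozen:
          /* Fatal error: I/O is blocked until the link is reset. */
          return PCI_ERS_RESULT_NEED_RESET;
      default:
          /* Permanent failure: give up on the device. */
          return PCI_ERS_RESULT_DISCONNECT;
      }
  }

In a real driver a function with this shape would be assigned to the
error_detected member of the pci_error_handlers structure referenced by
pci_driver.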

Helper functions
----------------

::

  int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

::

  int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.

::

  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);

pci_aer_clear_nonfatal_status clears non-fatal errors in the uncorrectable
error status register.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please refer
  to the Developer Guide above for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports to which the service driver is attached.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the Root
  Port.

Q:
  What modifications will that driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in its devices and
  clean up the uncorrectable status register. Please refer to the
  helper functions section above.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which is available
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation
that comes with its source code.