==========================
PCI Bus EEH Error Recovery
==========================

Linas Vepstas <linas@austin.ibm.com>

12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Enhanced Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to corruption of both user data and kernel data,
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections.  In practice, the vast majority of EEH errors
seen in "real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching this is a powerful
feature of EEH, as it prevents what would otherwise have been silent
memory corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented. This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
Isolation applies to accesses to PCI memory, I/O space, and PCI
config space.  Interrupts, however, will continue to be delivered.
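
The isolation behaviour described above can be illustrated with a
small user-space model.  This is only a sketch: struct fake_slot,
slot_read32() and slot_write32() are hypothetical names invented for
this example and do not exist in the kernel; real isolation is done
by the PHB hardware, not by software.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot behind a PHB (illustration only). */
struct fake_slot {
    bool isolated;   /* set once the PHB has detected an error */
    uint32_t reg;    /* stands in for a device register */
};

/* Reads from an isolated slot return all-ones, as if the card were unplugged. */
static uint32_t slot_read32(const struct fake_slot *slot)
{
    return slot->isolated ? 0xffffffffu : slot->reg;
}

/* Writes to an isolated slot are silently dropped. */
static void slot_write32(struct fake_slot *slot, uint32_t val)
{
    if (!slot->isolated)
        slot->reg = val;
}
```

In this model a driver sees nothing unusual on a write to an isolated
slot; only a later read that returns all-ones hints that something is
wrong.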

Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
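
The "reset a few times, then give up" policy described above might be
sketched as follows.  This is a user-space illustration, not kernel
code: struct toy_card, toy_reset_and_reinit(), recover_card() and the
limit of three resets are all assumptions made for this example.

```c
#include <stdbool.h>

#define MAX_RESETS 3    /* assumed limit; the text says "three or four" */

/* Toy card (hypothetical): it un-freezes after resets_needed resets. */
struct toy_card {
    bool isolated;
    int resets_needed;
    int resets_seen;
};

/* Stand-in for reset + config-space restore + driver reinitialization. */
static void toy_reset_and_reinit(struct toy_card *card)
{
    card->resets_seen++;
    if (card->resets_seen >= card->resets_needed)
        card->isolated = false;
}

/*
 * Returns the number of resets used (0 if the card was never isolated),
 * or -1 if the card is still isolated after MAX_RESETS attempts, in
 * which case the caller should report the card as dead.
 */
static int recover_card(struct toy_card *card)
{
    int attempt;

    if (!card->isolated)
        return 0;

    for (attempt = 1; attempt <= MAX_RESETS; attempt++) {
        toy_reset_and_reinit(card);
        if (!card->isolated)
            return attempt;
    }
    return -1;
}
```

A return value of -1 corresponds to the case where the failure is
reported to the sysadmin and the card is replaced with the hotplug
tools.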


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and whenever a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the PCI device associated
with that address (if any).
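
The idea of mapping an arbitrary address back to a registered device
can be sketched in user space as below.  The text does not say how
the kernel stores the ranges, so the flat array and linear scan here
are assumptions, and struct io_range and lookup_dev_by_addr() are
hypothetical names, not the kernel's pci_get_device_by_addr().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one registered I/O range (illustration only). */
struct io_range {
    uint64_t start;   /* first address of the range */
    uint64_t end;     /* last address of the range, inclusive */
    int dev_id;       /* stands in for a pointer to the PCI device */
};

/* Return the device owning addr, or -1 if no registered range contains it. */
static int lookup_dev_by_addr(const struct io_range *ranges, size_t n,
                              uint64_t addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (addr >= ranges[i].start && addr <= ranges[i].end)
            return ranges[i].dev_id;
    return -1;
}
```

An address that falls in no registered range yields -1, which mirrors
the "(if any)" in the description above.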

The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
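
The check-on-read logic described above can be modelled in user space
as follows.  This is a sketch under stated assumptions:
check_read32(), firmware_says_frozen() and the false_positives
counter are hypothetical stand-ins for the real io.h macros,
eeh_dn_check_failure() and the statistics in /proc/ppc64/eeh.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count of all-ones reads that turned out not to be EEH errors. */
static unsigned long false_positives;

/* Hypothetical stand-in for the firmware query; here it just reads a flag. */
static bool firmware_says_frozen(const bool *slot_frozen)
{
    return *slot_frozen;
}

/*
 * Post-process the value returned by a 32-bit read.  Only an all-ones
 * value triggers the firmware query.  Returns true if the slot is
 * really frozen.
 */
static bool check_read32(uint32_t val, const bool *slot_frozen)
{
    if (val != 0xffffffffu)
        return false;               /* ordinary data, no query needed */
    if (firmware_says_frozen(slot_frozen))
        return true;                /* genuine EEH error */
    false_positives++;              /* device legitimately returned all-ones */
    return false;
}
```

Note that in this model a frozen slot is only noticed when a read
returns all-ones, which matches the later remark that detection
usually happens slightly after the error itself.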

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages).  This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, the EEH code uses the Linux kernel notifier chain/work queue
mechanism to allow any interested parties to find out about the
failure.  Device drivers, or other parts of the kernel, can use
`eeh_register_notifier(struct notifier_block *)` to find out about EEH
events.  The event will include a pointer to the PCI device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
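
For readers unfamiliar with notifier chains, the register-then-notify
pattern can be imitated in user space as below.  This is a simplified
sketch, not the kernel API: struct toy_notifier,
toy_register_notifier(), toy_notify() and count_event() are
hypothetical names, only loosely modelled on struct notifier_block.

```c
#include <stddef.h>

/* Simplified user-space imitation of a notifier chain entry. */
struct toy_notifier {
    int (*call)(struct toy_notifier *self, unsigned long event, void *data);
    struct toy_notifier *next;
};

static struct toy_notifier *eeh_chain;   /* head of the singly linked chain */

/* Add a receiver to the front of the chain. */
static void toy_register_notifier(struct toy_notifier *nb)
{
    nb->next = eeh_chain;
    eeh_chain = nb;
}

/* Call every registered receiver; return how many were called. */
static int toy_notify(unsigned long event, void *data)
{
    struct toy_notifier *nb;
    int n = 0;

    for (nb = eeh_chain; nb != NULL; nb = nb->next) {
        nb->call(nb, event, data);
        n++;
    }
    return n;
}

/* Example receiver: it simply counts the events it sees. */
static int seen_events;

static int count_event(struct toy_notifier *self, unsigned long event,
                       void *data)
{
    (void)self;
    (void)event;
    (void)data;
    seen_events++;
    return 0;
}
```

A receiver would fill in a struct toy_notifier with its callback,
register it once, and from then on be called for every event passed
to toy_notify().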

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset()
   assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge()
   ask firmware to configure any PCI bridges
   located topologically under the PCI slot.
eeh_save_bars() and eeh_restore_bars()
   save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for Ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for Ethernet cards).


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a PCI slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that causes a device driver
close function to be called during the first phase of an EEH reset,
using the pcnet32 device driver as the example::

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                          calls dev->stop();
                          which is just pcnet32_close() // in pcnet32.c
                          {
                            which does what you wanted
                            to stop the device
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
                  }
                }
              }
            }
          }
        }
      }
    }


In drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove(),
which calls struct pci_driver->remove(), which is pcnet32_remove_one(),
which calls unregister_netdev() (in net/core/dev.c),
which calls dev_close() (in net/core/dev.c),
which calls dev->stop(), which is pcnet32_close(),
which then does the appropriate shutdown.

---

Following is the analogous stack trace for events sent to user-space
when the PCI device is unconfigured::

  rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
    calls
    pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
      calls
      pci_destroy_dev (struct pci_dev *) {
        calls
        device_unregister (&dev->dev) {        // in /drivers/base/core.c
          calls
          device_del(struct device * dev) {    // in /drivers/base/core.c
            calls
            kobject_del() {                    // in /lib/kobject.c
              calls
              kobject_uevent() {               // in /lib/kobject.c
                calls
                kset_uevent() {                // in /lib/kobject.c
                  calls
                  kset->uevent_ops->uevent()   // which is really just
                  a call to
                  dev_uevent() {               // in /drivers/base/core.c
                    calls
                    dev->bus->uevent() which is really just a call to
                    pci_uevent () {            // in drivers/pci/hotplug.c
                      which prints device name, etc....
                    }
                  }
                }
                then kobject_uevent() sends a netlink uevent to userspace
                --> userspace uevent
                (during early boot, nobody listens to netlink events and
                kobject_uevent() executes uevent_helper[], which runs the
                event process /sbin/hotplug)
              }
              kobject_del() then calls sysfs_remove_dir(), which would
              trigger any user-space daemon that was watching sysfs
              to notice the delete event.
            }
          }
        }
      }
    }
  }


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-  A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the PCI
   card was being rebooted.

-  A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-  If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.
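
The suggestion in the second item above, adding an EEH reset to the
end of the existing cascade of SCSI resets, can be pictured with the
following toy model.  It is not taken from the SCSI subsystem: enum
reset_level and escalate() are hypothetical, and the works[] array
merely stands in for actually attempting each kind of reset.

```c
#include <stdbool.h>

/* Escalating recovery steps, mildest first; the last is the proposed addition. */
enum reset_level {
    RESET_SCSI_DEVICE,
    RESET_SCSI_BUS,
    RESET_SCSI_HBA,
    RESET_EEH_SLOT,
    RESET_LEVELS        /* number of levels; also means "all failed" */
};

/*
 * Try each level in order until one succeeds.  works[level] says
 * whether a reset at that level would bring the device back.
 * Returns the level that succeeded, or RESET_LEVELS if none did.
 */
static enum reset_level escalate(const bool works[RESET_LEVELS])
{
    int level;

    for (level = 0; level < RESET_LEVELS; level++)
        if (works[level])
            return (enum reset_level)level;
    return RESET_LEVELS;
}
```

In this picture the EEH slot reset is attempted only after the
milder SCSI-level resets have failed, and the block layer would see
none of it.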


Conclusions
-----------
There's forward progress ...