==========================
PCI Bus EEH Error Recovery
==========================

Linas Vepstas <linas@austin.ibm.com>

12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Enhanced Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to corruption of both user data and kernel data,
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards, or,
unfortunately quite commonly, due to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching this is a powerful
feature of EEH, as it prevents what would otherwise have been silent
memory corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented.
This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
This includes access to PCI memory, I/O space, and PCI config
space.  Interrupts, however, will continue to be delivered.

Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).
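
The isolation behaviour described above can be illustrated with a small
user-space model.  This is only a sketch: ``struct sim_slot`` and the
``sim_*`` helpers are hypothetical names invented for illustration and
are not kernel or firmware interfaces.  It shows writes being dropped and
reads returning all ones at each access width while a slot is isolated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a kernel structure. */
struct sim_slot {
	bool isolated;	/* true once the PHB has isolated the slot */
	uint32_t reg;	/* a single 32-bit device register */
};

/* Writes to an isolated slot are silently dropped. */
static void sim_write32(struct sim_slot *s, uint32_t val)
{
	if (!s->isolated)
		s->reg = val;
}

/* Reads from an isolated slot return all ones at every width. */
static uint32_t sim_read32(const struct sim_slot *s)
{
	return s->isolated ? 0xffffffffu : s->reg;
}

static uint16_t sim_read16(const struct sim_slot *s)
{
	return s->isolated ? 0xffffu : (uint16_t)(s->reg & 0xffffu);
}

static uint8_t sim_read8(const struct sim_slot *s)
{
	return s->isolated ? 0xffu : (uint8_t)(s->reg & 0xffu);
}
```

Because an unplugged device reads back the same way, an all-ones value
alone does not prove that a slot has been isolated.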

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
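
The generic recovery sequence above (reset, restore config space,
reinitialize the driver, give up after a few attempts) can be sketched as
a bounded retry loop.  This is a minimal user-space illustration, not the
kernel's code: ``struct sim_dev``, the ``sim_*`` functions and the limit
``SIM_MAX_RESETS`` are assumptions made for the example, with the limit
taken from the "three or four resets" mentioned above.

```c
#include <stdbool.h>

#define SIM_MAX_RESETS 3	/* assumption: text says "three or four" */

/* Hypothetical device model; real drivers and firmware differ. */
struct sim_dev {
	int resets_needed;	/* resets until the card responds; < 0 = dead */
	int resets_done;
	bool config_restored;
	bool driver_ready;
};

/* Reset the card; returns true if it responds afterwards. */
static bool sim_reset(struct sim_dev *d)
{
	d->resets_done++;
	d->config_restored = false;
	d->driver_ready = false;
	return d->resets_needed >= 0 && d->resets_done >= d->resets_needed;
}

static void sim_restore_config(struct sim_dev *d)
{
	d->config_restored = true;	/* BARs, latency timer, and so on */
}

static void sim_reinit_driver(struct sim_dev *d)
{
	d->driver_ready = true;
}

/* Returns true if recovered; false means "report a dead card". */
static bool sim_recover(struct sim_dev *d)
{
	for (int attempt = 0; attempt < SIM_MAX_RESETS; attempt++) {
		if (!sim_reset(d))
			continue;	/* card still unresponsive, try again */
		sim_restore_config(d);
		sim_reinit_driver(d);
		return true;
	}
	return false;
}
```

A caller that gets ``false`` back would log the failure and leave the
card for the sysadmin to replace with the hotplug tools.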


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and if a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 hardware can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the pci device associated
with that address (if any).

The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the i/o read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages).  This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure.  Device
drivers, or other parts of the kernel, can use
`eeh_register_notifier(struct notifier_block *)` to find out about EEH
events.  The event will include a pointer to the pci device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
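
The read-side check described above (test for all ones, then ask the
firmware, and count false positives) can be sketched in user space as
follows.  The names ``struct chk_slot``, ``chk_firmware_slot_frozen()``
and ``chk_read32()`` are hypothetical stand-ins invented for this
example; they do not reproduce the io.h macros or
eeh_dn_check_failure().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a slot; not a kernel structure. */
struct chk_slot {
	bool frozen;	/* what the (simulated) firmware would report */
	uint32_t reg;	/* register contents when not frozen */
};

static unsigned long chk_false_positives;	/* like the /proc counter */
static unsigned long chk_firmware_queries;

/* Stand-in for the firmware call that reports the slot state. */
static bool chk_firmware_slot_frozen(const struct chk_slot *s)
{
	chk_firmware_queries++;
	return s->frozen;
}

/*
 * Read a register.  Firmware is consulted only when the value is all
 * ones; *failed is set when the firmware confirms a frozen slot.
 */
static uint32_t chk_read32(const struct chk_slot *s, bool *failed)
{
	uint32_t val = s->frozen ? 0xffffffffu : s->reg;

	*failed = false;
	if (val != 0xffffffffu)
		return val;		/* fast path: no firmware call */

	if (chk_firmware_slot_frozen(s))
		*failed = true;		/* true EEH error */
	else
		chk_false_positives++;	/* register really held all ones */
	return val;
}
```

The point of this ordering is that the common case costs a single
comparison, and the slower firmware query happens only for suspicious
values.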

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset()
   assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge()
   ask firmware to configure any PCI bridges
   located topologically under the pci slot.
eeh_save_bars() and eeh_restore_bars():
   save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for Ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for Ethernet cards).
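
The order of operations in the default handler can be sketched as a
listener on a small callback chain.  This is a user-space illustration
only: the ``hs_*`` names are hypothetical, the chain is a plain array
and not the kernel notifier API, and each step is merely recorded where
real code would perform it.

```c
#include <stddef.h>

/* Steps of the default handler, in the order the text gives them. */
enum hs_step {
	HS_SAVE_BARS, HS_UNCONFIG_ADAPTER, HS_WAIT_USERSPACE,
	HS_RESET_SLOT, HS_RESTORE_BARS, HS_CONFIGURE_BRIDGE,
	HS_ENABLE_SLOT
};

#define HS_MAX_STEPS     16
#define HS_MAX_LISTENERS 4

struct hs_log {
	enum hs_step steps[HS_MAX_STEPS];
	size_t count;
};

typedef void (*hs_listener)(struct hs_log *log);

static hs_listener hs_chain[HS_MAX_LISTENERS];
static size_t hs_chain_len;

/* Returns 0 on success, -1 if the chain is full. */
static int hs_register_listener(hs_listener fn)
{
	if (hs_chain_len == HS_MAX_LISTENERS)
		return -1;
	hs_chain[hs_chain_len++] = fn;
	return 0;
}

static void hs_record(struct hs_log *log, enum hs_step step)
{
	if (log->count < HS_MAX_STEPS)
		log->steps[log->count++] = step;
}

/* Hypothetical default handler: records each step in order. */
static void hs_default_handler(struct hs_log *log)
{
	hs_record(log, HS_SAVE_BARS);
	hs_record(log, HS_UNCONFIG_ADAPTER);	/* stops driver, emits uevents */
	hs_record(log, HS_WAIT_USERSPACE);	/* stands in for the 5 s sleep */
	hs_record(log, HS_RESET_SLOT);
	hs_record(log, HS_RESTORE_BARS);
	hs_record(log, HS_CONFIGURE_BRIDGE);
	hs_record(log, HS_ENABLE_SLOT);		/* restarts the driver */
}

/* Deliver one event to every registered listener. */
static void hs_notify_all(struct hs_log *log)
{
	for (size_t i = 0; i < hs_chain_len; i++)
		hs_chain[i](log);
}
```

Saving the BARs before the adapter is unconfigured matters because the
config-space contents are lost once the slot is reset.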


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that cause a device driver
close function to be called during the first phase of an EEH reset.
The following sequence is an example of the pcnet32 device driver::

    rpa_php_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                          calls dev->stop();
                          which is just pcnet32_close() // in pcnet32.c
                          {
                            which does what you wanted
                            to stop the device
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
    }}}}}}


In drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove()
which calls struct pci_driver->remove() which is pcnet32_remove_one()
which calls unregister_netdev()  (in net/core/dev.c)
which calls dev_close()  (in net/core/dev.c)
which calls dev->stop() which is pcnet32_close()
which then does the appropriate shutdown.
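
The chain above works because each layer only calls a function pointer
supplied by the layer below it.  The following user-space sketch shows
that pattern in miniature.  All ``rm_*`` names are hypothetical and the
structures are heavily simplified; they are not the kernel's
``struct pci_driver`` or ``struct net_device``.

```c
#include <stdbool.h>

/* Simplified stand-in for a network device with a stop hook. */
struct rm_netdev {
	bool running;
	void (*stop)(struct rm_netdev *nd);	/* like dev->stop() */
};

/* Simplified stand-in for a PCI driver with a remove hook. */
struct rm_pci_driver {
	void (*remove)(struct rm_netdev *nd);	/* like pci_driver->remove() */
};

/* Driver-specific close routine: quiesces the hardware. */
static void rm_nic_close(struct rm_netdev *nd)
{
	nd->running = false;
}

/* Network core: closes the device through its stop hook. */
static void rm_unregister_netdev(struct rm_netdev *nd)
{
	if (nd->running && nd->stop)
		nd->stop(nd);
}

/* Driver-specific remove routine: hands off to the network core. */
static void rm_nic_remove_one(struct rm_netdev *nd)
{
	rm_unregister_netdev(nd);
}

/* Bus layer: knows only the generic remove hook, not the driver. */
static void rm_bus_remove_device(const struct rm_pci_driver *drv,
				 struct rm_netdev *nd)
{
	if (drv && drv->remove)
		drv->remove(nd);
}
```

Calling ``rm_bus_remove_device()`` with a driver whose ``remove`` hook is
``rm_nic_remove_one()`` ends up in ``rm_nic_close()``, without the bus
layer naming either function.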

---

Following is the analogous stack trace for events sent to user-space
when the pci device is unconfigured::

  rpa_php_unconfig_pci_adapter() {             // in rpaphp_pci.c
    calls
    pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
      calls
      pci_destroy_dev (struct pci_dev *) {
        calls
        device_unregister (&dev->dev) {        // in /drivers/base/core.c
          calls
          device_del(struct device * dev) {    // in /drivers/base/core.c
            calls
            kobject_del() {                    // in /lib/kobject.c
              calls
              kobject_uevent() {               // in /lib/kobject.c
                calls
                kset_uevent() {                // in /lib/kobject.c
                  calls
                  kset->uevent_ops->uevent()   // which is really just
                  a call to
                  dev_uevent() {               // in /drivers/base/core.c
                    calls
                    dev->bus->uevent() which is really just a call to
                    pci_uevent () {            // in drivers/pci/hotplug.c
                      which prints device name, etc....
                    }
                  }
                  then kobject_uevent() sends a netlink uevent to userspace
                  --> userspace uevent
                  (during early boot, nobody listens to netlink events and
                  kobject_uevent() executes uevent_helper[], which runs the
                  event process /sbin/hotplug)
                }
              }
              kobject_del() then calls sysfs_remove_dir(), which would
              trigger any user-space daemon that was watching /sysfs,
              and notice the delete event.


Pros and Cons of the Current Design
-------------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-  A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the pci
   card was being rebooted.

-  A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-  If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


Conclusions
-----------
There's forward progress ...