.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime over a period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime.
It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify if the probability of hardware errors is increasing and, in such
a case, do preventive maintenance to replace a degraded component while
those errors are still correctable.

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold. If the number of errors is above that threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors exceeded the error
  correction threshold, and the system was unable to auto-correct them.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, possibly replacing the affected hardware with a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.
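
The difference between a correctable and an uncorrected error can be
illustrated with a toy code. The sketch below is only an illustration, not
what any particular memory controller implements: it assumes a tiny
Hamming(7,4) code extended with one overall parity bit (a SECDED scheme),
whereas real hardware uses much wider codes and does the work in silicon.
The function names ``encode`` and ``decode`` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions (parity bits at 1, 2 and 4).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    for bit in code[1:]:
        code[0] ^= bit                      # overall parity over bits 1..7
    return code


def decode(code):
    """Return (nibble, status), with status one of "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    overall = 0
    for bit in code:
        overall ^= bit                      # 0 when the overall parity holds
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd number of flipped bits: treat it as a single-bit error.
        # The syndrome is the position of the bad bit (0 = parity bit itself).
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        # The error is detected, but it cannot be located or corrected.
        status = "UE"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)

single = list(word)
single[5] ^= 1                              # one flipped bit
print(decode(single))

double = list(word)
double[3] ^= 1                              # two flipped bits
double[6] ^= 1
print(decode(double)[1])
```

With one flipped bit the decoder recovers the original data and reports a
CE; with two flipped bits it can only report that the data is bad, which
corresponds to a UE.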

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

        Memory Device
                Total Width: 64 bits
                Data Width: 64 bits
                Size: 16384 MB
                Form Factor: SODIMM
                Set: None
                Locator: ChannelA-DIMM0
                Bank Locator: BANK 0
                Type: DDR4
                Type Detail: Synchronous
                Speed: 2133 MHz
                Rank: 2
                Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such system, the *total width* is equal to the
*data width*.
It means that such memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

        Memory Device
                Array Handle: 0x1000
                Error Information Handle: Not Provided
                Total Width: 72 bits
                Data Width: 64 bits
                Size: 8192 MB
                Form Factor: DIMM
                Set: 1
                Locator: DIMM_A1
                Bank Locator: Not Specified
                Type: DDR3
                Type Detail: Synchronous Registered (Buffered)
                Speed: 1600 MHz
                Rank: 2
                Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
Such kind of memory is called Error-correcting code memory (ECC memory).

To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.
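
The width comparison described above is easy to automate. The following is a
minimal sketch, assuming the ``dmidecode`` output has already been captured
as text in the format shown in the two examples; the helper names
(``parse_memory_devices``, ``describe``) are hypothetical, and the choice of
the *locator* field as the label is only a guess that, as noted above, does
not hold on every system.

```python
import re


def parse_memory_devices(text):
    """Split dmidecode-style text into one dict per "Memory Device" record."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            current = None                  # a blank line ends the record
        elif stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def width_bits(value):
    """Return the integer from a string like "72 bits", or None."""
    match = re.match(r"(\d+) bits", value or "")
    return int(match.group(1)) if match else None


def describe(device):
    """Return (label, has_ecc) for one parsed record.

    Assumption: the Locator field carries the label. Some systems use
    Bank Locator instead, so treat the label as a hint only.
    """
    total = width_bits(device.get("Total Width"))
    data = width_bits(device.get("Data Width"))
    has_ecc = total is not None and data is not None and total > data
    return device.get("Locator"), has_ecc


sample = """
Memory Device
        Total Width: 72 bits
        Data Width: 64 bits
        Locator: DIMM_A1
        Bank Locator: Not Specified

Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
"""

for dev in parse_memory_devices(sample):
    print(describe(dev))
```

A module is reported as ECC-capable only when its *total width* is larger
than its *data width*, which is the same rule applied by hand in the text.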

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits which are used for the error detection and correction mechanisms
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.

The information about the CE/UE errors is stored on some special registers
at the memory controller and can be accessed by reading such registers,
either by BIOS, by some special CPUs or by the Linux EDAC driver.
On x86 64-bit
CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation on a
  mode called "Lock-Step", where it groups two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. At read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, on
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device.
The attribute is::

        broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary. If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

        $ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

        ======= ==============================
        mc      memory controller(s) system
        pci     PCI control and status system
        ======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value.
Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:

        +------------+-----------------------+
        | CS Rows    |       Channels        |
        +------------+-----------+-----------+
        |            |  ``ch0``  |  ``ch1``  |
        +============+===========+===========+
        |            |**DIMM_A0**|**DIMM_B0**|
        +------------+-----------+-----------+
        | ``csrow0`` |   rank0   |   rank0   |
        +------------+-----------+-----------+
        | ``csrow1`` |   rank1   |   rank1   |
        +------------+-----------+-----------+
        |            |**DIMM_A1**|**DIMM_B1**|
        +------------+-----------+-----------+
        | ``csrow2`` |   rank0   |   rank0   |
        +------------+-----------+-----------+
        | ``csrow3`` |   rank1   |   rank1   |
        +------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

        +---------+---------+
        | DIMM_A0 | DIMM_B0 |
        +---------+---------+
        | DIMM_A1 | DIMM_B1 |
        +---------+---------+

Labels for these slots are usually silk-screened on
the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above, 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module, see
``rankX`` directories below.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

        ..../edac/mc/
                   |
                   |->mc0
                   |->mc1
                   |->mc2
                   ....
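
The csrow and channel layout of the two-channel example table above can be
written down as a small lookup. This is a sketch for that specific example
only: real boards differ, and the function name ``slot_for`` is invented for
illustration.

```python
def slot_for(csrow, channel):
    """Map (csrow, channel) to (slot label, rank) for the example layout.

    Assumes the 2-channel, 4-csrow table from this document: slots labeled
    "A" sit on channel 0, slots labeled "B" on channel 1, and each physical
    DIMM spans two consecutive csrows (one per rank).
    """
    if channel not in (0, 1) or not 0 <= csrow <= 3:
        raise ValueError("outside the example layout")
    letter = "AB"[channel]
    return f"DIMM_{letter}{csrow // 2}", csrow % 2


for csrow in range(4):
    for channel in range(2):
        label, rank = slot_for(csrow, channel)
        print(f"csrow{csrow} ch{channel} -> {label} rank{rank}")
```

On real hardware this mapping is motherboard specific, so it has to come
from board documentation rather than from a formula like this one.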

Under each ``mcX`` directory, each csrow is represented by a
``csrowX`` directory, where ``X`` is the csrow index::

        .../mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
                ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

        Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

        /sys/devices/system/edac/
        ├── mc
        │   ├── mc0
        │   │   ├── ce_count
        │   │   ├── ce_noinfo_count
        │   │   ├── dimm0
        │   │   │   ├── dimm_ce_count
        │   │   │   ├── dimm_dev_type
        │   │   │   ├── dimm_edac_mode
        │   │   │   ├── dimm_label
        │   │   │   ├── dimm_location
        │   │   │   ├── dimm_mem_type
        │   │   │   ├── dimm_ue_count
        │   │   │   ├── size
        │   │   │   └── uevent
        │   │   ├── max_location
        │   │   ├── mc_name
        │   │   ├── reset_counters
        │   │   ├── seconds_since_reset
        │   │   ├── size_mb
        │   │   ├── ue_count
        │   │   ├── ue_noinfo_count
        │   │   └── uevent
        │   ├── mc1
        │   │   ├── ce_count
        │   │   ├── ce_noinfo_count
        │   │   ├── dimm0
        │   │   │   ├── dimm_ce_count
        │   │   │   ├── dimm_dev_type
        │   │   │   ├── dimm_edac_mode
        │   │   │   ├── dimm_label
        │   │   │   ├── dimm_location
        │   │   │   ├── dimm_mem_type
        │   │   │   ├── dimm_ue_count
        │   │   │   ├── size
        │   │   │   └── uevent
        │   │   ├── max_location
        │   │   ├── mc_name
        │   │   ├── reset_counters
        │   │   ├── seconds_since_reset
        │   │   ├── size_mb
        │   │   ├── ue_count
        │   │   ├── ue_noinfo_count
        │   │   └── uevent
        │   └── uevent
        └── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
this ``X`` memory module:

- ``size`` - Total memory managed by this DIMM attribute file

        This attribute file displays, in count of megabytes, the memory
        that this DIMM contains.

- ``dimm_ue_count`` - Uncorrectable Errors count attribute file

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this DIMM. If panic_on_ue is set
        this counter will not have a chance to increment, since EDAC
        will panic the system.

- ``dimm_ce_count`` - Correctable Errors count attribute file

        This attribute file displays the total count of correctable
        errors that have occurred on this DIMM. This count is very
        important to examine. CEs provide early indications that a
        DIMM is beginning to fail. This count field should be
        monitored for non-zero values, and such information should be
        reported to the system administrator.

- ``dimm_dev_type`` - Device type attribute file

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:

                - x1
                - x2
                - x4
                - x8

- ``dimm_edac_mode`` - EDAC Mode of operation attribute file

        This attribute file will display what type of Error detection
        and correction is being utilized.

- ``dimm_label`` - memory module label control file

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.

- ``dimm_location`` - location of the memory module

        The location can have up to 3 levels, and describes how the
        memory controller identifies the location of a memory module.
        Depending on the type of memory and memory controller, it
        can be:

                - *csrow* and *channel* - used when the memory controller
                  doesn't identify a single DIMM - e.g.
                  in ``rankX`` dir;
                - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
                  controllers;
                - *channel*, *slot* - used on Nehalem and newer Intel drivers.

- ``dimm_mem_type`` - Memory Type attribute file

        This attribute file will display what type of memory is currently
        on this csrow. Normally, either buffered or unbuffered memory.
        Examples:

                - Registered-DDR
                - Unbuffered-DDR

.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is called
  ``rankX`` and works in a similar way to the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.

.. [#f6] There are also some ``power`` directories and ``subsystem``
  symlinks inside the sysfs mapping that are automatically created by
  the sysfs subsystem. Currently, they serve no purpose.

``csrowX`` directories
----------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.

In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:

- ``ue_count`` - Total Uncorrectable Errors count attribute file

  This attribute file displays the total count of uncorrectable
  errors that have occurred on this csrow. If panic_on_ue is set
  this counter will not have a chance to increment, since EDAC
  will panic the system.

- ``ce_count`` - Total Correctable Errors count attribute file

  This attribute file displays the total count of correctable
  errors that have occurred on this csrow. This count is very
  important to examine. CEs provide early indications that a
  DIMM is beginning to fail. This count field should be
  monitored for non-zero values, and any such value should be
  reported to the system administrator.

- ``size_mb`` - Total memory managed by this csrow attribute file

  This attribute file displays, in megabytes, the memory
  that this csrow contains.

- ``mem_type`` - Memory Type attribute file

  This attribute file displays the type of memory currently
  on this csrow. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR

- ``edac_mode`` - EDAC Mode of operation attribute file

  This attribute file will display what type of Error detection
  and correction is being utilized.

- ``dev_type`` - Device type attribute file

  This attribute file will display what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8

- ``ch0_ce_count`` - Channel 0 CE Count attribute file

  This attribute file will display the count of CEs on this
  DIMM located in channel 0.

- ``ch0_ue_count`` - Channel 0 UE Count attribute file

  This attribute file will display the count of UEs on this
  DIMM located in channel 0.

- ``ch0_dimm_label`` - Channel 0 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.

- ``ch1_ce_count`` - Channel 1 CE Count attribute file

  This attribute file will display the count of CEs on this
  DIMM located in channel 1.

- ``ch1_ue_count`` - Channel 1 UE Count attribute file

  This attribute file will display the count of UEs on this
  DIMM located in channel 1.

- ``ch1_dimm_label`` - Channel 1 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.
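
The advice above to watch ``ce_count`` for non-zero values can be sketched
as a small shell function. This is only an illustration under assumptions:
the function name ``report_csrow_ce`` and its optional base-directory
argument are hypothetical, and a production system would more likely rely
on an existing monitoring tool.

```shell
#!/bin/sh
# Hypothetical helper: print every csrow whose correctable error count
# is non-zero. $1 is the EDAC memory controller base directory and
# defaults to the sysfs location used throughout this document.
report_csrow_ce() {
    base=${1:-/sys/devices/system/edac/mc}
    for f in "$base"/mc*/csrow*/ce_count; do
        [ -r "$f" ] || continue        # unmatched glob or unreadable file
        count=$(cat "$f")
        # Non-numeric content makes the test fail quietly and is skipped.
        if [ "$count" -gt 0 ] 2>/dev/null; then
            echo "${f%/ce_count}: $count correctable errors"
        fi
    done
}
```

Calling ``report_csrow_ce`` without arguments scans the live sysfs tree;
running it periodically (from cron, for instance) gives a basic early
warning for a DIMM that is starting to fail.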


System Logging
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected::

  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first line above as the example, is:

+---------------------------------------+-------------+
| Content                               | Example     |
+=======================================+=============+
| The memory controller                 | MC0         |
+---------------------------------------+-------------+
| Error type                            | CE          |
+---------------------------------------+-------------+
| Memory page                           | 0x283       |
+---------------------------------------+-------------+
| Offset in the page                    | 0xce0       |
+---------------------------------------+-------------+
| The byte granularity                  | grain 8     |
| or resolution of the error            |             |
+---------------------------------------+-------------+
| The error syndrome                    | 0x6ec3      |
+---------------------------------------+-------------+
| Memory row                            | row 0       |
+---------------------------------------+-------------+
| Memory channel                        | channel 1   |
+---------------------------------------+-------------+
| DIMM label, if set prior              | DIMM_B1     |
+---------------------------------------+-------------+
| And then an optional, driver-specific |             |
| message that may have additional      |             |
| information.                          |             |
+---------------------------------------+-------------+

UE and CE messages with no info contain only the memory controller, the
error type, a notice of "no info" and then an optional, driver-specific
error message.


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.


Sysfs configuration
-------------------

Under ``/sys/devices/system/edac/pci`` are control and attribute files as
follows:

- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file

  This control file enables or disables the PCI Bus Parity scanning
  operation. Writing a 1 to this file enables the scanning. Writing
  a 0 to this file disables the scanning.

  Enable::

    echo "1" >/sys/devices/system/edac/pci/check_pci_parity

  Disable::

    echo "0" >/sys/devices/system/edac/pci/check_pci_parity

- ``pci_parity_count`` - Parity Count

  This attribute file will display the number of parity errors that
  have been detected.


Module parameters
-----------------

- ``edac_mc_panic_on_ue`` - Panic on UE control file

  When set, an uncorrectable error will cause a machine panic. This is
  usually desirable. It is a bad idea to continue when an uncorrectable
  error occurs - it is indeterminate what was uncorrected and the operating
  system context might be so mangled that continuing will lead to further
  corruption. If the kernel has MCE configured, then EDAC will never
  notice the UE.

  LOAD TIME::

    module/kernel parameter: edac_mc_panic_on_ue=[0|1]

  RUN TIME::

    echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

- ``edac_mc_log_ue`` - Log UE control file

  Generate kernel messages describing uncorrectable errors. These errors
  are reported through the system message log.
  UE statistics will be accumulated even when UE logging is disabled.

  LOAD TIME::

    module/kernel parameter: edac_mc_log_ue=[0|1]

  RUN TIME::

    echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue

- ``edac_mc_log_ce`` - Log CE control file

  Generate kernel messages describing correctable errors. These
  errors are reported through the system message log.
  CE statistics will be accumulated even when CE logging is disabled.

  LOAD TIME::

    module/kernel parameter: edac_mc_log_ce=[0|1]

  RUN TIME::

    echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce

- ``edac_mc_poll_msec`` - Polling period control file

  The time period, in milliseconds, for polling for error information.
  Too small a value wastes resources. Too large a value might delay
  necessary handling of errors and might lose valuable information for
  locating the error. 1000 milliseconds (once each second) is the current
  default. Systems which require all the bandwidth they can get may
  increase this.

  LOAD TIME::

    module/kernel parameter: edac_mc_poll_msec=[milliseconds]

  RUN TIME::

    echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec

- ``panic_on_pci_parity`` - Panic on PCI PARITY Error

  This control file enables or disables panicking when a parity
  error has been detected.

  module/kernel parameter::

    edac_panic_on_pci_pe=[0|1]

  Enable::

    echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

  Disable::

    echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe


EDAC device type
----------------

In the header file, edac_pci.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
will appear.

There is a three-level tree beneath the above ``edac`` directory.
For example, the ``test_device_edac`` device (found at the
http://bluesmoke.sourceforge.net website) installs itself as::

  /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more
``instance`` directories.

The standard default controls are:

==============  =======================================================
log_ce          boolean to log CE events
log_ue          boolean to log UE events
panic_on_ue     boolean to ``panic`` the system if a UE is encountered
                (default off, can be set true via startup script)
poll_msec       time period between POLL cycles for events
==============  =======================================================

The test_device_edac device adds at least one custom control of its own:

==============  ==================================================
test_bits       which in the current test driver does nothing but
                show how it is installed. A ported driver can
                add one or more such controls and/or attributes
                for specific uses.
                One out-of-tree driver uses controls here to allow
                for ERROR INJECTION operations to hardware
                injection registers
==============  ==================================================

The symlink points to the 'struct dev' that is registered for this edac_device.
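
To inspect the standard controls of an edac_device in one go, something
like the following could be used. It is a sketch under assumptions:
``show_edac_controls`` is a hypothetical name, and the directory passed to
it is whatever directory the device registered under
``/sys/devices/system/edac``.

```shell
#!/bin/sh
# Hypothetical helper: print the standard controls of one edac_device
# directory as name=value pairs. Controls that are missing or not
# readable are skipped silently.
show_edac_controls() {
    dev_dir=$1
    for ctl in log_ce log_ue panic_on_ue poll_msec; do
        if [ -r "$dev_dir/$ctl" ]; then
            printf '%s=%s\n' "$ctl" "$(cat "$dev_dir/$ctl")"
        fi
    done
}
```

For the test device described above, this would be invoked as
``show_edac_controls /sys/devices/system/edac/test-instance``.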

Instances
---------

One or more instance directories are present. For the ``test_device_edac``
case:

+----------------+
| test-instance0 |
+----------------+

In this directory there are two default counter attributes, which are totals
of the counters in deeper subdirectories.

==============  ====================================
ce_count        total of CE events of subdirectories
ue_count        total of UE events of subdirectories
==============  ====================================

Blocks
------

At the lowest directory level is the ``block`` directory.
There can be 0, 1 or more blocks specified in each instance:

+-------------+
| test-block0 |
+-------------+

In this directory the default attributes are:

==============  ================================================
ce_count        counter of CE events for this ``block``
                of hardware being monitored
ue_count        counter of UE events for this ``block``
                of hardware being monitored
==============  ================================================


The ``test_device_edac`` device adds 4 attributes and 1 control:

==================  ====================================================
test-block-bits-0   for every POLL cycle this counter
                    is incremented
test-block-bits-1   every 10 cycles, this counter is bumped once,
                    and test-block-bits-0 is set to 0
test-block-bits-2   every 100 cycles, this counter is bumped once,
                    and test-block-bits-1 is set to 0
test-block-bits-3   every 1000 cycles, this counter is bumped once,
                    and test-block-bits-2 is set to 0
==================  ====================================================


==================  ====================================================
reset-counters      writing anything to this control will
                    reset all the above counters.
==================  ====================================================


The ``test_device_edac`` driver can serve as a starting point for others
who need to create drivers for their own hardware systems.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``skx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.

1) There is one Memory Controller per Quick Path Interconnect
   (QPI). At the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels.
   The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The minimum known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as the minimum unit, the driver
   sequentially maps channel/DIMM into different csrows.

   For example, supposing the following layout::

     Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
       dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
       dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
     Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
       dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
     Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
       dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

     csrow0: channel 0, dimm0
     csrow1: channel 0, dimm1
     csrow2: channel 1, dimm0
     csrow3: channel 2, dimm0

   That is, it exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The MC has the ability to inject errors to test drivers.
   The drivers implement this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes, under
   ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
     Controls the error injection mask register. It is possible to specify
     several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

     Each of the above values can be set to "any" to match any valid value.

     At driver init, all values are set to any.

     For example, to generate an error at rank 1 of dimm 2, for any channel,
     any bank, any page, any column::

         echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
         echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

     To return to the default behaviour of matching any, you can do::

         echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
         echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
     specifies which bits will be corrupted.

   - ``inject_section``:
     specifies what ECC cache section will get the error::

         3 for both
         2 for the highest
         1 for the lowest

   - ``inject_type``:
     specifies the type of error, being a combination of the following bits::

         bit 0 - repeat
         bit 1 - ecc
         bit 2 - parity

   - ``inject_enable``:
     starts the error generation when a non-zero value is written.

   All inject vars can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

     echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
     echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
     echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
     echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
     echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
     dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like::

     EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Corrected Error memory register counters

   Those newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a coarser
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

     $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
     /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
     0
     /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
     0
     /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
     0

   What happens here is that errors on different csrows, but at the same
   dimm number will increment the same counter.
   So, in this memory mapping::

     csrow0: channel 0, dimm0
     csrow1: channel 0, dimm1
     csrow2: channel 1, dimm0
     csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm1 for an error at the second dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm2 for an error at the third dimm at either
   csrow0, csrow2 or csrow3.

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMM, this is counted by software, it is
   possible that some errors could be lost.
   With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
      Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
      Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
      Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
      Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh