.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers meant to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)
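As a rough illustration of the availability measure described above, the
following Python sketch converts the downtime observed over a period into an
availability percentage. The function name and the sample figures are
hypothetical and are not taken from any kernel interface.

```python
def availability_percent(downtime_seconds, period_seconds):
    """Return the percentage of the period during which the system was up."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime must lie within the period")
    uptime = period_seconds - downtime_seconds
    return 100.0 * uptime / period_seconds


# Hypothetical figures: one hour of downtime in a 30-day month.
MONTH_SECONDS = 30 * 24 * 3600
print(f"{availability_percent(3600, MONTH_SECONDS):.3f}%")
```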

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most common ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journaling file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify if the probability of hardware errors is increasing and, in that
case, do preventive maintenance to replace a degraded component while
those errors are still correctable.
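The idea of watching error counts for an upward trend can be sketched as
follows. This is only an illustration, not how any kernel or userspace tool
implements it: the function name, the sample format and the ``factor``
threshold are all assumptions, and counter resets are not handled.

```python
def error_rate_increasing(samples, factor=2.0):
    """Compare the error rate of the newest interval with the previous one.

    samples: list of (timestamp_seconds, cumulative_error_count) tuples,
    oldest first. Returns True when the newest rate exceeds the previous
    rate by more than `factor` (a hypothetical policy threshold).
    """
    if len(samples) < 3:
        return False  # two intervals are needed for a comparison

    def rate(older, newer):
        elapsed = newer[0] - older[0]
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        return (newer[1] - older[1]) / elapsed

    previous = rate(samples[-3], samples[-2])
    latest = rate(samples[-2], samples[-1])
    if previous == 0:
        return latest > 0
    return latest > factor * previous
```

A monitoring script could feed it periodic readings of a cumulative error
counter and alert the administrator when it returns ``True``.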

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors in a bit packet
is below a threshold. If the number of errors is above that threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors exceeded the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, possibly replacing the affected hardware with a hot spare,
  if available.

  When an error happens on a userspace process, it is also possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.
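The categories above can be summarized as a small decision function. The
sketch below only restates the definitions in this section; the names
(``Severity``, ``classify``) and the boolean inputs are hypothetical, and a
real handling policy is considerably more involved.

```python
from enum import Enum


class Severity(Enum):
    CORRECTED = "CE"
    NON_FATAL = "non-fatal UE"
    FATAL = "fatal UE"


def classify(corrected, component_in_use, critical):
    """Map an error report to one of the categories described above.

    corrected:        the detection mechanism fixed the error (CE)
    component_in_use: the affected hardware currently holds live state
    critical:         the affected state belongs to a critical component,
                      for example kernel memory
    """
    if corrected:
        return Severity.CORRECTED
    if not component_in_use:
        # For example an unused memory bank or a powered-down CPU.
        return Severity.NON_FATAL
    if critical:
        return Severity.FATAL
    # For example a userspace process that can be killed and restarted.
    return Severity.NON_FATAL
```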

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

This is typically very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

	Memory Device
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 16384 MB
		Form Factor: SODIMM
		Set: None
		Locator: ChannelA-DIMM0
		Bank Locator: BANK 0
		Type: DDR4
		Type Detail: Synchronous
		Speed: 2133 MHz
		Rank: 2
		Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory slot labeled "BANK 0", as given by the *bank locator* field.
Note that, on such a system, the *total width* is equal to the
*data width*. It means that such a memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

	Memory Device
		Array Handle: 0x1000
		Error Information Handle: Not Provided
		Total Width: 72 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: 1
		Locator: DIMM_A1
		Bank Locator: Not Specified
		Type: DDR3
		Type Detail: Synchronous Registered (Buffered)
		Speed: 1600 MHz
		Rank: 2
		Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory slot
labeled "DIMM_A1", as given by the *locator* field. Note that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).
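The comparison between *total width* and *data width* can be automated. The
sketch below parses text in the format shown in the two ``dmidecode``
listings above; the helper names are hypothetical, and the parser assumes
that records start with a ``Memory Device`` line and end at a blank line.

```python
import re


def parse_memory_devices(dmidecode_text):
    """Split dmidecode output into one dict per "Memory Device" record."""
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def width_bits(value):
    """Return the leading integer of a value such as "72 bits", else None."""
    match = re.match(r"(\d+)\s+bits", value or "")
    return int(match.group(1)) if match else None


def has_ecc_bits(device):
    """True when the total width is larger than the data width."""
    total = width_bits(device.get("Total Width"))
    data = width_bits(device.get("Data Width"))
    return total is not None and data is not None and total > data


SAMPLE = """\
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1
    Bank Locator: Not Specified
"""

for dev in parse_memory_devices(SAMPLE):
    print(dev["Locator"], "ECC" if has_ecc_bits(dev) else "non-ECC")
```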

To make things even worse, it is not uncommon for systems with different
labels on their boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits which are used for the error detection and correction mechanisms
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
to the memory modules.

On read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.
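The write and read paths described above can be illustrated with a
scaled-down code. The sketch below uses an extended Hamming code with 4 data
bits and 8 total bits instead of the 64/72 bits of a real module; it is an
assumption chosen for readability and does not reproduce the code used by
any particular memory controller.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit, indexes 1, 2 and 4 hold the
    Hamming parity bits, and indexes 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    for data_index, position in enumerate((3, 5, 6, 7)):
        code[position] = (nibble >> data_index) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status) where status is "ok", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall_parity_error = sum(code) % 2 == 1

    if syndrome == 0 and not overall_parity_error:
        status = "ok"
    elif overall_parity_error:
        # Odd number of flipped bits: assume one and repair it. With a
        # zero syndrome the flipped bit is the overall parity at index 0.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        status = "UE"

    nibble = sum(code[p] << i for i, p in enumerate((3, 5, 6, 7)))
    return nibble, status


word = encode(0b1011)
word[5] ^= 1          # one flipped bit: corrected on read
print(decode(word))
word[6] ^= 1          # a second flipped bit: detected, not corrected
print(decode(word))
```

As in the text, ``decode`` always returns a data word, even when the status
is ``"UE"``; the caller decides what to do with it.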

The information about the CE/UE errors is stored in some special registers
at the memory controller and can be accessed by reading such registers,
either by the BIOS, by some special CPUs or by the Linux EDAC driver. On x86
64-bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Note that several memory controllers allow operation in a
  mode called "Lock-Step", where it groups two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In this mode, the same data is written to two memory modules. On read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such a configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/x86/x86_64/machinecheck.rst in the kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, in
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can** be, but is not necessarily, a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

	broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.
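To see which devices have the attribute set, a script can read it for every
PCI device. The sketch below is a hypothetical helper, not part of EDAC. It
assumes that ``/sys/bus/pci/devices`` holds one entry per PCI device, named
by its address, and that a non-zero value means the attribute is set.

```python
import os


def devices_with_broken_parity(root="/sys/bus/pci/devices"):
    """Return the PCI addresses whose broken_parity_status reads non-zero.

    `root` is assumed to contain one entry per PCI device, named by its
    address (for example 0000:00:1f.3).
    """
    flagged = []
    for address in sorted(os.listdir(root)):
        path = os.path.join(root, address, "broken_parity_status")
        try:
            with open(path) as handle:
                value = handle.read().strip()
        except OSError:
            continue  # attribute missing or unreadable for this device
        if value not in ("", "0"):
            flagged.append(address)
    return flagged
```

Calling ``devices_with_broken_parity()`` on a running system would list the
devices for which parity scanning is skipped.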


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary. If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

	$ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.
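Whether the modules ended up loaded can be checked from ``/proc/modules``.
The helper below is a hypothetical sketch that filters that file's text for
module names containing ``edac``; drivers that are statically linked into
the kernel do not appear there.

```python
def loaded_edac_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains "edac".

    proc_modules_text is the content of /proc/modules, where each line
    starts with the module name followed by space-separated fields.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names
```

On a running system it could be called as
``loaded_edac_modules(open("/proc/modules").read())``.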


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the /sys/devices/system/edac directory.

Within this directory there currently reside 2 components:

	======= ==============================
	mc	memory controller(s) system
	pci	PCI control and status system
	======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:

	+------------+-----------------------+
	| CS Rows    |       Channels        |
	+------------+-----------+-----------+
	|            |  ``ch0``  |  ``ch1``  |
	+============+===========+===========+
	|            |**DIMM_A0**|**DIMM_B0**|
	+------------+-----------+-----------+
	| ``csrow0`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow1`` |   rank1   |   rank1   |
	+------------+-----------+-----------+
	|            |**DIMM_A1**|**DIMM_B1**|
	+------------+-----------+-----------+
	| ``csrow2`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow3`` |   rank1   |   rank1   |
	+------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

	+---------+---------+
	| DIMM_A0 | DIMM_B0 |
	+---------+---------+
	| DIMM_A1 | DIMM_B1 |
	+---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module, see
``rankX`` directories below.
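For the two-channel example table above, the translation from a csrow and
channel pair to a slot label can be written down directly. The function
below is a hypothetical sketch that is valid only for that example layout;
real boards need board-specific label tables.

```python
def locate(csrow, channel, ranks_per_dimm=2):
    """Map a (csrow, channel) pair to (slot label, rank) for the example.

    Assumes the two-channel layout shown in the table: channel 0 feeds the
    slots labeled A, channel 1 the slots labeled B, and each slot owns
    ranks_per_dimm consecutive csrows.
    """
    if channel not in (0, 1):
        raise ValueError("the example layout has only channels 0 and 1")
    if csrow < 0:
        raise ValueError("csrow must be non-negative")
    slot_index, rank = divmod(csrow, ranks_per_dimm)
    label = f"DIMM_{'AB'[channel]}{slot_index}"
    return label, rank
```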

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX`` directory, where ``X`` is the csrow index::

	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.
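The same reasoning can be applied programmatically by listing which
``csrowX`` entries exist under an ``mcX`` directory. The helpers below are
hypothetical and assume, as in the example, that each slot position owns two
consecutive csrow indexes.

```python
import os
import re


def populated_csrows(mc_dir):
    """Return the sorted csrow indexes that exist as csrowX entries."""
    indexes = []
    for name in os.listdir(mc_dir):
        match = re.fullmatch(r"csrow(\d+)", name)
        if match:
            indexes.append(int(match.group(1)))
    return sorted(indexes)


def ranks_per_slot(csrows, ranks_per_dimm=2):
    """Group csrow indexes by slot position and count the populated ranks.

    Assumes, as in the example, that each slot position owns
    ranks_per_dimm consecutive csrow indexes.
    """
    counts = {}
    for index in csrows:
        slot = index // ranks_per_dimm
        counts[slot] = counts.get(slot, 0) + 1
    return counts
```

For the listing above, ``ranks_per_slot([0, 2, 3])`` gives one rank for slot
position 0 and two ranks for slot position 1.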

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

	Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

	/sys/devices/system/edac/
	├── mc
	│   ├── mc0
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   ├── mc1
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   └── uevent
	└── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
492*4882a593Smuzhiyunthis ``X`` memory module:
493*4882a593Smuzhiyun
494*4882a593Smuzhiyun- ``size`` - Total memory managed by this csrow attribute file
495*4882a593Smuzhiyun
496*4882a593Smuzhiyun	This attribute file displays, in count of megabytes, the memory
497*4882a593Smuzhiyun	that this csrow contains.
498*4882a593Smuzhiyun
499*4882a593Smuzhiyun- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
500*4882a593Smuzhiyun
501*4882a593Smuzhiyun	This attribute file displays the total count of uncorrectable
502*4882a593Smuzhiyun	errors that have occurred on this DIMM. If panic_on_ue is set
503*4882a593Smuzhiyun	this counter will not have a chance to increment, since EDAC
504*4882a593Smuzhiyun	will panic the system.
505*4882a593Smuzhiyun
506*4882a593Smuzhiyun- ``dimm_ce_count`` - Correctable Errors count attribute file
507*4882a593Smuzhiyun
508*4882a593Smuzhiyun	This attribute file displays the total count of correctable
509*4882a593Smuzhiyun	errors that have occurred on this DIMM. This count is very
510*4882a593Smuzhiyun	important to examine. CEs provide early indications that a
511*4882a593Smuzhiyun	DIMM is beginning to fail. This count field should be
512*4882a593Smuzhiyun	monitored for non-zero values and report such information
513*4882a593Smuzhiyun	to the system administrator.
514*4882a593Smuzhiyun
515*4882a593Smuzhiyun- ``dimm_dev_type``  - Device type attribute file
516*4882a593Smuzhiyun
517*4882a593Smuzhiyun	This attribute file will display what type of DRAM device is
518*4882a593Smuzhiyun	being utilized on this DIMM.
519*4882a593Smuzhiyun	Examples:
520*4882a593Smuzhiyun
521*4882a593Smuzhiyun		- x1
522*4882a593Smuzhiyun		- x2
523*4882a593Smuzhiyun		- x4
524*4882a593Smuzhiyun		- x8
525*4882a593Smuzhiyun
526*4882a593Smuzhiyun- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
527*4882a593Smuzhiyun
528*4882a593Smuzhiyun	This attribute file will display what type of Error detection
529*4882a593Smuzhiyun	and correction is being utilized.
530*4882a593Smuzhiyun
531*4882a593Smuzhiyun- ``dimm_label`` - memory module label control file
532*4882a593Smuzhiyun
533*4882a593Smuzhiyun	This control file allows this DIMM to have a label assigned
534*4882a593Smuzhiyun	to it. With this label in the module, when errors occur
535*4882a593Smuzhiyun	the output can provide the DIMM label in the system log.
536*4882a593Smuzhiyun	This becomes vital for panic events to isolate the
537*4882a593Smuzhiyun	cause of the UE event.
538*4882a593Smuzhiyun
539*4882a593Smuzhiyun	DIMM Labels must be assigned after booting, with information
540*4882a593Smuzhiyun	that correctly identifies the physical slot with its
541*4882a593Smuzhiyun	silk screen label. This information is currently very
542*4882a593Smuzhiyun	motherboard specific and determination of this information
543*4882a593Smuzhiyun	must occur in userland at this time.
544*4882a593Smuzhiyun
545*4882a593Smuzhiyun- ``dimm_location`` - location of the memory module
546*4882a593Smuzhiyun
	The location can have up to 3 levels, and describes how the
	memory controller identifies the location of a memory module.
	Depending on the type of memory and memory controller, it
	can be:

		- *csrow* and *channel* - used when the memory controller
		  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
		  controllers;
		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
557*4882a593Smuzhiyun
558*4882a593Smuzhiyun- ``dimm_mem_type`` - Memory Type attribute file
559*4882a593Smuzhiyun
	This attribute file will display what type of memory is currently
	installed in this DIMM. Normally, either buffered or unbuffered memory.
562*4882a593Smuzhiyun	Examples:
563*4882a593Smuzhiyun
564*4882a593Smuzhiyun		- Registered-DDR
565*4882a593Smuzhiyun		- Unbuffered-DDR
566*4882a593Smuzhiyun
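As a minimal sketch of how the ``dimmX`` files described above might be used
together, the shell functions below print one line per memory module and
assign a label to a module. The names ``edac_list_dimms`` and
``edac_set_dimm_label`` and the optional root-directory argument are
illustrative assumptions, not part of EDAC; only the attribute file names
come from this section.

```shell
# List every dimmX directory below an EDAC "mc" directory.
# edac_list_dimms is a hypothetical helper name; the optional argument
# lets the sketch be pointed at a copy of the tree for testing.
edac_list_dimms() (
    root=${1:-/sys/devices/system/edac/mc}
    for d in "$root"/mc*/dimm*; do
        # Skip the pattern itself when it matched nothing.
        [ -d "$d" ] || continue
        mc=${d%/*}
        printf '%s/%s: location="%s" label="%s" type="%s"\n' \
            "${mc##*/}" "${d##*/}" \
            "$(cat "$d/dimm_location" 2>/dev/null)" \
            "$(cat "$d/dimm_label" 2>/dev/null)" \
            "$(cat "$d/dimm_mem_type" 2>/dev/null)"
    done
)

# Assign a label to one module; writing needs root on a real system.
# Usage: edac_set_dimm_label <dimm directory> <label>
edac_set_dimm_label() {
    printf '%s\n' "$2" > "$1/dimm_label"
}
```

Since labels must be assigned after booting, a startup script could call
``edac_set_dimm_label`` once per slot, using a board-specific table that maps
each ``dimmX`` directory to its silk screen label.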
.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is
  called ``rankX`` and works in a similar way to the ``csrowX``
  directories. On modern Intel memory controllers, the memory
  controller identifies the memory modules directly. On such systems,
  the directory is called ``dimmX``.
571*4882a593Smuzhiyun
572*4882a593Smuzhiyun.. [#f6] There are also some ``power`` directories and ``subsystem``
573*4882a593Smuzhiyun  symlinks inside the sysfs mapping that are automatically created by
574*4882a593Smuzhiyun  the sysfs subsystem. Currently, they serve no purpose.
575*4882a593Smuzhiyun
576*4882a593Smuzhiyun``csrowX`` directories
577*4882a593Smuzhiyun----------------------
578*4882a593Smuzhiyun
579*4882a593SmuzhiyunWhen CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
580*4882a593Smuzhiyundirectories. As this API doesn't work properly for Rambus, FB-DIMMs and
581*4882a593Smuzhiyunmodern Intel Memory Controllers, this is being deprecated in favor of
582*4882a593Smuzhiyun``dimmX`` directories.
583*4882a593Smuzhiyun
584*4882a593SmuzhiyunIn the ``csrowX`` directories are EDAC control and attribute files for
585*4882a593Smuzhiyunthis ``X`` instance of csrow:
586*4882a593Smuzhiyun
587*4882a593Smuzhiyun
588*4882a593Smuzhiyun- ``ue_count`` - Total Uncorrectable Errors count attribute file
589*4882a593Smuzhiyun
590*4882a593Smuzhiyun	This attribute file displays the total count of uncorrectable
591*4882a593Smuzhiyun	errors that have occurred on this csrow. If panic_on_ue is set
592*4882a593Smuzhiyun	this counter will not have a chance to increment, since EDAC
593*4882a593Smuzhiyun	will panic the system.
594*4882a593Smuzhiyun
595*4882a593Smuzhiyun
596*4882a593Smuzhiyun- ``ce_count`` - Total Correctable Errors count attribute file
597*4882a593Smuzhiyun
598*4882a593Smuzhiyun	This attribute file displays the total count of correctable
599*4882a593Smuzhiyun	errors that have occurred on this csrow. This count is very
600*4882a593Smuzhiyun	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
604*4882a593Smuzhiyun
605*4882a593Smuzhiyun
606*4882a593Smuzhiyun- ``size_mb`` - Total memory managed by this csrow attribute file
607*4882a593Smuzhiyun
608*4882a593Smuzhiyun	This attribute file displays, in count of megabytes, the memory
609*4882a593Smuzhiyun	that this csrow contains.
610*4882a593Smuzhiyun
611*4882a593Smuzhiyun
612*4882a593Smuzhiyun- ``mem_type`` - Memory Type attribute file
613*4882a593Smuzhiyun
614*4882a593Smuzhiyun	This attribute file will display what type of memory is currently
615*4882a593Smuzhiyun	on this csrow. Normally, either buffered or unbuffered memory.
616*4882a593Smuzhiyun	Examples:
617*4882a593Smuzhiyun
618*4882a593Smuzhiyun		- Registered-DDR
619*4882a593Smuzhiyun		- Unbuffered-DDR
620*4882a593Smuzhiyun
621*4882a593Smuzhiyun
622*4882a593Smuzhiyun- ``edac_mode`` - EDAC Mode of operation attribute file
623*4882a593Smuzhiyun
624*4882a593Smuzhiyun	This attribute file will display what type of Error detection
625*4882a593Smuzhiyun	and correction is being utilized.
626*4882a593Smuzhiyun
627*4882a593Smuzhiyun
628*4882a593Smuzhiyun- ``dev_type`` - Device type attribute file
629*4882a593Smuzhiyun
630*4882a593Smuzhiyun	This attribute file will display what type of DRAM device is
631*4882a593Smuzhiyun	being utilized on this DIMM.
632*4882a593Smuzhiyun	Examples:
633*4882a593Smuzhiyun
634*4882a593Smuzhiyun		- x1
635*4882a593Smuzhiyun		- x2
636*4882a593Smuzhiyun		- x4
637*4882a593Smuzhiyun		- x8
638*4882a593Smuzhiyun
639*4882a593Smuzhiyun
640*4882a593Smuzhiyun- ``ch0_ce_count`` - Channel 0 CE Count attribute file
641*4882a593Smuzhiyun
642*4882a593Smuzhiyun	This attribute file will display the count of CEs on this
643*4882a593Smuzhiyun	DIMM located in channel 0.
644*4882a593Smuzhiyun
645*4882a593Smuzhiyun
646*4882a593Smuzhiyun- ``ch0_ue_count`` - Channel 0 UE Count attribute file
647*4882a593Smuzhiyun
648*4882a593Smuzhiyun	This attribute file will display the count of UEs on this
649*4882a593Smuzhiyun	DIMM located in channel 0.
650*4882a593Smuzhiyun
651*4882a593Smuzhiyun
652*4882a593Smuzhiyun- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
653*4882a593Smuzhiyun
654*4882a593Smuzhiyun
655*4882a593Smuzhiyun	This control file allows this DIMM to have a label assigned
656*4882a593Smuzhiyun	to it. With this label in the module, when errors occur
657*4882a593Smuzhiyun	the output can provide the DIMM label in the system log.
658*4882a593Smuzhiyun	This becomes vital for panic events to isolate the
659*4882a593Smuzhiyun	cause of the UE event.
660*4882a593Smuzhiyun
661*4882a593Smuzhiyun	DIMM Labels must be assigned after booting, with information
662*4882a593Smuzhiyun	that correctly identifies the physical slot with its
663*4882a593Smuzhiyun	silk screen label. This information is currently very
664*4882a593Smuzhiyun	motherboard specific and determination of this information
665*4882a593Smuzhiyun	must occur in userland at this time.
666*4882a593Smuzhiyun
667*4882a593Smuzhiyun
668*4882a593Smuzhiyun- ``ch1_ce_count`` - Channel 1 CE Count attribute file
669*4882a593Smuzhiyun
670*4882a593Smuzhiyun
671*4882a593Smuzhiyun	This attribute file will display the count of CEs on this
672*4882a593Smuzhiyun	DIMM located in channel 1.
673*4882a593Smuzhiyun
674*4882a593Smuzhiyun
675*4882a593Smuzhiyun- ``ch1_ue_count`` - Channel 1 UE Count attribute file
676*4882a593Smuzhiyun
677*4882a593Smuzhiyun
	This attribute file will display the count of UEs on this
	DIMM located in channel 1.
680*4882a593Smuzhiyun
681*4882a593Smuzhiyun
682*4882a593Smuzhiyun- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
683*4882a593Smuzhiyun
684*4882a593Smuzhiyun	This control file allows this DIMM to have a label assigned
685*4882a593Smuzhiyun	to it. With this label in the module, when errors occur
686*4882a593Smuzhiyun	the output can provide the DIMM label in the system log.
687*4882a593Smuzhiyun	This becomes vital for panic events to isolate the
688*4882a593Smuzhiyun	cause of the UE event.
689*4882a593Smuzhiyun
690*4882a593Smuzhiyun	DIMM Labels must be assigned after booting, with information
691*4882a593Smuzhiyun	that correctly identifies the physical slot with its
692*4882a593Smuzhiyun	silk screen label. This information is currently very
693*4882a593Smuzhiyun	motherboard specific and determination of this information
694*4882a593Smuzhiyun	must occur in userland at this time.
695*4882a593Smuzhiyun
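The advice above to monitor ``ce_count`` for non-zero values can be sketched
as a small shell function. This is only an illustration: the name
``edac_report_ce`` and its optional root-directory argument are assumptions,
and a real deployment would decide for itself how to notify the
administrator.

```shell
# Print every csrow whose correctable-error count is non-zero.
# edac_report_ce is a hypothetical helper; the optional argument is the
# EDAC "mc" directory (defaults to the real sysfs location).
edac_report_ce() (
    root=${1:-/sys/devices/system/edac/mc}
    status=0
    for f in "$root"/mc*/csrow*/ce_count; do
        # Skip the pattern itself when it matched nothing.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        if [ "$count" -gt 0 ]; then
            row=${f%/ce_count}
            mc=${row%/*}
            echo "${mc##*/}/${row##*/}: $count correctable error(s)"
            status=1
        fi
    done
    # A return status of 1 signals that at least one csrow reported CEs.
    return "$status"
)
```

The function could be run periodically, with its output forwarded to the
administrator whenever the return status is non-zero.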
696*4882a593Smuzhiyun
697*4882a593SmuzhiyunSystem Logging
698*4882a593Smuzhiyun--------------
699*4882a593Smuzhiyun
700*4882a593SmuzhiyunIf logging for UEs and CEs is enabled, then system logs will contain
701*4882a593Smuzhiyuninformation indicating that errors have been detected::
702*4882a593Smuzhiyun
703*4882a593Smuzhiyun  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
704*4882a593Smuzhiyun  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
705*4882a593Smuzhiyun
706*4882a593Smuzhiyun
707*4882a593SmuzhiyunThe structure of the message is:
708*4882a593Smuzhiyun
	+---------------------------------------+-------------+
	| Content                               | Example     |
	+=======================================+=============+
	| The memory controller                 | MC0         |
	+---------------------------------------+-------------+
	| Error type                            | CE          |
	+---------------------------------------+-------------+
	| Memory page                           | 0x283       |
	+---------------------------------------+-------------+
	| Offset in the page                    | 0xce0       |
	+---------------------------------------+-------------+
	| The byte granularity                  | grain 8     |
	| or resolution of the error            |             |
	+---------------------------------------+-------------+
	| The error syndrome                    | 0x6ec3      |
	+---------------------------------------+-------------+
	| Memory row                            | row 0       |
	+---------------------------------------+-------------+
	| Memory channel                        | channel 1   |
	+---------------------------------------+-------------+
	| DIMM label, if set prior              | DIMM_B1     |
	+---------------------------------------+-------------+
	| And then an optional, driver-specific |             |
	| message that may have additional      |             |
	| information.                          |             |
	+---------------------------------------+-------------+
735*4882a593Smuzhiyun
UEs and CEs reported with no info contain only the memory controller, the
error type, a notice of "no info" and then an optional, driver-specific
error message.
739*4882a593Smuzhiyun
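To illustrate the message structure, the following sketch splits one log line
into ``key=value`` pairs. It assumes exactly the layout of the example lines
above and prints nothing for lines that do not match it (such as "no info"
reports); the function name ``edac_parse_line`` and the key names are
hypothetical.

```shell
# Split one EDAC memory error line into "key=value" lines.
# edac_parse_line is a hypothetical helper that only understands the
# layout shown in the example messages of this section.
edac_parse_line() {
    printf '%s\n' "$1" | awk '{
        start = index($0, "EDAC MC")
        if (start == 0) next
        rest = substr($0, start + 5)          # drop everything up to "EDAC "
        if (split(rest, part, ", ") < 6) next # not the full layout
        pos = index(rest, ", channel ")
        if (pos == 0) next
        split(part[1], head, " ")             # "MC0:" "CE" "page" "0x283"
        sub(/:$/, "", head[1])
        print "mc=" head[1]
        print "type=" head[2]
        print "page=" head[4]
        for (i = 2; i <= 5; i++) {            # offset, grain, syndrome, row
            split(part[i], kv, " ")
            print kv[1] "=" kv[2]
        }
        tail = substr(rest, pos + 10)         # text after ", channel "
        split(tail, quoted, "\"")             # channel, label, driver text
        split(quoted[1], chan, " ")
        print "channel=" chan[1]
        print "label=" quoted[2]
        driver = quoted[3]
        sub(/^: */, "", driver)
        print "driver=" driver
    }'
}
```

For the first example line, the sketch reports ``syndrome=0x6ec3``,
``channel=1`` and ``label=DIMM_B1``, matching the table above.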
740*4882a593Smuzhiyun
741*4882a593SmuzhiyunPCI Bus Parity Detection
742*4882a593Smuzhiyun------------------------
743*4882a593Smuzhiyun
On Header Type 00 devices, the primary status register is checked for
any parity error, regardless of whether parity is enabled on the device
or not. (The spec indicates parity is generated in some cases.) On
Header Type 01 bridges, the secondary status register is also checked
to see whether a parity error occurred on the bus on the other side of
the bridge.
749*4882a593Smuzhiyun
750*4882a593Smuzhiyun
751*4882a593SmuzhiyunSysfs configuration
752*4882a593Smuzhiyun-------------------
753*4882a593Smuzhiyun
754*4882a593SmuzhiyunUnder ``/sys/devices/system/edac/pci`` are control and attribute files as
755*4882a593Smuzhiyunfollows:
756*4882a593Smuzhiyun
757*4882a593Smuzhiyun
758*4882a593Smuzhiyun- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
759*4882a593Smuzhiyun
760*4882a593Smuzhiyun	This control file enables or disables the PCI Bus Parity scanning
761*4882a593Smuzhiyun	operation. Writing a 1 to this file enables the scanning. Writing
762*4882a593Smuzhiyun	a 0 to this file disables the scanning.
763*4882a593Smuzhiyun
764*4882a593Smuzhiyun	Enable::
765*4882a593Smuzhiyun
766*4882a593Smuzhiyun		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
767*4882a593Smuzhiyun
768*4882a593Smuzhiyun	Disable::
769*4882a593Smuzhiyun
770*4882a593Smuzhiyun		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
771*4882a593Smuzhiyun
772*4882a593Smuzhiyun
773*4882a593Smuzhiyun- ``pci_parity_count`` - Parity Count
774*4882a593Smuzhiyun
775*4882a593Smuzhiyun	This attribute file will display the number of parity errors that
776*4882a593Smuzhiyun	have been detected.
777*4882a593Smuzhiyun
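The two files above can be wrapped in a small helper, sketched below. The
function name ``edac_pci_parity``, its ``on``/``off``/``status`` arguments and
the optional directory argument are assumptions made for this illustration;
only the file names and the 0/1 values come from this section.

```shell
# Enable or disable PCI parity scanning, then show the current state.
# edac_pci_parity is a hypothetical helper name. The optional second
# argument is the EDAC "pci" directory, so a copy can be used for testing.
edac_pci_parity() (
    dir=${2:-/sys/devices/system/edac/pci}
    case $1 in
        on)     echo 1 > "$dir/check_pci_parity" || return 1 ;;
        off)    echo 0 > "$dir/check_pci_parity" || return 1 ;;
        status) ;;
        *)      echo "usage: edac_pci_parity on|off|status [dir]" >&2
                return 2 ;;
    esac
    echo "check_pci_parity=$(cat "$dir/check_pci_parity")"
    echo "pci_parity_count=$(cat "$dir/pci_parity_count")"
)
```

Writing to ``check_pci_parity`` needs root permission on a real system;
``edac_pci_parity status`` only reads the two files.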
778*4882a593Smuzhiyun
779*4882a593SmuzhiyunModule parameters
780*4882a593Smuzhiyun-----------------
781*4882a593Smuzhiyun
782*4882a593Smuzhiyun- ``edac_mc_panic_on_ue`` - Panic on UE control file
783*4882a593Smuzhiyun
784*4882a593Smuzhiyun	An uncorrectable error will cause a machine panic.  This is usually
785*4882a593Smuzhiyun	desirable.  It is a bad idea to continue when an uncorrectable error
786*4882a593Smuzhiyun	occurs - it is indeterminate what was uncorrected and the operating
787*4882a593Smuzhiyun	system context might be so mangled that continuing will lead to further
788*4882a593Smuzhiyun	corruption. If the kernel has MCE configured, then EDAC will never
789*4882a593Smuzhiyun	notice the UE.
790*4882a593Smuzhiyun
791*4882a593Smuzhiyun	LOAD TIME::
792*4882a593Smuzhiyun
793*4882a593Smuzhiyun		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
794*4882a593Smuzhiyun
795*4882a593Smuzhiyun	RUN TIME::
796*4882a593Smuzhiyun
797*4882a593Smuzhiyun		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
798*4882a593Smuzhiyun
799*4882a593Smuzhiyun
800*4882a593Smuzhiyun- ``edac_mc_log_ue`` - Log UE control file
801*4882a593Smuzhiyun
802*4882a593Smuzhiyun
803*4882a593Smuzhiyun	Generate kernel messages describing uncorrectable errors.  These errors
804*4882a593Smuzhiyun	are reported through the system message log system.  UE statistics
805*4882a593Smuzhiyun	will be accumulated even when UE logging is disabled.
806*4882a593Smuzhiyun
807*4882a593Smuzhiyun	LOAD TIME::
808*4882a593Smuzhiyun
809*4882a593Smuzhiyun		module/kernel parameter: edac_mc_log_ue=[0|1]
810*4882a593Smuzhiyun
811*4882a593Smuzhiyun	RUN TIME::
812*4882a593Smuzhiyun
813*4882a593Smuzhiyun		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
814*4882a593Smuzhiyun
815*4882a593Smuzhiyun
816*4882a593Smuzhiyun- ``edac_mc_log_ce`` - Log CE control file
817*4882a593Smuzhiyun
818*4882a593Smuzhiyun
819*4882a593Smuzhiyun	Generate kernel messages describing correctable errors.  These
820*4882a593Smuzhiyun	errors are reported through the system message log system.
821*4882a593Smuzhiyun	CE statistics will be accumulated even when CE logging is disabled.
822*4882a593Smuzhiyun
823*4882a593Smuzhiyun	LOAD TIME::
824*4882a593Smuzhiyun
825*4882a593Smuzhiyun		module/kernel parameter: edac_mc_log_ce=[0|1]
826*4882a593Smuzhiyun
827*4882a593Smuzhiyun	RUN TIME::
828*4882a593Smuzhiyun
829*4882a593Smuzhiyun		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
830*4882a593Smuzhiyun
831*4882a593Smuzhiyun
832*4882a593Smuzhiyun- ``edac_mc_poll_msec`` - Polling period control file
833*4882a593Smuzhiyun
834*4882a593Smuzhiyun
	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default. Systems that require all the bandwidth they can get may
	increase this.
841*4882a593Smuzhiyun
842*4882a593Smuzhiyun	LOAD TIME::
843*4882a593Smuzhiyun
		module/kernel parameter: edac_mc_poll_msec=<milliseconds>
845*4882a593Smuzhiyun
846*4882a593Smuzhiyun	RUN TIME::
847*4882a593Smuzhiyun
848*4882a593Smuzhiyun		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
849*4882a593Smuzhiyun
850*4882a593Smuzhiyun
851*4882a593Smuzhiyun- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
852*4882a593Smuzhiyun
853*4882a593Smuzhiyun
854*4882a593Smuzhiyun	This control file enables or disables panicking when a parity
855*4882a593Smuzhiyun	error has been detected.
856*4882a593Smuzhiyun
857*4882a593Smuzhiyun
858*4882a593Smuzhiyun	module/kernel parameter::
859*4882a593Smuzhiyun
860*4882a593Smuzhiyun			edac_panic_on_pci_pe=[0|1]
861*4882a593Smuzhiyun
862*4882a593Smuzhiyun	Enable::
863*4882a593Smuzhiyun
864*4882a593Smuzhiyun		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
865*4882a593Smuzhiyun
866*4882a593Smuzhiyun	Disable::
867*4882a593Smuzhiyun
868*4882a593Smuzhiyun		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
869*4882a593Smuzhiyun
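As a rough sketch, the run-time side of these parameters can be inspected and
changed with two small shell functions. The names ``edac_show_params`` and
``edac_set_poll_msec`` and the optional directory argument are hypothetical;
the parameter path is the one used in the examples above. The check below
only rejects non-numeric input, and the kernel may still refuse values it
considers invalid.

```shell
# Print the current value of every parameter file in the directory.
# edac_show_params is a hypothetical helper name.
edac_show_params() (
    dir=${1:-/sys/module/edac_core/parameters}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        echo "${f##*/}=$(cat "$f")"
    done
)

# Set the polling period after checking that it is a plain number.
# Usage: edac_set_poll_msec <milliseconds> [parameter directory]
edac_set_poll_msec() (
    dir=${2:-/sys/module/edac_core/parameters}
    case $1 in
        ''|*[!0-9]*)
            echo "poll period must be a whole number of milliseconds" >&2
            return 2 ;;
    esac
    echo "$1" > "$dir/edac_mc_poll_msec"
)
```

For example, ``edac_set_poll_msec 2000`` followed by ``edac_show_params``
would show the new polling period next to the logging and panic settings.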
870*4882a593Smuzhiyun
871*4882a593Smuzhiyun
872*4882a593SmuzhiyunEDAC device type
873*4882a593Smuzhiyun----------------
874*4882a593Smuzhiyun
875*4882a593SmuzhiyunIn the header file, edac_pci.h, there is a series of edac_device structures
876*4882a593Smuzhiyunand APIs for the EDAC_DEVICE.
877*4882a593Smuzhiyun
878*4882a593SmuzhiyunUser space access to an edac_device is through the sysfs interface.
879*4882a593Smuzhiyun
880*4882a593SmuzhiyunAt the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
881*4882a593Smuzhiyunwill appear.
882*4882a593Smuzhiyun
There is a three-level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::
886*4882a593Smuzhiyun
887*4882a593Smuzhiyun	/sys/devices/system/edac/test-instance
888*4882a593Smuzhiyun
In this directory are various controls, a symlink and one or more ``instance``
directories.
891*4882a593Smuzhiyun
892*4882a593SmuzhiyunThe standard default controls are:
893*4882a593Smuzhiyun
894*4882a593Smuzhiyun	==============	=======================================================
895*4882a593Smuzhiyun	log_ce		boolean to log CE events
896*4882a593Smuzhiyun	log_ue		boolean to log UE events
897*4882a593Smuzhiyun	panic_on_ue	boolean to ``panic`` the system if an UE is encountered
898*4882a593Smuzhiyun			(default off, can be set true via startup script)
899*4882a593Smuzhiyun	poll_msec	time period between POLL cycles for events
900*4882a593Smuzhiyun	==============	=======================================================
901*4882a593Smuzhiyun
The ``test_device_edac`` device adds at least one custom control of its own:
903*4882a593Smuzhiyun
904*4882a593Smuzhiyun	==============	==================================================
905*4882a593Smuzhiyun	test_bits	which in the current test driver does nothing but
906*4882a593Smuzhiyun			show how it is installed. A ported driver can
907*4882a593Smuzhiyun			add one or more such controls and/or attributes
908*4882a593Smuzhiyun			for specific uses.
909*4882a593Smuzhiyun			One out-of-tree driver uses controls here to allow
910*4882a593Smuzhiyun			for ERROR INJECTION operations to hardware
911*4882a593Smuzhiyun			injection registers
912*4882a593Smuzhiyun	==============	==================================================
913*4882a593Smuzhiyun
914*4882a593SmuzhiyunThe symlink points to the 'struct dev' that is registered for this edac_device.
915*4882a593Smuzhiyun
916*4882a593SmuzhiyunInstances
917*4882a593Smuzhiyun---------
918*4882a593Smuzhiyun
919*4882a593SmuzhiyunOne or more instance directories are present. For the ``test_device_edac``
920*4882a593Smuzhiyuncase:
921*4882a593Smuzhiyun
922*4882a593Smuzhiyun	+----------------+
923*4882a593Smuzhiyun	| test-instance0 |
924*4882a593Smuzhiyun	+----------------+
925*4882a593Smuzhiyun
926*4882a593Smuzhiyun
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
929*4882a593Smuzhiyun
930*4882a593Smuzhiyun	==============	====================================
931*4882a593Smuzhiyun	ce_count	total of CE events of subdirectories
932*4882a593Smuzhiyun	ue_count	total of UE events of subdirectories
933*4882a593Smuzhiyun	==============	====================================
934*4882a593Smuzhiyun
935*4882a593SmuzhiyunBlocks
936*4882a593Smuzhiyun------
937*4882a593Smuzhiyun
938*4882a593SmuzhiyunAt the lowest directory level is the ``block`` directory. There can be 0, 1
939*4882a593Smuzhiyunor more blocks specified in each instance:
940*4882a593Smuzhiyun
941*4882a593Smuzhiyun	+-------------+
942*4882a593Smuzhiyun	| test-block0 |
943*4882a593Smuzhiyun	+-------------+
944*4882a593Smuzhiyun
945*4882a593SmuzhiyunIn this directory the default attributes are:
946*4882a593Smuzhiyun
947*4882a593Smuzhiyun	==============	================================================
948*4882a593Smuzhiyun	ce_count	which is counter of CE events for this ``block``
949*4882a593Smuzhiyun			of hardware being monitored
950*4882a593Smuzhiyun	ue_count	which is counter of UE events for this ``block``
951*4882a593Smuzhiyun			of hardware being monitored
952*4882a593Smuzhiyun	==============	================================================
953*4882a593Smuzhiyun
954*4882a593Smuzhiyun
955*4882a593SmuzhiyunThe ``test_device_edac`` device adds 4 attributes and 1 control:
956*4882a593Smuzhiyun
957*4882a593Smuzhiyun	================== ====================================================
958*4882a593Smuzhiyun	test-block-bits-0	for every POLL cycle this counter
959*4882a593Smuzhiyun				is incremented
960*4882a593Smuzhiyun	test-block-bits-1	every 10 cycles, this counter is bumped once,
961*4882a593Smuzhiyun				and test-block-bits-0 is set to 0
962*4882a593Smuzhiyun	test-block-bits-2	every 100 cycles, this counter is bumped once,
963*4882a593Smuzhiyun				and test-block-bits-1 is set to 0
964*4882a593Smuzhiyun	test-block-bits-3	every 1000 cycles, this counter is bumped once,
965*4882a593Smuzhiyun				and test-block-bits-2 is set to 0
966*4882a593Smuzhiyun	================== ====================================================
967*4882a593Smuzhiyun
968*4882a593Smuzhiyun
969*4882a593Smuzhiyun	================== ====================================================
970*4882a593Smuzhiyun	reset-counters		writing ANY thing to this control will
971*4882a593Smuzhiyun				reset all the above counters.
972*4882a593Smuzhiyun	================== ====================================================
973*4882a593Smuzhiyun
974*4882a593Smuzhiyun
Studying the ``test_device_edac`` driver should enable others to create their
own drivers for their hardware systems.
977*4882a593Smuzhiyun
978*4882a593SmuzhiyunThe ``test_device_edac`` sample driver is located at the
979*4882a593Smuzhiyunhttp://bluesmoke.sourceforge.net project site for EDAC.
980*4882a593Smuzhiyun
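The three-level layout (device, instance, block) described above can be
walked with a short shell loop. The sketch below is an illustration only: the
function name ``edac_device_counts`` is hypothetical, and it assumes that
every block directory holds the default ``ce_count`` and ``ue_count``
attributes.

```shell
# Print the CE/UE counters of every block of one edac_device.
# edac_device_counts is a hypothetical helper; its argument is the device
# directory, e.g. /sys/devices/system/edac/test-instance from the text.
edac_device_counts() (
    dev=$1
    # First level below the device: instances; second level: blocks.
    for blk in "$dev"/*/*/; do
        # Skip directories that are not blocks (no counter file).
        [ -r "${blk}ce_count" ] || continue
        blk=${blk%/}
        inst=${blk%/*}
        echo "${inst##*/}/${blk##*/}: ce=$(cat "$blk/ce_count") ue=$(cat "$blk/ue_count")"
    done
)
```

Summing the per-block values printed for one instance should give the totals
found in that instance's own ``ce_count`` and ``ue_count`` files.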
981*4882a593Smuzhiyun
982*4882a593SmuzhiyunUsage of EDAC APIs on Nehalem and newer Intel CPUs
983*4882a593Smuzhiyun--------------------------------------------------
984*4882a593Smuzhiyun
985*4882a593SmuzhiyunOn older Intel architectures, the memory controller was part of the North
986*4882a593SmuzhiyunBridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and
987*4882a593Smuzhiyunnewer Intel architectures integrated an enhanced version of the memory
988*4882a593Smuzhiyuncontroller (MC) inside the CPUs.
989*4882a593Smuzhiyun
This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``sbx_edac``.
993*4882a593Smuzhiyun
994*4882a593Smuzhiyun.. note::
995*4882a593Smuzhiyun
   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.
999*4882a593Smuzhiyun
1) There is one Memory Controller per Quick Path Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees them as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as the smallest unit, the
   driver sequentially maps each channel/DIMM pair into a different csrow.
1011*4882a593Smuzhiyun
1012*4882a593Smuzhiyun   For example, supposing the following layout::
1013*4882a593Smuzhiyun
	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
	  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
	Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
1021*4882a593Smuzhiyun
1022*4882a593Smuzhiyun   The driver will map it as::
1023*4882a593Smuzhiyun
1024*4882a593Smuzhiyun	csrow0: channel 0, dimm0
1025*4882a593Smuzhiyun	csrow1: channel 0, dimm1
1026*4882a593Smuzhiyun	csrow2: channel 1, dimm0
1027*4882a593Smuzhiyun	csrow3: channel 2, dimm0
1028*4882a593Smuzhiyun
   That is, the driver exports one DIMM per csrow.
1030*4882a593Smuzhiyun
1031*4882a593Smuzhiyun   Each QPI is exported as a different memory controller.
1032*4882a593Smuzhiyun
1033*4882a593Smuzhiyun2) The MC has the ability to inject errors to test drivers. The drivers
1034*4882a593Smuzhiyun   implement this functionality via some error injection nodes:
1035*4882a593Smuzhiyun
1036*4882a593Smuzhiyun   For injecting a memory error, there are some sysfs nodes, under
1037*4882a593Smuzhiyun   ``/sys/devices/system/edac/mc/mc?/``:
1038*4882a593Smuzhiyun
1039*4882a593Smuzhiyun   - ``inject_addrmatch/*``:
1040*4882a593Smuzhiyun      Controls the error injection mask register. It is possible to specify
1041*4882a593Smuzhiyun      several characteristics of the address to match an error code::
1042*4882a593Smuzhiyun
1043*4882a593Smuzhiyun         dimm = the affected dimm. Numbers are relative to a channel;
1044*4882a593Smuzhiyun         rank = the memory rank;
1045*4882a593Smuzhiyun         channel = the channel that will generate an error;
1046*4882a593Smuzhiyun         bank = the affected bank;
1047*4882a593Smuzhiyun         page = the page address;
1048*4882a593Smuzhiyun         column (or col) = the address column.
1049*4882a593Smuzhiyun
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to "any".

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any, you can do::

		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1064*4882a593Smuzhiyun
   - ``inject_eccmask``:
       specifies which bits will be affected by the error.

   - ``inject_section``:
       specifies which ECC cache section will get the error::
1070*4882a593Smuzhiyun
1071*4882a593Smuzhiyun		3 for both
1072*4882a593Smuzhiyun		2 for the highest
1073*4882a593Smuzhiyun		1 for the lowest
1074*4882a593Smuzhiyun
1075*4882a593Smuzhiyun   - ``inject_type``:
1076*4882a593Smuzhiyun       specifies the type of error, being a combination of the following bits::
1077*4882a593Smuzhiyun
1078*4882a593Smuzhiyun		bit 0 - repeat
1079*4882a593Smuzhiyun		bit 1 - ecc
1080*4882a593Smuzhiyun		bit 2 - parity
1081*4882a593Smuzhiyun
   - ``inject_enable``:
       starts the error generation when a non-zero value is written.

   All inject variables can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write to
   an address that matches inject_addrmatch. It seems, however, that reading
   will also produce an error.
1090*4882a593Smuzhiyun
1091*4882a593Smuzhiyun   For example, the following code will generate an error for any write access
1092*4882a593Smuzhiyun   at socket 0, on any DIMM/address on channel 2::
1093*4882a593Smuzhiyun
1094*4882a593Smuzhiyun	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
1095*4882a593Smuzhiyun	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
1096*4882a593Smuzhiyun	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
1097*4882a593Smuzhiyun	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
1098*4882a593Smuzhiyun	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
1099*4882a593Smuzhiyun	dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
1100*4882a593Smuzhiyun
   For socket 1, replace "mc0" with "mc1" in the above commands.
1103*4882a593Smuzhiyun
1104*4882a593Smuzhiyun   The generated error message will look like::
1105*4882a593Smuzhiyun
1106*4882a593Smuzhiyun	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
1107*4882a593Smuzhiyun
1108*4882a593Smuzhiyun3) Corrected Error memory register counters
1109*4882a593Smuzhiyun
   These newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs (UDIMMs). As the
   chipset offers some counters that also work with UDIMMs (but with a coarser
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.
1118*4882a593Smuzhiyun
1119*4882a593Smuzhiyun   They can be read by looking at the contents of ``all_channel_counts/``::
1120*4882a593Smuzhiyun
1121*4882a593Smuzhiyun     $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
1122*4882a593Smuzhiyun	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
1123*4882a593Smuzhiyun	0
1124*4882a593Smuzhiyun	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
1125*4882a593Smuzhiyun	0
1126*4882a593Smuzhiyun	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
1127*4882a593Smuzhiyun	0
1128*4882a593Smuzhiyun
1129*4882a593Smuzhiyun   What happens here is that errors on different csrows, but at the same
1130*4882a593Smuzhiyun   dimm number will increment the same counter.
1131*4882a593Smuzhiyun   So, in this memory mapping::
1132*4882a593Smuzhiyun
1133*4882a593Smuzhiyun	csrow0: channel 0, dimm0
1134*4882a593Smuzhiyun	csrow1: channel 0, dimm1
1135*4882a593Smuzhiyun	csrow2: channel 1, dimm0
1136*4882a593Smuzhiyun	csrow3: channel 2, dimm0
1137*4882a593Smuzhiyun
   The hardware will increment udimm0 for an error at the first dimm of any
   channel (csrow0, csrow2 or csrow3 in this mapping);

   The hardware will increment udimm1 for an error at the second dimm of any
   channel (csrow1 in this mapping);

   The hardware will increment udimm2 for an error at the third dimm of any
   channel (no csrow in this mapping, as no channel has a third dimm).
1146*4882a593Smuzhiyun
4) Standard error counters

   The standard error counters are updated when the driver receives an
   mcelog error. Since, with UDIMMs, these errors are counted by software, it
   is possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

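As a rough illustration of item 3 above, the shell sketch below prints the
per-DIMM corrected error counters of one memory controller and adds them up.
It assumes the ``all_channel_counts/`` layout shown earlier; the function
names ``read_udimm_counts`` and ``total_udimm_errors`` and the optional
directory argument are hypothetical helpers, not part of the EDAC tooling.

```shell
#!/bin/sh
# Sketch only: helper names and the optional directory argument are
# illustrative assumptions, not an interface provided by the kernel.

# Print "<counter>=<value>" for every udimm counter in the given directory.
# Defaults to the mc0 path used in the example above.
read_udimm_counts() {
    dir="${1:-/sys/devices/system/edac/mc/mc0/all_channel_counts}"
    for f in "$dir"/udimm*; do
        # Skip the unexpanded glob when no counter files exist.
        [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Sum all udimm counters of the given directory.
total_udimm_errors() {
    total=0
    for line in $(read_udimm_counts "$1"); do
        # Keep only the value after the "=" sign.
        total=$((total + ${line#*=}))
    done
    echo "$total"
}
```

Calling ``total_udimm_errors`` without an argument reads ``mc0``; pass
another ``all_channel_counts`` directory to inspect a different controller.
Since each counter aggregates every csrow with the same DIMM number, a
non-zero value identifies the DIMM position but not the channel.
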
Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
           Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
           Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
           Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
           Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
           Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
           Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al.
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh