Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc.

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!).  In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick.  These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.
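
As a rough illustration of the arithmetic above, the following user-space C
sketch computes how many devices of a given width have to be grouped in
parallel to fill the bus. The function name ``devices_per_rank`` is
hypothetical and is not part of the EDAC API; the 72-bit width is the typical
value mentioned above.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not an EDAC function): number of DRAM devices
 * that must be grouped in parallel so that their outputs add up to
 * bus_bits. Returns 0 when device_bits is zero or does not divide the
 * bus width evenly.
 */
unsigned int devices_per_rank(unsigned int bus_bits, unsigned int device_bits)
{
	if (device_bits == 0 || bus_bits % device_bits != 0)
		return 0;

	return bus_bits / device_bits;
}

/* Print the device count for the two common widths on a 72-bit bus. */
void print_typical_layouts(void)
{
	printf("x4 devices: %u\n", devices_per_rank(72, 4));
	printf("x8 devices: %u\n", devices_per_rank(72, 8));
}
```

With a 72-bit bus, this gives 18 devices for x4 parts and 9 devices for x8
parts.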

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel.  In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. It is most often called
a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group of
DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest hierarchy on a Fully-Buffered DIMM memory
controller. Typically, it contains two channels. Two channels on the
same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. In exchange, lockstep mode
is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g., if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g., if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, for
dual channel 128 bits. It may not be visible to the memory controller,
as some DIMM types have a memory buffer that can hide direct access to
it from the Memory Controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices.  The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row.  A single socket
set has two chip-select rows and, if double-sided sticks are used, these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.


Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Memory controller instances are allocated with :c:func:`edac_mc_alloc`.
Internally, it uses the struct ``mem_ctl_info``, which is opaque to the EDAC
drivers, to describe the memory controllers. Only the EDAC core is allowed to
touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation in sysfs.

This set of structures, and the code that implements their APIs, allows
registering EDAC-type devices which are NOT standard memory or PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

	/sys/devices/system/edac/..

	pci/		<existing pci directory (if available)>
	mc/		<existing memory device directory>
	cpu/cpu0/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	cpu/cpu1/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	...

	the L1 and L2 directories would be "edac_device_block's"

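Counters laid out as above can be read from user space as plain text files.
The sketch below is a minimal illustration, assuming each counter file holds a
single unsigned decimal number. The helper name ``edac_read_counter`` is
hypothetical and is not provided by the kernel, and the actual paths depend on
the driver and on the instance and block names it registers.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical user-space helper (not a kernel API): read one counter
 * file such as a ce_count or ue_count node and store its value in
 * *value. Assumes the file contains a single unsigned decimal number.
 * Returns 0 on success and -1 if the file cannot be opened, is empty,
 * or does not start with a number.
 */
int edac_read_counter(const char *path, unsigned long long *value)
{
	char buf[32];
	char *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*value = strtoull(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;

	return 0;
}
```

A monitoring tool could call this with the path of one of the ``ce_count`` or
``ue_count`` files from the layout above, and treat a non-zero return value as
"counter not available".
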
.. kernel-doc:: drivers/edac/edac_device.h