Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc.

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!). In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick. These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel. In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. Most often it is also
called a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.
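
As a worked example of the *Memory devices* definition above, the number
of DRAM chips that must be grouped in parallel follows from dividing the
bus width by the device width. The sketch below is illustrative only; the
helper name ``devices_per_rank`` is hypothetical and not part of any EDAC
API, and it assumes the typical 64 + 8 bit layout mentioned above.

.. code-block:: python

    def devices_per_rank(bus_width_bits: int, device_width_bits: int) -> int:
        """Return how many devices are needed in parallel to fill the bus."""
        if bus_width_bits % device_width_bits:
            raise ValueError("bus width is not a multiple of the device width")
        return bus_width_bits // device_width_bits

    DATA_BITS = 64                    # payload bits per access
    ECC_BITS = 8                      # ECC bits per access
    BUS_BITS = DATA_BITS + ECC_BITS   # the typical 72-bit bus

    for width in (4, 8):              # x4 and x8 devices
        print(f"x{width}: {devices_per_rank(BUS_BITS, width)} devices")

Under these assumptions a 72-bit bus needs 18 x4 devices or 9 x8 devices.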

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two channels
in the same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. For the same reason,
lockstep mode is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.
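
The single-channel and double-channel widths above are simple multiples
of the per-DIMM width. The following sketch makes the arithmetic
explicit; ``access_width_bits`` is a hypothetical helper, and the total
width including ECC for two DIMMs is derived here from the 72-bit figure
rather than stated by the text above.

.. code-block:: python

    def access_width_bits(dimms: int, data_bits: int = 64, ecc_bits: int = 8):
        """Return (data width, width including ECC) for DIMMs read in parallel."""
        if dimms < 1:
            raise ValueError("at least one DIMM is required")
        return dimms * data_bits, dimms * (data_bits + ecc_bits)

    single = access_width_bits(1)   # one DIMM per access
    double = access_width_bits(2)   # two DIMMs accessed at the same time
    print("single-channel:", single)
    print("double-channel:", double)

With the default 64 + 8 bit DIMM, a single-channel access carries 64 data
bits and a double-channel access carries 128 data bits.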

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, for
dual channel 128 bits. It may not be visible to the memory controller,
as some DIMM types have a memory buffer that can hide direct access to
it from the Memory Controller.

* Single-Ranked stick

A single-ranked stick has one chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices. The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row.
A single socket set has two chip-select rows and, if double-sided sticks
are used, these will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.


Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Memory controllers are allocated with :c:func:`edac_mc_alloc`, which
internally uses the struct ``mem_ctl_info`` to describe them. This struct is
opaque to the EDAC drivers: only the EDAC core is allowed to touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation in sysfs.

This set of structures, and the code that implements the APIs for them,
allows registering EDAC type devices which are NOT standard memory or
PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

	/sys/devices/system/edac/..

	pci/		<existing pci directory (if available)>
	mc/		<existing memory device directory>
	cpu/cpu0/..	<L1 and L2 block directory>
			/L1-cache/ce_count
				 /ue_count
			/L2-cache/ce_count
				 /ue_count
	cpu/cpu1/..	<L1 and L2 block directory>
			/L1-cache/ce_count
				 /ue_count
			/L2-cache/ce_count
				 /ue_count
	...

Here, the L1 and L2 directories would each be an ``edac_device_block``.

.. kernel-doc:: drivers/edac/edac_device.h
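
The two-level layout described above can be inspected from userspace by
walking the tree and reading the counter files. This is only a sketch
under stated assumptions: ``read_block_counters`` is a hypothetical
helper, the root path is taken from the example layout above, and the
directories actually present depend on which EDAC drivers are loaded.

.. code-block:: python

    import os

    def read_block_counters(root):
        """Map each directory under root to the ce_count/ue_count it holds."""
        counters = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            found = {}
            for name in ("ce_count", "ue_count"):
                if name not in filenames:
                    continue
                try:
                    with open(os.path.join(dirpath, name)) as counter_file:
                        found[name] = int(counter_file.read().strip())
                except (OSError, ValueError):
                    # Unreadable or non-numeric counter: skip it.
                    continue
            if found:
                counters[os.path.relpath(dirpath, root)] = found
        return counters

    if __name__ == "__main__":
        # Assumed location, following the example sysfs layout above.
        root = "/sys/devices/system/edac/cpu"
        for block, counts in sorted(read_block_counters(root).items()):
            print(block, counts)

If the root directory does not exist, the walk yields nothing and the
script prints no output.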