===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace helper library

linux-nvdimm@lists.01.org

Version 13

.. contents:

    Glossary
    Overview
        Supporting Documents
        Git Trees
    LIBNVDIMM PMEM and BLK
        Why BLK?
            PMEM vs BLK
                BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
    Example NVDIMM Platform
    LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
        LIBNDCTL: Context
            libndctl: instantiate a new library context example
        LIBNVDIMM/LIBNDCTL: Bus
            libnvdimm: control class device in /sys/class
            libnvdimm: bus
            libndctl: bus enumeration example
        LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
            libnvdimm: DIMM (NMEM)
            libndctl: DIMM enumeration example
        LIBNVDIMM/LIBNDCTL: Region
            libnvdimm: region
            libndctl: region enumeration example
            Why Not Encode the Region Type into the Region Name?
            How Do I Determine the Major Type of a Region?
        LIBNVDIMM/LIBNDCTL: Namespace
            libnvdimm: namespace
            libndctl: namespace enumeration example
            libndctl: namespace creation example
            Why the Term "namespace"?
        LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
            libnvdimm: btt layout
            libndctl: btt creation example
    Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes are persistent. A
  block device composed of PMEM is capable of DAX. A PMEM address range
  may span an interleave of several DIMMs.

BLK:
  A set of one or more programmable memory mapped apertures provided
  by a DIMM to access its media. This indirection precludes the
  performance benefit of interleaving, but enables DIMM-bounded failure
  modes.

DPA:
  DIMM Physical Address, a DIMM-relative offset. With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address. BLK capacity always has a 1:1 relationship
  with a single-DIMM's DPA range.

DAX:
  File system extensions to bypass the page cache and block layer to
  mmap persistent memory, from a PMEM block device, directly into a
  process address space.

DSM:
  Device Specific Method: an ACPI method to control a specific
  device, in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
  It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT:
  Block Translation Table: Persistent memory is byte addressable.
  Existing software may have an expectation that the power-fail-atomicity
  of writes is at least one sector, 512 bytes. The BTT is an indirection
  table with atomic update semantics to front a PMEM/BLK block device
  driver and present arbitrary atomic sector sizes.

LABEL:
  Metadata stored on a DIMM device that partitions and identifies
  (persistently names) storage between PMEM and BLK. It also partitions
  BLK storage to host BTTs with different parameters per BLK-partition.
  Note that traditional partition tables, GPT/MBR, are layered on top of a
  BLK or PMEM device.


Overview
========

The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
and BLK mode access. These three modes of operation are described by
the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
implementation is generic and supports pre-NFIT platforms, it was guided
by the superset of capabilities needed to support this ACPI 6 definition
for NVDIMM resources.
The bulk of the kernel implementation is in place
to handle the case where DPA accessible via PMEM is aliased with DPA
accessible via BLK. When that occurs a LABEL is needed to reserve DPA
for exclusive access via one mode at a time.

Supporting Documents
--------------------

ACPI 6:
    https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
    https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
    https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
    https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
---------

LIBNVDIMM:
    https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
LIBNDCTL:
    https://github.com/pmem/ndctl.git
PMEM:
    https://github.com/01org/prd


LIBNVDIMM PMEM and BLK
======================

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways. Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss.
Now, the NFIT
specification standardizes not only the description of PMEM, but also
BLK and platform message-passing entry points for control and
configuration.

For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
device driver:

    1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
       range is contiguous in system memory and may be interleaved (hardware
       memory controller striped) across multiple DIMMs. When interleaved the
       platform may optionally provide details of which DIMMs are participating
       in the interleave.

       Note that while LIBNVDIMM describes system-physical-address ranges that may
       alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
       alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
       distinction. The different device-types are an implementation detail
       that userspace can exploit to implement policies like "only interface
       with address ranges from certain DIMMs". It is worth noting that when
       aliasing is present and a DIMM lacks a label, then no block device can
       be created by default as userspace needs to do at least one allocation
       of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
       registered, can be immediately attached to nd_pmem.

    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
       defined apertures. A set of apertures will access just one DIMM.
       Multiple windows (apertures) allow multiple concurrent accesses, much like
       tagged-command-queuing, and would likely be used by different threads or
       different CPUs.

       The NFIT specification defines a standard format for a BLK-aperture, but
       the spec also allows for vendor specific layouts, and non-NFIT BLK
       implementations may have other designs for BLK I/O. For this reason
       "nd_blk" calls back into platform-specific code to perform the I/O.

       One such implementation is defined in the "Driver Writer's Guide" and "DSM
       Interface Example".


Why BLK?
========

While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (reliability,
availability, and serviceability) model. An access to a corrupted
system-physical-address causes a CPU exception, while an access
to a corrupted address through a BLK-aperture causes that block window
to raise an error status in a register. The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.

Also, if an administrator ever wants to replace memory, it is easier to
service a system at DIMM module boundaries. Compare this to PMEM where
data could be interleaved in an opaque hardware specific manner across
several DIMMs.

PMEM vs BLK
-----------

BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem. They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture. Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results. For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions. For simplicity a DIMM is allowed one PMEM "region" per
interleave set in which it is a member. The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.

BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the few reasons to allow multiple BLK namespaces per REGION is so
that each BLK-namespace can be configured with a BTT with unique atomic
sector sizes. While a PMEM device can host a BTT, the LABEL specification
does not provide for a sector size to be specified for a PMEM namespace.

This is due to the expectation that the primary usage model for PMEM is
via DAX, and the BTT is incompatible with DAX. However, for the cases
where an application or filesystem still needs atomic sector update
guarantees it can register a BTT on a PMEM device or partition. See
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".


Example NVDIMM Platform
=======================

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::


                                 (a)      (b)                   DIMM   BLK-REGION
              +-------------------+--------+--------+--------+
    +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
    | imc0 +--+- - - region0- - - +--------+        +--------+
    +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
       |      +-------------------+--------v        v--------+
    +--+---+                               |                 |
    | cpu0 |                                     region1
    +--+---+                               |                 |
       |      +----------------------------^        ^--------+
    +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
    | imc1 +--+----------------------------|        +--------+
    +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
              +----------------------------+--------+--------+

In this platform we have four DIMMs and two memory controllers in one
socket. Each unique interface (BLK or PMEM) to DPA space is identified
by a region device with a dynamically assigned id (REGION0 - REGION5).

    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
       single PMEM namespace is created in the REGION0-SPA-range that spans most
       of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
       interleaved system-physical-address range is reclaimed as BLK-aperture
       accessed space starting at DPA-offset (a) into each DIMM. In that
       reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
       REGION3 where "blk2.0" and "blk3.0" are just human readable names that
       could be set to any user-desired name in the LABEL.

    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
       system-physical-address range, REGION1, that spans those two DIMMs as
       well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
       named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (one for
       each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
       "blk5.0".

    3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
       interleaved system-physical-address range (i.e. the DPA addresses past
       offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
       Note that this example shows that BLK-aperture namespaces don't need to
       be contiguous in DPA-space.

    This bus is provided by the kernel under the device
    /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
    tools/testing/nvdimm is loaded.
    This tests not only LIBNVDIMM but the
    acpi_nfit.ko driver as well.


LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API. The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state. The library is
based on the libabc template:

    https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

    struct ndctl_ctx *ctx;

    if (ndctl_new(&ctx) == 0)
        return ctx;
    else
        return NULL;

LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it. The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.

LIBNVDIMM: control class device in /sys/class
---------------------------------------------

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::

    /sys/class/nd/ndctl0
    |-- dev
    |-- device -> ../../../ndbus0
    |-- subsystem -> ../../../../../../../class/nd



LIBNVDIMM: bus
--------------

::

    struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
           struct nvdimm_bus_descriptor *nfit_desc);

::

    /sys/devices/platform/nfit_test.0/ndbus0
    |-- commands
    |-- nd
    |-- nfit
    |-- nmem0
    |-- nmem1
    |-- nmem2
    |-- nmem3
    |-- power
    |-- provider
    |-- region0
    |-- region1
    |-- region2
    |-- region3
    |-- region4
    |-- region5
    |-- uevent
    `-- wait_probe

LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the bus handle that describes the bus from Example NVDIMM Platform::

    static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
            const char *provider)
    {
        struct ndctl_bus *bus;

        ndctl_bus_foreach(ctx, bus)
            if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
                return bus;

        return NULL;
    }

    bus = get_bus_by_provider(ctx, "nfit_test.0");


LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs. If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^

::

    struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
            const struct attribute_group **groups, unsigned long flags,
            unsigned long *dsm_mask);

::

    /sys/devices/platform/nfit_test.0/ndbus0
    |-- nmem0
    |   |-- available_slots
    |   |-- commands
    |   |-- dev
    |   |-- devtype
    |   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
    |   |-- modalias
    |   |-- nfit
    |   |   |-- device
    |   |   |-- format
    |   |   |-- handle
    |   |   |-- phys_id
    |   |   |-- rev_id
    |   |   |-- serial
    |   |   `-- vendor
    |   |-- state
    |   |-- subsystem -> ../../../../../bus/nd
    |   `-- uevent
    |-- nmem1
    [..]
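
The "handle" attribute listed under nmemX/nfit packs the DIMM's position in
the platform topology into a single 32-bit value. As a minimal sketch, and
assuming the bit layout given in the DIMM enumeration example of this
document, the handle could be split into its fields as follows. The struct
and function names are hypothetical and are not part of libndctl or the
kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields packed into an NFIT handle. */
struct nfit_handle_fields {
	unsigned int dimm;        /* bits 3:0   DIMM number within the channel */
	unsigned int channel;     /* bits 7:4   memory channel number */
	unsigned int controller;  /* bits 11:8  memory controller ID */
	unsigned int socket;      /* bits 15:12 socket ID */
	unsigned int node;        /* bits 27:16 node controller ID */
};

/* Split a 32-bit NFIT handle into its fields; bits 31:28 are reserved. */
static void nfit_handle_decode(uint32_t handle, struct nfit_handle_fields *out)
{
	assert(out != NULL);

	out->dimm = handle & 0xf;
	out->channel = (handle >> 4) & 0xf;
	out->controller = (handle >> 8) & 0xf;
	out->socket = (handle >> 12) & 0xf;
	out->node = (handle >> 16) & 0xfff;
}
```

This is simply the inverse of packing the five fields into one integer, so a
value obtained from ndctl_dimm_get_handle() could be passed to it directly.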


LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note, in this example we are assuming NFIT-defined DIMMs which are
identified by an "nfit_handle", a 32-bit value where:

   - Bit 3:0 DIMM number within the memory channel
   - Bit 7:4 memory channel number
   - Bit 11:8 memory controller ID
   - Bit 15:12 socket ID (within scope of a Node controller if node
     controller is present)
   - Bit 27:16 Node Controller ID
   - Bit 31:28 Reserved

::

    static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
           unsigned int handle)
    {
        struct ndctl_dimm *dimm;

        ndctl_dimm_foreach(bus, dimm)
            if (ndctl_dimm_get_handle(dimm) == handle)
                return dimm;

        return NULL;
    }

    /* node, socket, memory controller, channel, DIMM */
    #define DIMM_HANDLE(n, s, i, c, d) \
        ((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
         | (((c) & 0xf) << 4) | ((d) & 0xf))

    dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));

LIBNVDIMM/LIBNDCTL: Region
--------------------------

A generic REGION device is registered for each PMEM range or BLK-aperture
set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
sets on the "nfit_test.0" bus.
The primary role of a region is to be a
container of "mappings". A mapping is a tuple of <DIMM,
DPA-start-offset, length>.

LIBNVDIMM provides a built-in driver for these REGION devices. This driver
is responsible for reconciling the aliased DPA mappings across all
regions, parsing the LABEL, if present, and then emitting NAMESPACE
devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
nd_blk device driver to consume.

In addition to the generic attributes of "mapping"s, "interleave_ways"
and "size", the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.
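
Because these attributes are plain sysfs files, a script or small program can
read them without libndctl. The following is a minimal sketch of that idea
using only the C standard library. The helper names are hypothetical, and the
region directory passed in (for example
/sys/devices/platform/nfit_test.0/ndbus0/region0 on the example platform) is
an assumption about where the bus is registered:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one attribute file "<dir>/<attr>" into buf as a string and strip
 * the trailing newline.  Returns 0 on success, -1 on error.
 * Hypothetical helper, not part of libndctl.
 */
static int read_sysfs_attr(const char *dir, const char *attr,
			   char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(dir && attr && buf && len > 0);

	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Read the integer "nstype" attribute of a region directory. */
static int read_region_nstype(const char *region_dir, int *nstype)
{
	char buf[32];

	if (read_sysfs_attr(region_dir, "nstype", buf, sizeof(buf)) < 0)
		return -1;
	return sscanf(buf, "%d", nstype) == 1 ? 0 : -1;
}
```

The integer returned by read_region_nstype() is deliberately left undecoded
here; its meaning is defined by the constants in the kernel header
linux/ndctl.h. The same read_sysfs_attr() helper works for the string-valued
"devtype" and "modalias" attributes.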

LIBNVDIMM: region::

    struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
            struct nd_region_desc *ndr_desc);
    struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
            struct nd_region_desc *ndr_desc);

::

    /sys/devices/platform/nfit_test.0/ndbus0
    |-- region0
    |   |-- available_size
    |   |-- btt0
    |   |-- btt_seed
    |   |-- devtype
    |   |-- driver -> ../../../../../bus/nd/drivers/nd_region
    |   |-- init_namespaces
    |   |-- mapping0
    |   |-- mapping1
    |   |-- mappings
    |   |-- modalias
    |   |-- namespace0.0
    |   |-- namespace_seed
    |   |-- numa_node
    |   |-- nfit
    |   |   `-- spa_index
    |   |-- nstype
    |   |-- set_cookie
    |   |-- size
    |   |-- subsystem -> ../../../../../bus/nd
    |   `-- uevent
    |-- region1
    [..]

LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample region retrieval routines based on NFIT-unique data like
"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
BLK::

    static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
            unsigned int spa_index)
    {
        struct ndctl_region *region;

        ndctl_region_foreach(bus, region) {
            if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
                continue;
            if (ndctl_region_get_spa_index(region) == spa_index)
                return region;
        }
        return NULL;
    }

    static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
            unsigned int handle)
    {
        struct ndctl_region *region;

        ndctl_region_foreach(bus, region) {
            struct ndctl_mapping *map;

            /* region type constants come from linux/ndctl.h */
            if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLK)
                continue;
            ndctl_mapping_foreach(region, map) {
                struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);

                if (ndctl_dimm_get_handle(dimm) == handle)
                    return region;
            }
        }
        return NULL;
    }


Why Not Encode the Region Type into the Region Name?
----------------------------------------------------

At first glance it seems that, since NFIT defines just PMEM and BLK
interface types, we should simply name REGION devices with something
derived from those type names. However, the ND subsystem explicitly keeps
the REGION name generic and expects userspace to always consider the
region-attributes for four reasons:

    1. There are already more than two REGION and "namespace" types. For
       PMEM there are two subtypes. As mentioned previously we have PMEM where
       the constituent DIMM devices are known and anonymous PMEM. For BLK
       regions the NFIT specification already anticipates vendor specific
       implementations. The exact distinction of what a region contains is in
       the region-attributes not the region-name or the region-devtype.

    2. A region with zero child-namespaces is a possible configuration. For
       example, the NFIT allows for a DCR to be published without a
       corresponding BLK-aperture. This equates to a DIMM that can only accept
       control/configuration messages, but no i/o through a descendant block
       device. Again, this "type" is advertised in the attributes ('mappings'
       == 0) and the name does not tell you much.

    3. What if a third major interface type arises in the future? Outside
       of vendor specific implementations, it's not difficult to envision a
       third class of interface type beyond BLK and PMEM.
       With a generic name for the REGION level of the device-hierarchy old
       userspace implementations can still make sense of new kernel advertised
       region-types. Userspace can always rely on the generic region
       attributes like "mappings", "size", etc. and the expected child devices
       named "namespace". This generic format of the device-model hierarchy
       allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
       future-proof.

    4. There are more robust mechanisms for determining the major type of a
       region than a device name. See the next section, How Do I Determine the
       Major Type of a Region?

How Do I Determine the Major Type of a Region?
----------------------------------------------

Outside of the blanket recommendation of "use libndctl", or simply
looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
"nstype" integer attribute, here are some other options.

1. module alias lookup
^^^^^^^^^^^^^^^^^^^^^^

    The whole point of region/namespace device type differentiation is to
    decide which block-device driver will attach to a given LIBNVDIMM namespace.
    One can simply use the modalias to look up the resulting module. It's
    important to note that this method is robust in the presence of a
    vendor-specific driver down the road.
    If a vendor-specific implementation wants to supplant the standard
    nd_blk driver it can with minimal impact to the rest of LIBNVDIMM.

    In fact, a vendor may also want to have a vendor-specific region-driver
    (outside of nd_region). For example, if a vendor defined its own LABEL
    format it would need its own region driver to parse that LABEL and emit
    the resulting namespaces. The output from module resolution is more
    accurate than a region-name or region-devtype.

2. udev
^^^^^^^

    The kernel "devtype" is registered in the udev database::

        # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
        P: /devices/platform/nfit_test.0/ndbus0/region0
        E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
        E: DEVTYPE=nd_pmem
        E: MODALIAS=nd:t2
        E: SUBSYSTEM=nd

        # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
        P: /devices/platform/nfit_test.0/ndbus0/region4
        E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
        E: DEVTYPE=nd_blk
        E: MODALIAS=nd:t3
        E: SUBSYSTEM=nd

    ...and is available as a region attribute, but keep in mind that the
    "devtype" does not indicate sub-type variations and scripts should
    really inspect the other attributes.

3. type specific attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

    As it currently stands a BLK-aperture region will never have a
    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
    that does not allow I/O. A PMEM region with a "mappings" value of zero
    is a simple system-physical-address range.


LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries,
surfaces one or more "namespace" devices. The arrival of a "namespace"
device currently triggers either the nd_blk or nd_pmem driver to load
and register a disk/block device.
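
Driver selection for regions and namespaces is keyed off the device
"modalias", which in the udev output shown earlier has the form ``nd:t<N>``.
The following is a minimal sketch, not part of libndctl, of decoding such a
string in plain C. The helper names (``nd_modalias_to_type``,
``nd_type_name``) and the ``SKETCH_*`` constants are hypothetical; the
numeric values are assumed to mirror the ``ND_DEVICE_*`` definitions in
/usr/include/linux/ndctl.h and should be checked against that header.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to mirror ND_DEVICE_* in <linux/ndctl.h>; verify locally. */
enum sketch_nd_type {
	SKETCH_ND_REGION_PMEM = 2,	/* matches "nd:t2" / DEVTYPE=nd_pmem */
	SKETCH_ND_REGION_BLK = 3,	/* matches "nd:t3" / DEVTYPE=nd_blk */
};

/*
 * Parse a modalias of the form "nd:t<N>" (an optional trailing newline is
 * tolerated, since sysfs attributes end with one). Returns N, or -1 if
 * the string does not have that form.
 */
static int nd_modalias_to_type(const char *modalias)
{
	const char *digits;
	char *end;
	long type;

	if (strncmp(modalias, "nd:t", 4) != 0)
		return -1;
	digits = modalias + 4;
	if (!isdigit((unsigned char)*digits))
		return -1;
	type = strtol(digits, &end, 10);
	if ((*end != '\0' && *end != '\n') || type > INT_MAX)
		return -1;
	return (int)type;
}

/* Map the two region types shown in the udev output to a label. */
static const char *nd_type_name(int type)
{
	switch (type) {
	case SKETCH_ND_REGION_PMEM:
		return "pmem region";
	case SKETCH_ND_REGION_BLK:
		return "blk region";
	default:
		return "unknown";
	}
}

/* Usage example with the modalias strings from the udev output. */
static void nd_modalias_example(void)
{
	assert(nd_modalias_to_type("nd:t2") == SKETCH_ND_REGION_PMEM);
	assert(strcmp(nd_type_name(nd_modalias_to_type("nd:t3\n")),
		      "blk region") == 0);
}
```

Unrecognized values deliberately map to "unknown" rather than an error, so
that a type added by a newer kernel does not break older tooling; the other
attributes remain the authoritative source for sub-type details.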

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the three major types of NAMESPACE where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute), namespace2.0 represents a BLK namespace (note that it has a
'sector_size' attribute), and namespace6.0 represents an anonymous
PMEM namespace (note that it has no 'uuid' attribute because it does not
support a LABEL)::

    /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
    |-- alt_name
    |-- devtype
    |-- dpa_extents
    |-- force_raw
    |-- modalias
    |-- numa_node
    |-- resource
    |-- size
    |-- subsystem -> ../../../../../../bus/nd
    |-- type
    |-- uevent
    `-- uuid
    /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
    |-- alt_name
    |-- devtype
    |-- dpa_extents
    |-- force_raw
    |-- modalias
    |-- numa_node
    |-- sector_size
    |-- size
    |-- subsystem -> ../../../../../../bus/nd
    |-- type
    |-- uevent
    `-- uuid
    /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
    |-- block
    |   `-- pmem0
    |-- devtype
    |-- driver -> ../../../../../../bus/nd/drivers/pmem
    |-- force_raw
    |-- modalias
    |-- numa_node
    |-- resource
    |-- size
    |-- subsystem -> ../../../../../../bus/nd
    |-- type
    `-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Namespaces are indexed relative to their parent region, example below.
These indexes are mostly static from boot to boot, but the subsystem makes
no guarantees in this regard. For a static namespace identifier use its
'uuid' attribute.

::

    static struct ndctl_namespace
    *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
    {
            struct ndctl_namespace *ndns;

            ndctl_namespace_foreach(region, ndns)
                    if (ndctl_namespace_get_id(ndns) == id)
                            return ndns;

            return NULL;
    }

LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it. For the most part the setting of namespace attributes
can occur in any order; the only constraint is that 'uuid' must be set
before 'size'.
This enables the kernel to track DPA allocations
internally with a static identifier::

    static int configure_namespace(struct ndctl_region *region,
                    struct ndctl_namespace *ndns,
                    struct namespace_parameters *parameters)
    {
            char devname[50];

            snprintf(devname, sizeof(devname), "namespace%d.%d",
                            ndctl_region_get_id(region), parameters->id);

            ndctl_namespace_set_alt_name(ndns, devname);
            /* 'uuid' must be set prior to setting size! */
            ndctl_namespace_set_uuid(ndns, parameters->uuid);
            ndctl_namespace_set_size(ndns, parameters->size);
            /* unlike pmem namespaces, blk namespaces have a sector size */
            if (parameters->lbasize)
                    ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
            /* propagate the enable result to the caller */
            return ndctl_namespace_enable(ndns);
    }


Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

    1. Why not "volume" for instance? "volume" ran the risk of confusing
       ND (libnvdimm subsystem) with a volume manager like device-mapper.

    2. The term originated to describe the sub-devices that can be created
       within an NVME controller (see the nvme specification:
       https://www.nvmexpress.org/specifications/), and NFIT namespaces are
       meant to parallel the capabilities and configurability of
       NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------

A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a stacked
block device driver that fronts either the whole block device or a
partition of a block device emitted by either a PMEM or BLK NAMESPACE.

LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one BTT device which is the
seed device. To activate it set the "namespace", "uuid", and
"sector_size" attributes and then bind the device to the nd_pmem or
nd_blk driver depending on the region type::

    /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
    |-- namespace
    |-- delete
    |-- devtype
    |-- modalias
    |-- numa_node
    |-- sector_size
    |-- subsystem -> ../../../../../bus/nd
    |-- uevent
    `-- uuid

LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to namespaces an idle BTT device is automatically created per
region. Each time this "seed" btt device is configured and enabled a new
seed is created.
Creating a BTT configuration involves two steps of
finding an idle BTT and assigning it to consume a PMEM or BLK namespace::

    static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
    {
            struct ndctl_btt *btt;

            ndctl_btt_foreach(region, btt)
                    if (!ndctl_btt_is_enabled(btt)
                                    && !ndctl_btt_is_configured(btt))
                            return btt;

            return NULL;
    }

    static int configure_btt(struct ndctl_region *region,
                    struct btt_parameters *parameters)
    {
            struct ndctl_btt *btt = get_idle_btt(region);

            /* no idle seed device available in this region */
            if (!btt)
                    return -1;

            ndctl_btt_set_uuid(btt, parameters->uuid);
            ndctl_btt_set_sector_size(btt, parameters->sector_size);
            ndctl_btt_set_namespace(btt, parameters->ndns);
            /* turn off raw mode device */
            ndctl_namespace_disable(parameters->ndns);
            /* turn on btt access */
            return ndctl_btt_enable(btt);
    }

Once instantiated, a new inactive btt seed device will appear underneath
the region.

Once a "namespace" is removed from a BTT that instance of the BTT device
will be deleted or otherwise reset to default values. This deletion is
only at the device model level. In order to destroy a BTT the "info
block" needs to be destroyed. Note that to destroy a BTT the media
needs to be written in raw mode.
By default, the kernel will autodetect
the presence of a BTT and disable raw mode. This autodetect behavior
can be suppressed by enabling raw mode for the namespace via the
ndctl_namespace_set_raw_mode() API.


Summary LIBNDCTL Diagram
------------------------

For the given example above, here is the view of the objects as seen by the
LIBNDCTL API::

                +---+
                |CTX|    +---------+   +--------------+  +---------------+
                +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
                  |    | +---------+   +--------------+  +---------------+
    +-------+     |    | +---------+   +--------------+  +---------------+
    | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
    +-------+ |   |    | +---------+   +--------------+  +---------------+
    | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
    +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
    | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
    +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
    | DIMM3 <-+        |               +--------------+  +----------------------+
    +-------+          | +---------+   +--------------+  +---------------+
                       +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
                       | +---------+ | +--------------+  +----------------------+
                       |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
                       |               +--------------+  +----------------------+
                       | +---------+   +--------------+  +---------------+
                       +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
                       | +---------+   +--------------+  +---------------+
                       | +---------+   +--------------+  +----------------------+
                       +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
                         +---------+   +--------------+  +---------------+------+
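
The NAMESPACE objects in the diagram follow the "namespace<region>.<id>"
naming used throughout this document, the same format that the namespace
creation example builds with snprintf(). As a small illustration, the sketch
below splits such a name back into its region index and region-relative
namespace index. ``parse_namespace_name`` is a hypothetical helper, not a
libndctl function, and since these indexes are not guaranteed stable across
boots it is only suitable for interpreting names within a single session.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "namespace<R>.<N>" into region index R and namespace index N.
 * Returns 0 on success, -1 if the name does not have that exact form.
 */
static int parse_namespace_name(const char *name, unsigned int *region_id,
		unsigned int *ns_id)
{
	static const char prefix[] = "namespace";
	const char *p;
	char *end;
	unsigned long r, n;

	if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	p = name + sizeof(prefix) - 1;
	/* require a digit so strtoul() cannot accept a sign or whitespace */
	if (!isdigit((unsigned char)*p))
		return -1;
	r = strtoul(p, &end, 10);
	if (*end != '.' || !isdigit((unsigned char)end[1]))
		return -1;
	n = strtoul(end + 1, &end, 10);
	if (*end != '\0' || r > UINT_MAX || n > UINT_MAX)
		return -1;

	*region_id = (unsigned int)r;
	*ns_id = (unsigned int)n;
	return 0;
}

/* Usage example: NAMESPACE2.1 from the diagram belongs to REGION2. */
static void parse_namespace_example(void)
{
	unsigned int region_id, ns_id;

	assert(parse_namespace_name("namespace2.1", &region_id, &ns_id) == 0);
	assert(region_id == 2 && ns_id == 1);
}
```

For a persistent identifier, prefer the namespace 'uuid' attribute over
anything derived from the device name.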