===============================
LIBNVDIMM: Non-Volatile Devices
===============================
4*4882a593Smuzhiyun
5*4882a593Smuzhiyunlibnvdimm - kernel / libndctl - userspace helper library
6*4882a593Smuzhiyun
7*4882a593Smuzhiyunlinux-nvdimm@lists.01.org
8*4882a593Smuzhiyun
9*4882a593SmuzhiyunVersion 13
10*4882a593Smuzhiyun
11*4882a593Smuzhiyun.. contents:
12*4882a593Smuzhiyun
13*4882a593Smuzhiyun	Glossary
14*4882a593Smuzhiyun	Overview
15*4882a593Smuzhiyun	    Supporting Documents
16*4882a593Smuzhiyun	    Git Trees
17*4882a593Smuzhiyun	LIBNVDIMM PMEM and BLK
18*4882a593Smuzhiyun	Why BLK?
19*4882a593Smuzhiyun	    PMEM vs BLK
20*4882a593Smuzhiyun	        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
21*4882a593Smuzhiyun	Example NVDIMM Platform
22*4882a593Smuzhiyun	LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
23*4882a593Smuzhiyun	    LIBNDCTL: Context
24*4882a593Smuzhiyun	        libndctl: instantiate a new library context example
25*4882a593Smuzhiyun	    LIBNVDIMM/LIBNDCTL: Bus
26*4882a593Smuzhiyun	        libnvdimm: control class device in /sys/class
27*4882a593Smuzhiyun	        libnvdimm: bus
28*4882a593Smuzhiyun	        libndctl: bus enumeration example
29*4882a593Smuzhiyun	    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
30*4882a593Smuzhiyun	        libnvdimm: DIMM (NMEM)
31*4882a593Smuzhiyun	        libndctl: DIMM enumeration example
32*4882a593Smuzhiyun	    LIBNVDIMM/LIBNDCTL: Region
33*4882a593Smuzhiyun	        libnvdimm: region
34*4882a593Smuzhiyun	        libndctl: region enumeration example
35*4882a593Smuzhiyun	        Why Not Encode the Region Type into the Region Name?
36*4882a593Smuzhiyun	        How Do I Determine the Major Type of a Region?
37*4882a593Smuzhiyun	    LIBNVDIMM/LIBNDCTL: Namespace
38*4882a593Smuzhiyun	        libnvdimm: namespace
39*4882a593Smuzhiyun	        libndctl: namespace enumeration example
40*4882a593Smuzhiyun	        libndctl: namespace creation example
41*4882a593Smuzhiyun	        Why the Term "namespace"?
42*4882a593Smuzhiyun	    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
43*4882a593Smuzhiyun	        libnvdimm: btt layout
44*4882a593Smuzhiyun	        libndctl: btt creation example
45*4882a593Smuzhiyun	Summary LIBNDCTL Diagram
46*4882a593Smuzhiyun
47*4882a593Smuzhiyun
48*4882a593SmuzhiyunGlossary
49*4882a593Smuzhiyun========
50*4882a593Smuzhiyun
51*4882a593SmuzhiyunPMEM:
52*4882a593Smuzhiyun  A system-physical-address range where writes are persistent.  A
53*4882a593Smuzhiyun  block device composed of PMEM is capable of DAX.  A PMEM address range
54*4882a593Smuzhiyun  may span an interleave of several DIMMs.
55*4882a593Smuzhiyun
56*4882a593SmuzhiyunBLK:
57*4882a593Smuzhiyun  A set of one or more programmable memory mapped apertures provided
58*4882a593Smuzhiyun  by a DIMM to access its media.  This indirection precludes the
59*4882a593Smuzhiyun  performance benefit of interleaving, but enables DIMM-bounded failure
60*4882a593Smuzhiyun  modes.
61*4882a593Smuzhiyun
DPA:
  DIMM Physical Address, a DIMM-relative offset.  With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added, a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address.  BLK capacity always has a 1:1 relationship
  with a single-DIMM's DPA range.
69*4882a593Smuzhiyun
70*4882a593SmuzhiyunDAX:
71*4882a593Smuzhiyun  File system extensions to bypass the page cache and block layer to
72*4882a593Smuzhiyun  mmap persistent memory, from a PMEM block device, directly into a
73*4882a593Smuzhiyun  process address space.
74*4882a593Smuzhiyun
DSM:
  Device Specific Method: an ACPI method to control a specific
  device, in this case the firmware.
78*4882a593Smuzhiyun
79*4882a593SmuzhiyunDCR:
80*4882a593Smuzhiyun  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
81*4882a593Smuzhiyun  It defines a vendor-id, device-id, and interface format for a given DIMM.
82*4882a593Smuzhiyun
83*4882a593SmuzhiyunBTT:
84*4882a593Smuzhiyun  Block Translation Table: Persistent memory is byte addressable.
85*4882a593Smuzhiyun  Existing software may have an expectation that the power-fail-atomicity
86*4882a593Smuzhiyun  of writes is at least one sector, 512 bytes.  The BTT is an indirection
87*4882a593Smuzhiyun  table with atomic update semantics to front a PMEM/BLK block device
88*4882a593Smuzhiyun  driver and present arbitrary atomic sector sizes.
89*4882a593Smuzhiyun
90*4882a593SmuzhiyunLABEL:
91*4882a593Smuzhiyun  Metadata stored on a DIMM device that partitions and identifies
92*4882a593Smuzhiyun  (persistently names) storage between PMEM and BLK.  It also partitions
93*4882a593Smuzhiyun  BLK storage to host BTTs with different parameters per BLK-partition.
94*4882a593Smuzhiyun  Note that traditional partition tables, GPT/MBR, are layered on top of a
95*4882a593Smuzhiyun  BLK or PMEM device.
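
The DPA entry above says that a memory controller interleave must be decoded
to find the DPA behind a system-physical-address.  The following is a minimal
sketch of such a decode, assuming plain round-robin striping with a fixed
granularity and each DIMM's contribution starting at DPA 0.  Real platforms
may use other, possibly opaque, schemes; ``struct dpa_loc`` and
``spa_offset_to_dpa()`` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical result type: which DIMM in the set, and the offset within it. */
struct dpa_loc {
	unsigned int dimm;	/* index of the DIMM within the interleave set */
	uint64_t dpa;		/* DIMM-relative offset */
};

/*
 * Decode an offset into an interleaved SPA range, assuming simple
 * round-robin striping of 'granularity' bytes across 'ways' DIMMs.
 * Both 'ways' and 'granularity' must be non-zero.
 */
static struct dpa_loc spa_offset_to_dpa(uint64_t spa_offset,
		unsigned int ways, uint64_t granularity)
{
	uint64_t stripe = spa_offset / granularity;	/* stripe unit index */
	struct dpa_loc loc;

	loc.dimm = (unsigned int)(stripe % ways);
	loc.dpa = (stripe / ways) * granularity + spa_offset % granularity;
	return loc;
}
```

With ``ways == 1`` the function degenerates to the 1:1 association described
for a single-DIMM system.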
96*4882a593Smuzhiyun
97*4882a593Smuzhiyun
98*4882a593SmuzhiyunOverview
99*4882a593Smuzhiyun========
100*4882a593Smuzhiyun
The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
and BLK mode access.  These three modes of operation are described by
the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBNVDIMM
implementation is generic and supports pre-NFIT platforms, it was guided
by the superset of capabilities needed to support this ACPI 6 definition
for NVDIMM resources.  The bulk of the kernel implementation is in place
to handle the case where DPA accessible via PMEM is aliased with DPA
accessible via BLK.  When that occurs, a LABEL is needed to reserve DPA
for exclusive access via one mode at a time.
111*4882a593Smuzhiyun
112*4882a593SmuzhiyunSupporting Documents
113*4882a593Smuzhiyun--------------------
114*4882a593Smuzhiyun
115*4882a593SmuzhiyunACPI 6:
116*4882a593Smuzhiyun	https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
117*4882a593SmuzhiyunNVDIMM Namespace:
118*4882a593Smuzhiyun	https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
119*4882a593SmuzhiyunDSM Interface Example:
120*4882a593Smuzhiyun	https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
121*4882a593SmuzhiyunDriver Writer's Guide:
122*4882a593Smuzhiyun	https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
123*4882a593Smuzhiyun
124*4882a593SmuzhiyunGit Trees
125*4882a593Smuzhiyun---------
126*4882a593Smuzhiyun
127*4882a593SmuzhiyunLIBNVDIMM:
128*4882a593Smuzhiyun	https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
129*4882a593SmuzhiyunLIBNDCTL:
130*4882a593Smuzhiyun	https://github.com/pmem/ndctl.git
131*4882a593SmuzhiyunPMEM:
132*4882a593Smuzhiyun	https://github.com/01org/prd
133*4882a593Smuzhiyun
134*4882a593Smuzhiyun
135*4882a593SmuzhiyunLIBNVDIMM PMEM and BLK
136*4882a593Smuzhiyun======================
137*4882a593Smuzhiyun
138*4882a593SmuzhiyunPrior to the arrival of the NFIT, non-volatile memory was described to a
139*4882a593Smuzhiyunsystem in various ad-hoc ways.  Usually only the bare minimum was
140*4882a593Smuzhiyunprovided, namely, a single system-physical-address range where writes
141*4882a593Smuzhiyunare expected to be durable after a system power loss.  Now, the NFIT
142*4882a593Smuzhiyunspecification standardizes not only the description of PMEM, but also
143*4882a593SmuzhiyunBLK and platform message-passing entry points for control and
144*4882a593Smuzhiyunconfiguration.
145*4882a593Smuzhiyun
146*4882a593SmuzhiyunFor each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
147*4882a593Smuzhiyundevice driver:
148*4882a593Smuzhiyun
149*4882a593Smuzhiyun    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
150*4882a593Smuzhiyun       range is contiguous in system memory and may be interleaved (hardware
151*4882a593Smuzhiyun       memory controller striped) across multiple DIMMs.  When interleaved the
152*4882a593Smuzhiyun       platform may optionally provide details of which DIMMs are participating
153*4882a593Smuzhiyun       in the interleave.
154*4882a593Smuzhiyun
155*4882a593Smuzhiyun       Note that while LIBNVDIMM describes system-physical-address ranges that may
156*4882a593Smuzhiyun       alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
157*4882a593Smuzhiyun       alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
158*4882a593Smuzhiyun       distinction.  The different device-types are an implementation detail
159*4882a593Smuzhiyun       that userspace can exploit to implement policies like "only interface
160*4882a593Smuzhiyun       with address ranges from certain DIMMs".  It is worth noting that when
161*4882a593Smuzhiyun       aliasing is present and a DIMM lacks a label, then no block device can
162*4882a593Smuzhiyun       be created by default as userspace needs to do at least one allocation
163*4882a593Smuzhiyun       of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
164*4882a593Smuzhiyun       registered, can be immediately attached to nd_pmem.
165*4882a593Smuzhiyun
166*4882a593Smuzhiyun    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
167*4882a593Smuzhiyun       defined apertures.  A set of apertures will access just one DIMM.
168*4882a593Smuzhiyun       Multiple windows (apertures) allow multiple concurrent accesses, much like
169*4882a593Smuzhiyun       tagged-command-queuing, and would likely be used by different threads or
170*4882a593Smuzhiyun       different CPUs.
171*4882a593Smuzhiyun
172*4882a593Smuzhiyun       The NFIT specification defines a standard format for a BLK-aperture, but
173*4882a593Smuzhiyun       the spec also allows for vendor specific layouts, and non-NFIT BLK
174*4882a593Smuzhiyun       implementations may have other designs for BLK I/O.  For this reason
175*4882a593Smuzhiyun       "nd_blk" calls back into platform-specific code to perform the I/O.
176*4882a593Smuzhiyun
177*4882a593Smuzhiyun       One such implementation is defined in the "Driver Writer's Guide" and "DSM
178*4882a593Smuzhiyun       Interface Example".
179*4882a593Smuzhiyun
180*4882a593Smuzhiyun
181*4882a593SmuzhiyunWhy BLK?
182*4882a593Smuzhiyun========
183*4882a593Smuzhiyun
While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (reliability,
availability, and serviceability) model.  An access to a corrupted
system-physical-address causes a CPU exception, while an access
to a corrupted address through a BLK-aperture causes that block window
to raise an error status in a register.  The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.

Also, if an administrator ever wants to replace memory, it is easier to
service a system at DIMM module boundaries.  Compare this to PMEM where
data could be interleaved in an opaque hardware specific manner across
several DIMMs.
196*4882a593Smuzhiyun
197*4882a593SmuzhiyunPMEM vs BLK
198*4882a593Smuzhiyun-----------
199*4882a593Smuzhiyun
BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem.  They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture.  Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results.  For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions.  For simplicity a DIMM is allowed one PMEM "region" for each
interleave set in which it is a member.  The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.
215*4882a593Smuzhiyun
216*4882a593SmuzhiyunBLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
217*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
218*4882a593Smuzhiyun
One of the few reasons to allow multiple BLK namespaces per REGION is so
that each BLK-namespace can be configured with a BTT with unique atomic
sector sizes.  While a PMEM device can host a BTT, the LABEL specification
does not provide for a sector size to be specified for a PMEM namespace.

This is due to the expectation that the primary usage model for PMEM is
via DAX, and the BTT is incompatible with DAX.  However, for the cases
where an application or filesystem still needs atomic sector update
guarantees it can register a BTT on a PMEM device or partition.  See
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
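
To make the idea of "an indirection table with atomic update semantics"
concrete, here is a toy, in-memory sketch.  It is not the on-media BTT
layout and not kernel code: every name (``toy_btt``, ``TOY_SECTOR``, and so
on) is hypothetical, and a plain store stands in for the persistent map
update.  It only shows the principle that a sector is written out of place
and then published by changing a single map entry.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR	512
#define TOY_NLBA	4	/* logical blocks exposed to the user */

/* Toy model only: real BTT metadata lives on media, not in a struct. */
struct toy_btt {
	uint8_t data[TOY_NLBA + 1][TOY_SECTOR];	/* one spare physical block */
	uint32_t map[TOY_NLBA];			/* logical -> physical */
	uint32_t free_blk;			/* currently unused physical block */
};

static void toy_btt_init(struct toy_btt *btt)
{
	memset(btt, 0, sizeof(*btt));
	for (uint32_t i = 0; i < TOY_NLBA; i++)
		btt->map[i] = i;
	btt->free_blk = TOY_NLBA;
}

/*
 * Write a whole sector out of place, then publish it by updating a single
 * map entry.  A reader that follows the map sees either the complete old
 * sector or the complete new one, never a mix.
 */
static void toy_btt_write(struct toy_btt *btt, uint32_t lba, const void *buf)
{
	uint32_t old = btt->map[lba];

	memcpy(btt->data[btt->free_blk], buf, TOY_SECTOR);
	btt->map[lba] = btt->free_blk;	/* the single publishing step */
	btt->free_blk = old;		/* old block becomes the spare */
}

static const void *toy_btt_read(const struct toy_btt *btt, uint32_t lba)
{
	return btt->data[btt->map[lba]];
}
```

Because every read goes through the map, this indirection is also why a
BTT-fronted device cannot hand out stable direct mappings the way DAX does.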
230*4882a593Smuzhiyun
231*4882a593Smuzhiyun
232*4882a593SmuzhiyunExample NVDIMM Platform
233*4882a593Smuzhiyun=======================
234*4882a593Smuzhiyun
235*4882a593SmuzhiyunFor the remainder of this document the following diagram will be
236*4882a593Smuzhiyunreferenced for any example sysfs layouts::
237*4882a593Smuzhiyun
238*4882a593Smuzhiyun
239*4882a593Smuzhiyun                               (a)               (b)           DIMM   BLK-REGION
240*4882a593Smuzhiyun            +-------------------+--------+--------+--------+
241*4882a593Smuzhiyun  +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
242*4882a593Smuzhiyun  | imc0 +--+- - - region0- - - +--------+        +--------+
243*4882a593Smuzhiyun  +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
244*4882a593Smuzhiyun     |      +-------------------+--------v        v--------+
245*4882a593Smuzhiyun  +--+---+                               |                 |
246*4882a593Smuzhiyun  | cpu0 |                                     region1
247*4882a593Smuzhiyun  +--+---+                               |                 |
248*4882a593Smuzhiyun     |      +----------------------------^        ^--------+
249*4882a593Smuzhiyun  +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
250*4882a593Smuzhiyun  | imc1 +--+----------------------------|        +--------+
251*4882a593Smuzhiyun  +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
252*4882a593Smuzhiyun            +----------------------------+--------+--------+
253*4882a593Smuzhiyun
254*4882a593SmuzhiyunIn this platform we have four DIMMs and two memory controllers in one
255*4882a593Smuzhiyunsocket.  Each unique interface (BLK or PMEM) to DPA space is identified
256*4882a593Smuzhiyunby a region device with a dynamically assigned id (REGION0 - REGION5).
257*4882a593Smuzhiyun
    1. The first portions of DIMM0 and DIMM1 are interleaved as REGION0. A
       single PMEM namespace is created in the REGION0-SPA-range that spans most
       of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
       interleaved system-physical-address range is reclaimed as BLK-aperture
       accessed space starting at DPA-offset (a) into each DIMM.  In that
       reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
       REGION3 where "blk2.0" and "blk3.0" are just human readable names that
       could be set to any user-desired name in the LABEL.

    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
       system-physical-address range, REGION1, that spans those two DIMMs as
       well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM namespace
       named "pm1.0"; the rest is reclaimed in 4 BLK-aperture namespaces (one for
       each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
       "blk5.0".

    3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
       interleaved system-physical-address range (i.e. the DPA addresses past
       offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
       Note that this example shows that BLK-aperture namespaces don't need to
       be contiguous in DPA-space.

    This bus is provided by the kernel under the device
    /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
    tools/testing/nvdimm is loaded.  This module tests not only LIBNVDIMM but
    the acpi_nfit.ko driver as well.
284*4882a593Smuzhiyun
285*4882a593Smuzhiyun
286*4882a593SmuzhiyunLIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
287*4882a593Smuzhiyun========================================================
288*4882a593Smuzhiyun
289*4882a593SmuzhiyunWhat follows is a description of the LIBNVDIMM sysfs layout and a
290*4882a593Smuzhiyuncorresponding object hierarchy diagram as viewed through the LIBNDCTL
291*4882a593SmuzhiyunAPI.  The example sysfs paths and diagrams are relative to the Example
292*4882a593SmuzhiyunNVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
293*4882a593Smuzhiyuntest.
294*4882a593Smuzhiyun
295*4882a593SmuzhiyunLIBNDCTL: Context
296*4882a593Smuzhiyun-----------------
297*4882a593Smuzhiyun
298*4882a593SmuzhiyunEvery API call in the LIBNDCTL library requires a context that holds the
299*4882a593Smuzhiyunlogging parameters and other library instance state.  The library is
300*4882a593Smuzhiyunbased on the libabc template:
301*4882a593Smuzhiyun
302*4882a593Smuzhiyun	https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
303*4882a593Smuzhiyun
304*4882a593SmuzhiyunLIBNDCTL: instantiate a new library context example
305*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
306*4882a593Smuzhiyun
307*4882a593Smuzhiyun::
308*4882a593Smuzhiyun
	#include <ndctl/libndctl.h>

	/* Return a new library context, or NULL on failure. */
	static struct ndctl_ctx *new_ctx(void)
	{
		struct ndctl_ctx *ctx;

		if (ndctl_new(&ctx) == 0)
			return ctx;
		return NULL;
	}
315*4882a593Smuzhiyun
316*4882a593SmuzhiyunLIBNVDIMM/LIBNDCTL: Bus
317*4882a593Smuzhiyun-----------------------
318*4882a593Smuzhiyun
A bus has a 1:1 relationship with an NFIT.  The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it.  The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.
325*4882a593Smuzhiyun
326*4882a593SmuzhiyunLIBNVDIMM: control class device in /sys/class
327*4882a593Smuzhiyun---------------------------------------------
328*4882a593Smuzhiyun
This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::
331*4882a593Smuzhiyun
332*4882a593Smuzhiyun	/sys/class/nd/ndctl0
333*4882a593Smuzhiyun	|-- dev
334*4882a593Smuzhiyun	|-- device -> ../../../ndbus0
335*4882a593Smuzhiyun	|-- subsystem -> ../../../../../../../class/nd
336*4882a593Smuzhiyun
337*4882a593Smuzhiyun
338*4882a593Smuzhiyun
339*4882a593SmuzhiyunLIBNVDIMM: bus
340*4882a593Smuzhiyun--------------
341*4882a593Smuzhiyun
342*4882a593Smuzhiyun::
343*4882a593Smuzhiyun
	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
	       struct nvdimm_bus_descriptor *nfit_desc);
346*4882a593Smuzhiyun
347*4882a593Smuzhiyun::
348*4882a593Smuzhiyun
349*4882a593Smuzhiyun	/sys/devices/platform/nfit_test.0/ndbus0
350*4882a593Smuzhiyun	|-- commands
351*4882a593Smuzhiyun	|-- nd
352*4882a593Smuzhiyun	|-- nfit
353*4882a593Smuzhiyun	|-- nmem0
354*4882a593Smuzhiyun	|-- nmem1
355*4882a593Smuzhiyun	|-- nmem2
356*4882a593Smuzhiyun	|-- nmem3
357*4882a593Smuzhiyun	|-- power
358*4882a593Smuzhiyun	|-- provider
359*4882a593Smuzhiyun	|-- region0
360*4882a593Smuzhiyun	|-- region1
361*4882a593Smuzhiyun	|-- region2
362*4882a593Smuzhiyun	|-- region3
363*4882a593Smuzhiyun	|-- region4
364*4882a593Smuzhiyun	|-- region5
365*4882a593Smuzhiyun	|-- uevent
366*4882a593Smuzhiyun	`-- wait_probe
367*4882a593Smuzhiyun
368*4882a593SmuzhiyunLIBNDCTL: bus enumeration example
369*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
370*4882a593Smuzhiyun
371*4882a593SmuzhiyunFind the bus handle that describes the bus from Example NVDIMM Platform::
372*4882a593Smuzhiyun
	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
			const char *provider)
	{
		struct ndctl_bus *bus;

		ndctl_bus_foreach(ctx, bus)
			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
				return bus;

		return NULL;
	}

	bus = get_bus_by_provider(ctx, "nfit_test.0");
386*4882a593Smuzhiyun
387*4882a593Smuzhiyun
388*4882a593SmuzhiyunLIBNVDIMM/LIBNDCTL: DIMM (NMEM)
389*4882a593Smuzhiyun-------------------------------
390*4882a593Smuzhiyun
391*4882a593SmuzhiyunThe DIMM device provides a character device for sending commands to
392*4882a593Smuzhiyunhardware, and it is a container for LABELs.  If the DIMM is defined by
393*4882a593SmuzhiyunNFIT then an optional 'nfit' attribute sub-directory is available to add
394*4882a593SmuzhiyunNFIT-specifics.
395*4882a593Smuzhiyun
396*4882a593SmuzhiyunNote that the kernel device name for "DIMMs" is "nmemX".  The NFIT
397*4882a593Smuzhiyundescribes these devices via "Memory Device to System Physical Address
398*4882a593SmuzhiyunRange Mapping Structure", and there is no requirement that they actually
399*4882a593Smuzhiyunbe physical DIMMs, so we use a more generic name.
400*4882a593Smuzhiyun
401*4882a593SmuzhiyunLIBNVDIMM: DIMM (NMEM)
402*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^
403*4882a593Smuzhiyun
404*4882a593Smuzhiyun::
405*4882a593Smuzhiyun
	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
			const struct attribute_group **groups, unsigned long flags,
			unsigned long *dsm_mask);
409*4882a593Smuzhiyun
410*4882a593Smuzhiyun::
411*4882a593Smuzhiyun
412*4882a593Smuzhiyun	/sys/devices/platform/nfit_test.0/ndbus0
413*4882a593Smuzhiyun	|-- nmem0
414*4882a593Smuzhiyun	|   |-- available_slots
415*4882a593Smuzhiyun	|   |-- commands
416*4882a593Smuzhiyun	|   |-- dev
417*4882a593Smuzhiyun	|   |-- devtype
418*4882a593Smuzhiyun	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
419*4882a593Smuzhiyun	|   |-- modalias
420*4882a593Smuzhiyun	|   |-- nfit
421*4882a593Smuzhiyun	|   |   |-- device
422*4882a593Smuzhiyun	|   |   |-- format
423*4882a593Smuzhiyun	|   |   |-- handle
424*4882a593Smuzhiyun	|   |   |-- phys_id
425*4882a593Smuzhiyun	|   |   |-- rev_id
426*4882a593Smuzhiyun	|   |   |-- serial
427*4882a593Smuzhiyun	|   |   `-- vendor
428*4882a593Smuzhiyun	|   |-- state
429*4882a593Smuzhiyun	|   |-- subsystem -> ../../../../../bus/nd
430*4882a593Smuzhiyun	|   `-- uevent
431*4882a593Smuzhiyun	|-- nmem1
432*4882a593Smuzhiyun	[..]
433*4882a593Smuzhiyun
434*4882a593Smuzhiyun
435*4882a593SmuzhiyunLIBNDCTL: DIMM enumeration example
436*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
437*4882a593Smuzhiyun
Note, in this example we are assuming NFIT-defined DIMMs, which are
identified by an "nfit_handle", a 32-bit value where:
440*4882a593Smuzhiyun
441*4882a593Smuzhiyun   - Bit 3:0 DIMM number within the memory channel
442*4882a593Smuzhiyun   - Bit 7:4 memory channel number
443*4882a593Smuzhiyun   - Bit 11:8 memory controller ID
444*4882a593Smuzhiyun   - Bit 15:12 socket ID (within scope of a Node controller if node
445*4882a593Smuzhiyun     controller is present)
446*4882a593Smuzhiyun   - Bit 27:16 Node Controller ID
447*4882a593Smuzhiyun   - Bit 31:28 Reserved
448*4882a593Smuzhiyun
449*4882a593Smuzhiyun::
450*4882a593Smuzhiyun
	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
	       unsigned int handle)
	{
		struct ndctl_dimm *dimm;

		ndctl_dimm_foreach(bus, dimm)
			if (ndctl_dimm_get_handle(dimm) == handle)
				return dimm;

		return NULL;
	}

	/* Arguments are parenthesized so that expressions can be passed safely. */
	#define DIMM_HANDLE(n, s, i, c, d) \
		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
		 | (((c) & 0xf) << 4) | ((d) & 0xf))

	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
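
The inverse of ``DIMM_HANDLE()`` can be handy when interpreting a handle read
from sysfs.  The following self-contained sketch unpacks the bit fields listed
above; ``struct nfit_handle_fields`` and ``nfit_handle_decode()`` are
hypothetical helper names and are not part of LIBNDCTL.

```c
#include <stdint.h>

/* Hypothetical helper type mirroring the nfit_handle bit fields. */
struct nfit_handle_fields {
	unsigned int node;	/* bits 27:16, node controller ID */
	unsigned int socket;	/* bits 15:12, socket ID */
	unsigned int imc;	/* bits 11:8, memory controller ID */
	unsigned int channel;	/* bits 7:4, memory channel number */
	unsigned int dimm;	/* bits 3:0, DIMM number within the channel */
};

/* Split a 32-bit nfit_handle into its fields; bits 31:28 are ignored. */
static struct nfit_handle_fields nfit_handle_decode(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.node = (handle >> 16) & 0xfff;
	f.socket = (handle >> 12) & 0xf;
	f.imc = (handle >> 8) & 0xf;
	f.channel = (handle >> 4) & 0xf;
	f.dimm = handle & 0xf;
	return f;
}
```

For instance, a handle of ``0x101`` decodes to memory controller 1, channel 0,
DIMM 1 on socket 0.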
468*4882a593Smuzhiyun
469*4882a593SmuzhiyunLIBNVDIMM/LIBNDCTL: Region
470*4882a593Smuzhiyun--------------------------
471*4882a593Smuzhiyun
A generic REGION device is registered for each PMEM range or BLK-aperture
set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
sets on the "nfit_test.0" bus.  The primary role of a region is to be a
container of "mappings".  A mapping is a tuple of <DIMM,
DPA-start-offset, length>.
477*4882a593Smuzhiyun
478*4882a593SmuzhiyunLIBNVDIMM provides a built-in driver for these REGION devices.  This driver
479*4882a593Smuzhiyunis responsible for reconciling the aliased DPA mappings across all
480*4882a593Smuzhiyunregions, parsing the LABEL, if present, and then emitting NAMESPACE
481*4882a593Smuzhiyundevices with the resolved/exclusive DPA-boundaries for the nd_pmem or
482*4882a593Smuzhiyunnd_blk device driver to consume.
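
The mapping tuple and the notion of aliased DPA can be illustrated with a
small sketch.  ``struct toy_mapping`` is a hypothetical userspace mirror of
the <DIMM, DPA-start-offset, length> tuple, not a kernel structure, and the
check below only shows what "aliased" means; it is not the reconciliation
algorithm the region driver uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of a mapping tuple; not a kernel structure. */
struct toy_mapping {
	unsigned int dimm;	/* which DIMM (nmemX) the mapping refers to */
	uint64_t dpa_start;	/* DPA-start-offset */
	uint64_t length;	/* length in bytes */
};

/* True when both mappings cover at least one common DPA on the same DIMM. */
static bool toy_mappings_alias(const struct toy_mapping *a,
		const struct toy_mapping *b)
{
	if (a->dimm != b->dimm)
		return false;
	return a->dpa_start < b->dpa_start + b->length &&
	       b->dpa_start < a->dpa_start + a->length;
}
```

Two regions whose mappings alias in this sense are the case where a LABEL is
needed to decide which interface owns each DPA extent.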
483*4882a593Smuzhiyun
In addition to the generic attributes of "mapping"s, "interleave_ways",
and "size", the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.
491*4882a593Smuzhiyun
492*4882a593SmuzhiyunLIBNVDIMM: region::
493*4882a593Smuzhiyun
	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
			struct nd_region_desc *ndr_desc);
	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
			struct nd_region_desc *ndr_desc);
498*4882a593Smuzhiyun
499*4882a593Smuzhiyun::
500*4882a593Smuzhiyun
501*4882a593Smuzhiyun	/sys/devices/platform/nfit_test.0/ndbus0
502*4882a593Smuzhiyun	|-- region0
503*4882a593Smuzhiyun	|   |-- available_size
504*4882a593Smuzhiyun	|   |-- btt0
505*4882a593Smuzhiyun	|   |-- btt_seed
506*4882a593Smuzhiyun	|   |-- devtype
507*4882a593Smuzhiyun	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
508*4882a593Smuzhiyun	|   |-- init_namespaces
509*4882a593Smuzhiyun	|   |-- mapping0
510*4882a593Smuzhiyun	|   |-- mapping1
511*4882a593Smuzhiyun	|   |-- mappings
512*4882a593Smuzhiyun	|   |-- modalias
513*4882a593Smuzhiyun	|   |-- namespace0.0
514*4882a593Smuzhiyun	|   |-- namespace_seed
515*4882a593Smuzhiyun	|   |-- numa_node
516*4882a593Smuzhiyun	|   |-- nfit
517*4882a593Smuzhiyun	|   |   `-- spa_index
518*4882a593Smuzhiyun	|   |-- nstype
519*4882a593Smuzhiyun	|   |-- set_cookie
520*4882a593Smuzhiyun	|   |-- size
521*4882a593Smuzhiyun	|   |-- subsystem -> ../../../../../bus/nd
522*4882a593Smuzhiyun	|   `-- uevent
523*4882a593Smuzhiyun	|-- region1
524*4882a593Smuzhiyun	[..]
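
Integer-valued region attributes such as "nstype" can also be read straight
from sysfs without LIBNDCTL.  The helper below is a minimal sketch using only
the C standard library; ``read_int_attr()`` is a hypothetical name, and the
path in the usage note assumes the nfit_test.0 bus from the example platform
is present.

```c
#include <stdio.h>

/*
 * Read a sysfs attribute that holds a single integer.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_int_attr(const char *path, long *value)
{
	FILE *f = fopen(path, "r");
	int matched;

	if (!f)
		return -1;
	matched = fscanf(f, "%ld", value);
	fclose(f);
	return matched == 1 ? 0 : -1;
}
```

A caller might pass
``/sys/devices/platform/nfit_test.0/ndbus0/region0/nstype`` and treat a
return value of -1 as "attribute not available".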
525*4882a593Smuzhiyun
526*4882a593SmuzhiyunLIBNDCTL: region enumeration example
527*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
528*4882a593Smuzhiyun
529*4882a593SmuzhiyunSample region retrieval routines based on NFIT-unique data like
530*4882a593Smuzhiyun"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
531*4882a593SmuzhiyunBLK::
532*4882a593Smuzhiyun
	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
			unsigned int spa_index)
	{
		struct ndctl_region *region;

		ndctl_region_foreach(bus, region) {
			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
				continue;
			if (ndctl_region_get_spa_index(region) == spa_index)
				return region;
		}
		return NULL;
	}

	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
			unsigned int handle)
	{
		struct ndctl_region *region;

		ndctl_region_foreach(bus, region) {
			struct ndctl_mapping *map;

			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
				continue;
			ndctl_mapping_foreach(region, map) {
				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);

				if (ndctl_dimm_get_handle(dimm) == handle)
					return region;
			}
		}
		return NULL;
	}


Why Not Encode the Region Type into the Region Name?
----------------------------------------------------

At first glance it seems that, since NFIT defines just the PMEM and BLK
interface types, REGION devices should simply be named after those
types.  However, the ND subsystem explicitly keeps the REGION name
generic and expects userspace to always consider the region-attributes,
for four reasons:

    1. There are already more than two REGION and "namespace" types.  For
       PMEM there are two subtypes.  As mentioned previously, there is PMEM
       where the constituent DIMM devices are known, and anonymous PMEM.  For
       BLK regions the NFIT specification already anticipates vendor specific
       implementations.  The exact distinction of what a region contains is in
       the region-attributes, not the region-name or the region-devtype.

    2. A region with zero child-namespaces is a possible configuration.  For
       example, the NFIT allows for a DCR to be published without a
       corresponding BLK-aperture.  This equates to a DIMM that can only accept
       control/configuration messages, but no i/o through a descendant block
       device.  Again, this "type" is advertised in the attributes ('mappings'
       == 0) and the name does not tell you much.

    3. What if a third major interface type arises in the future?  Outside
       of vendor specific implementations, it's not difficult to envision a
       third class of interface type beyond BLK and PMEM.  With a generic name
       for the REGION level of the device-hierarchy, old userspace
       implementations can still make sense of new kernel advertised
       region-types.  Userspace can always rely on the generic region
       attributes like "mappings", "size", etc. and the expected child devices
       named "namespace".  This generic format of the device-model hierarchy
       allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
       future-proof.

    4. There are more robust mechanisms for determining the major type of a
       region than a device name.  See the next section, How Do I Determine the
       Major Type of a Region?

How Do I Determine the Major Type of a Region?
----------------------------------------------

Outside of the blanket recommendation of "use libndctl", or simply
looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
"nstype" integer attribute, here are some other options.
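
As a rough illustration of decoding "nstype" by hand, the sketch below
maps the integer to a label.  The enum names and strings are
hypothetical, and the numeric values are local copies of what the
ND_DEVICE_* definitions in linux/ndctl.h are understood to be; check
them against the installed header before relying on them.

```c
#include <stddef.h>

/*
 * Assumed local copies of the ND_DEVICE_* values from <linux/ndctl.h>.
 * Real code should include that header instead of redefining them.
 */
enum nstype_value {
	NSTYPE_DIMM = 1,
	NSTYPE_REGION_PMEM = 2,
	NSTYPE_REGION_BLK = 3,
	NSTYPE_NAMESPACE_IO = 4,
	NSTYPE_NAMESPACE_PMEM = 5,
	NSTYPE_NAMESPACE_BLK = 6,
};

/* Return an illustrative label for an nstype value, or NULL if unknown. */
static const char *nstype_name(int nstype)
{
	switch (nstype) {
	case NSTYPE_DIMM:           return "dimm";
	case NSTYPE_REGION_PMEM:    return "pmem region";
	case NSTYPE_REGION_BLK:     return "blk region";
	case NSTYPE_NAMESPACE_IO:   return "io namespace";
	case NSTYPE_NAMESPACE_PMEM: return "pmem namespace";
	case NSTYPE_NAMESPACE_BLK:  return "blk namespace";
	default:                    return NULL;
	}
}

/* Non-zero when the value denotes one of the two major region types. */
static int nstype_is_region(int nstype)
{
	return nstype == NSTYPE_REGION_PMEM || nstype == NSTYPE_REGION_BLK;
}
```

Returning NULL for unknown values keeps the helper usable if a kernel
later advertises a type this table does not know about.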

1. module alias lookup
^^^^^^^^^^^^^^^^^^^^^^

    The whole point of region/namespace device type differentiation is to
    decide which block-device driver will attach to a given LIBNVDIMM namespace.
    One can simply use the modalias to look up the resulting module.  It's
    important to note that this method is robust in the presence of a
    vendor-specific driver down the road.  If a vendor-specific
    implementation wants to supplant the standard nd_blk driver, it can do
    so with minimal impact to the rest of LIBNVDIMM.

    In fact, a vendor may also want to have a vendor-specific region-driver
    (outside of nd_region).  For example, if a vendor defined its own LABEL
    format it would need its own region driver to parse that LABEL and emit
    the resulting namespaces.  The output from module resolution is more
    accurate than a region-name or region-devtype.

2. udev
^^^^^^^

    The kernel "devtype" is registered in the udev database::

	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
	P: /devices/platform/nfit_test.0/ndbus0/region0
	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
	E: DEVTYPE=nd_pmem
	E: MODALIAS=nd:t2
	E: SUBSYSTEM=nd

	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
	P: /devices/platform/nfit_test.0/ndbus0/region4
	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
	E: DEVTYPE=nd_blk
	E: MODALIAS=nd:t3
	E: SUBSYSTEM=nd

    ...and is available as a region attribute, but keep in mind that the
    "devtype" does not indicate sub-type variations, so scripts should
    also inspect the other attributes.
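
The MODALIAS values in the udev output above have the form "nd:t<N>".
A minimal sketch of extracting that integer in C follows; the function
name is hypothetical, and the assumption that <N> is the same integer
reported by the "nstype" attribute is inferred from the sample output
rather than stated by it.

```c
#include <stdio.h>

/*
 * Parse a MODALIAS string of the form "nd:t<N>" and return N.
 * A single trailing newline is tolerated, since sysfs attributes
 * usually end with one.  Returns -1 if the string does not match.
 */
static int nd_modalias_to_type(const char *modalias)
{
	int type;
	char trailing;
	int n = sscanf(modalias, "nd:t%d%c", &type, &trailing);

	if (n == 1 || (n == 2 && trailing == '\n'))
		return type >= 0 ? type : -1;
	return -1;
}
```

For the two regions shown above this yields 2 for region0 and 3 for
region4.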

3. type specific attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

    As it currently stands a BLK-aperture region will never have a
    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
    that does not allow I/O.  A PMEM region with a "mappings" value of zero
    is a simple system-physical-address range.
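
The "mappings" rules above can be sketched as a small classifier.  The
enum and function names are hypothetical, and the values 2 and 3 are
taken from the MODALIAS examples (nd:t2 for PMEM, nd:t3 for BLK).  The
presence of "nfit/spa_index" is deliberately not used, because its
absence does not distinguish BLK from non-NFIT PMEM.

```c
/* Hypothetical summary labels; not part of any kernel or libndctl API. */
enum region_kind {
	REGION_KIND_UNKNOWN,
	REGION_KIND_PMEM,       /* PMEM with known constituent DIMMs */
	REGION_KIND_PMEM_SPA,   /* PMEM, mappings == 0: plain SPA range */
	REGION_KIND_BLK,        /* BLK with at least one mapping */
	REGION_KIND_BLK_NO_IO,  /* BLK, mappings == 0: DIMM without I/O */
};

/*
 * Classify a region from its "nstype" and "mappings" attribute values.
 * Unknown nstype values are reported as such rather than guessed at.
 */
static enum region_kind classify_region(int nstype, unsigned int mappings)
{
	switch (nstype) {
	case 2: /* PMEM region */
		return mappings ? REGION_KIND_PMEM : REGION_KIND_PMEM_SPA;
	case 3: /* BLK region */
		return mappings ? REGION_KIND_BLK : REGION_KIND_BLK_NO_IO;
	default:
		return REGION_KIND_UNKNOWN;
	}
}
```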


LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries,
surfaces one or more "namespace" devices.  The arrival of a "namespace"
device currently triggers either the nd_blk or nd_pmem driver to load
and register a disk/block device.

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the three major types of NAMESPACE, where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute), namespace2.0 represents a BLK namespace (note that it has a
'sector_size' attribute), and namespace6.0 represents an anonymous
PMEM namespace (note that it has no 'uuid' attribute because it does not
support a LABEL)::

	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
	|-- alt_name
	|-- devtype
	|-- dpa_extents
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	|-- uevent
	`-- uuid
	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
	|-- alt_name
	|-- devtype
	|-- dpa_extents
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- sector_size
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	|-- uevent
	`-- uuid
	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
	|-- block
	|   `-- pmem0
	|-- devtype
	|-- driver -> ../../../../../../bus/nd/drivers/pmem
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	`-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Namespaces are indexed relative to their parent region, as in the
example below.  These indexes are mostly static from boot to boot, but
the subsystem makes no guarantees in this regard.  For a static
namespace identifier use its 'uuid' attribute.

::

  static struct ndctl_namespace
  *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
  {
          struct ndctl_namespace *ndns;

          ndctl_namespace_foreach(region, ndns)
                  if (ndctl_namespace_get_id(ndns) == id)
                          return ndns;

          return NULL;
  }
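
The device names in the sample layouts (namespace0.0 under region0,
namespace2.0 under region2) suggest a "namespace<region>.<id>" pattern.
The helper below is a sketch under that assumption; its name is
hypothetical.  Because the indexes are not guaranteed to be stable,
such a parse is only suitable for display or debugging, not for
persistent identification.

```c
#include <stdio.h>

/*
 * Split a device name such as "namespace2.1" into its region index and
 * region-relative namespace index.  Returns 0 on success and -1 if the
 * name does not match the assumed pattern; the outputs are only written
 * on success.
 */
static int parse_namespace_name(const char *name, unsigned int *region_id,
		unsigned int *ns_id)
{
	unsigned int region, ns;
	char trailing;

	/* exactly two conversions means no unexpected trailing characters */
	if (sscanf(name, "namespace%u.%u%c", &region, &ns, &trailing) != 2)
		return -1;

	*region_id = region;
	*ns_id = ns;
	return 0;
}
```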

LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it.  For the most part the namespace attributes can be set
in any order; the only constraint is that 'uuid' must be set before
'size'.  This enables the kernel to track DPA allocations internally
with a static identifier::

  /* 'struct namespace_parameters' is a caller-defined helper structure */
  static int configure_namespace(struct ndctl_region *region,
                  struct ndctl_namespace *ndns,
                  struct namespace_parameters *parameters)
  {
          char devname[50];
          int rc;

          snprintf(devname, sizeof(devname), "namespace%d.%d",
                          ndctl_region_get_id(region), parameters->id);

          rc = ndctl_namespace_set_alt_name(ndns, devname);
          if (rc < 0)
                  return rc;
          /* 'uuid' must be set prior to setting size! */
          rc = ndctl_namespace_set_uuid(ndns, parameters->uuid);
          if (rc < 0)
                  return rc;
          rc = ndctl_namespace_set_size(ndns, parameters->size);
          if (rc < 0)
                  return rc;
          /* unlike pmem namespaces, blk namespaces have a sector size */
          if (parameters->lbasize) {
                  rc = ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
                  if (rc < 0)
                          return rc;
          }
          /* hand the result of the enable attempt back to the caller */
          return ndctl_namespace_enable(ndns);
  }
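
To make the ordering constraint concrete, here is a small userspace
model that refuses a size until a uuid has been recorded.  It is only
an illustration: 'struct ns_plan' and both functions are hypothetical,
and the choice of -EINVAL is an assumption of this sketch, not a
statement about what the kernel returns.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical staging area for namespace settings before they are applied. */
struct ns_plan {
	char uuid[37];            /* textual UUID, empty string until set */
	unsigned long long size;  /* bytes, 0 until set */
};

/* Record the uuid; only the canonical 36 character form is accepted. */
static int ns_plan_set_uuid(struct ns_plan *plan, const char *uuid)
{
	if (strlen(uuid) != 36)
		return -EINVAL;
	memcpy(plan->uuid, uuid, 37);
	return 0;
}

/* Record the size, enforcing that the uuid was set first. */
static int ns_plan_set_size(struct ns_plan *plan, unsigned long long size)
{
	if (plan->uuid[0] == '\0')
		return -EINVAL;
	plan->size = size;
	return 0;
}
```

Validating the order in userspace first lets a tool report a clear
error before it touches any namespace attribute.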


Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

    1. Why not "volume" for instance?  "volume" ran the risk of confusing
       ND (libnvdimm subsystem) with a volume manager like device-mapper.

    2. The term originated to describe the sub-devices that can be created
       within an NVME controller (see the nvme specification:
       https://www.nvmexpress.org/specifications/), and NFIT namespaces are
       meant to parallel the capabilities and configurability of
       NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------

A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a stacked
block device driver that fronts either the whole block device or a
partition of a block device emitted by either a PMEM or BLK NAMESPACE.

LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one BTT device which is the
seed device.  To activate it set the "namespace", "uuid", and
"sector_size" attributes and then bind the device to the nd_pmem or
nd_blk driver depending on the region type::

	/sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
	|-- namespace
	|-- delete
	|-- devtype
	|-- modalias
	|-- numa_node
	|-- sector_size
	|-- subsystem -> ../../../../../bus/nd
	|-- uevent
	`-- uuid

LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to namespaces, an idle BTT device is automatically created per
region.  Each time this "seed" btt device is configured and enabled, a new
seed is created.  Creating a BTT configuration involves two steps:
finding an idle BTT and assigning it to consume a PMEM or BLK namespace::

	static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
	{
		struct ndctl_btt *btt;

		ndctl_btt_foreach(region, btt)
			if (!ndctl_btt_is_enabled(btt)
					&& !ndctl_btt_is_configured(btt))
				return btt;

		return NULL;
	}

	static int configure_btt(struct ndctl_region *region,
			struct btt_parameters *parameters)
	{
		struct ndctl_btt *btt = get_idle_btt(region);

		/* no idle seed device is available in this region */
		if (!btt)
			return -ENXIO;

		ndctl_btt_set_uuid(btt, parameters->uuid);
		ndctl_btt_set_sector_size(btt, parameters->sector_size);
		ndctl_btt_set_namespace(btt, parameters->ndns);
		/* turn off raw mode device */
		ndctl_namespace_disable(parameters->ndns);
		/* turn on btt access */
		return ndctl_btt_enable(btt);
	}

Once instantiated, a new inactive btt seed device will appear underneath
the region.

Once a "namespace" is removed from a BTT, that instance of the BTT device
will be deleted or otherwise reset to default values.  This deletion is
only at the device model level.  In order to destroy a BTT the "info
block" needs to be destroyed.  Note that to destroy a BTT the media
needs to be written in raw mode.  By default, the kernel will autodetect
the presence of a BTT and disable raw mode.  This autodetect behavior
can be suppressed by enabling raw mode for the namespace via the
ndctl_namespace_set_raw_mode() API.


Summary LIBNDCTL Diagram
------------------------

For the given example above, here is the view of the objects as seen by the
LIBNDCTL API::

              +---+
              |CTX|    +---------+   +--------------+  +---------------+
              +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
                |    | +---------+   +--------------+  +---------------+
  +-------+     |    | +---------+   +--------------+  +---------------+
  | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
  +-------+ |   |    | +---------+   +--------------+  +---------------+
  | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
  +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
  | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
  +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
  | DIMM3 <-+        |               +--------------+  +----------------------+
  +-------+          | +---------+   +--------------+  +---------------+
                     +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
                     | +---------+ | +--------------+  +----------------------+
                     |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
                     |               +--------------+  +----------------------+
                     | +---------+   +--------------+  +---------------+
                     +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
                     | +---------+   +--------------+  +---------------+
                     | +---------+   +--------------+  +----------------------+
                     +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
                       +---------+   +--------------+  +---------------+------+