xref: /OK3568_Linux_fs/kernel/Documentation/driver-api/nvdimm/btt.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun=============================
2*4882a593SmuzhiyunBTT - Block Translation Table
3*4882a593Smuzhiyun=============================
4*4882a593Smuzhiyun
5*4882a593Smuzhiyun
6*4882a593Smuzhiyun1. Introduction
7*4882a593Smuzhiyun===============
8*4882a593Smuzhiyun
9*4882a593SmuzhiyunPersistent memory based storage is able to perform IO at byte (or more
10*4882a593Smuzhiyunaccurately, cache line) granularity. However, we often want to expose such
11*4882a593Smuzhiyunstorage as traditional block devices. The block drivers for persistent memory
12*4882a593Smuzhiyunwill do exactly this. However, they do not provide any atomicity guarantees.
13*4882a593SmuzhiyunTraditional SSDs typically provide protection against torn sectors in hardware,
14*4882a593Smuzhiyunusing stored energy in capacitors to complete in-flight block writes, or perhaps
15*4882a593Smuzhiyunin firmware. We don't have this luxury with persistent memory - if a write is in
16*4882a593Smuzhiyunprogress, and we experience a power failure, the block will contain a mix of old
17*4882a593Smuzhiyunand new data. Applications may not be prepared to handle such a scenario.
18*4882a593Smuzhiyun
19*4882a593SmuzhiyunThe Block Translation Table (BTT) provides atomic sector update semantics for
20*4882a593Smuzhiyunpersistent memory devices, so that applications that rely on sector writes not
21*4882a593Smuzhiyunbeing torn can continue to do so. The BTT manifests itself as a stacked block
22*4882a593Smuzhiyundevice, and reserves a portion of the underlying storage for its metadata. At
23*4882a593Smuzhiyunthe heart of it, is an indirection table that re-maps all the blocks on the
24*4882a593Smuzhiyunvolume. It can be thought of as an extremely simple file system that only
25*4882a593Smuzhiyunprovides atomic sector updates.
26*4882a593Smuzhiyun
27*4882a593Smuzhiyun
28*4882a593Smuzhiyun2. Static Layout
29*4882a593Smuzhiyun================
30*4882a593Smuzhiyun
31*4882a593SmuzhiyunThe underlying storage on which a BTT can be laid out is not limited in any way.
32*4882a593SmuzhiyunThe BTT, however, splits the available space into chunks of up to 512 GiB,
33*4882a593Smuzhiyuncalled "Arenas".
34*4882a593Smuzhiyun
35*4882a593SmuzhiyunEach arena follows the same layout for its metadata, and all references in an
36*4882a593Smuzhiyunarena are internal to it (with the exception of one field that points to the
37*4882a593Smuzhiyunnext arena). The following depicts the "On-disk" metadata layout::
38*4882a593Smuzhiyun
39*4882a593Smuzhiyun
40*4882a593Smuzhiyun    Backing Store     +------->  Arena
41*4882a593Smuzhiyun  +---------------+   |   +------------------+
42*4882a593Smuzhiyun  |               |   |   | Arena info block |
43*4882a593Smuzhiyun  |    Arena 0    +---+   |       4K         |
44*4882a593Smuzhiyun  |     512G      |       +------------------+
45*4882a593Smuzhiyun  |               |       |                  |
46*4882a593Smuzhiyun  +---------------+       |                  |
47*4882a593Smuzhiyun  |               |       |                  |
48*4882a593Smuzhiyun  |    Arena 1    |       |   Data Blocks    |
49*4882a593Smuzhiyun  |     512G      |       |                  |
50*4882a593Smuzhiyun  |               |       |                  |
51*4882a593Smuzhiyun  +---------------+       |                  |
52*4882a593Smuzhiyun  |       .       |       |                  |
53*4882a593Smuzhiyun  |       .       |       |                  |
54*4882a593Smuzhiyun  |       .       |       |                  |
55*4882a593Smuzhiyun  |               |       |                  |
56*4882a593Smuzhiyun  |               |       |                  |
57*4882a593Smuzhiyun  +---------------+       +------------------+
58*4882a593Smuzhiyun                          |                  |
59*4882a593Smuzhiyun                          |     BTT Map      |
60*4882a593Smuzhiyun                          |                  |
61*4882a593Smuzhiyun                          |                  |
62*4882a593Smuzhiyun                          +------------------+
63*4882a593Smuzhiyun                          |                  |
64*4882a593Smuzhiyun                          |     BTT Flog     |
65*4882a593Smuzhiyun                          |                  |
66*4882a593Smuzhiyun                          +------------------+
67*4882a593Smuzhiyun                          | Info block copy  |
68*4882a593Smuzhiyun                          |       4K         |
69*4882a593Smuzhiyun                          +------------------+
70*4882a593Smuzhiyun
71*4882a593Smuzhiyun
72*4882a593Smuzhiyun3. Theory of Operation
73*4882a593Smuzhiyun======================
74*4882a593Smuzhiyun
75*4882a593Smuzhiyun
76*4882a593Smuzhiyuna. The BTT Map
77*4882a593Smuzhiyun--------------
78*4882a593Smuzhiyun
79*4882a593SmuzhiyunThe map is a simple lookup/indirection table that maps an LBA to an internal
80*4882a593Smuzhiyunblock. Each map entry is 32 bits. The two most significant bits are special
81*4882a593Smuzhiyunflags, and the remaining form the internal block number.
82*4882a593Smuzhiyun
83*4882a593Smuzhiyun======== =============================================================
84*4882a593SmuzhiyunBit      Description
85*4882a593Smuzhiyun======== =============================================================
86*4882a593Smuzhiyun31 - 30	 Error and Zero flags - Used in the following way::
87*4882a593Smuzhiyun
88*4882a593Smuzhiyun	   == ==  ====================================================
89*4882a593Smuzhiyun	   31 30  Description
90*4882a593Smuzhiyun	   == ==  ====================================================
91*4882a593Smuzhiyun	   0  0	  Initial state. Reads return zeroes; Premap = Postmap
92*4882a593Smuzhiyun	   0  1	  Zero state: Reads return zeroes
93*4882a593Smuzhiyun	   1  0	  Error state: Reads fail; Writes clear 'E' bit
94*4882a593Smuzhiyun	   1  1	  Normal Block – has valid postmap
95*4882a593Smuzhiyun	   == ==  ====================================================
96*4882a593Smuzhiyun
97*4882a593Smuzhiyun29 - 0	 Mappings to internal 'postmap' blocks
98*4882a593Smuzhiyun======== =============================================================
99*4882a593Smuzhiyun
100*4882a593Smuzhiyun
101*4882a593SmuzhiyunSome of the terminology that will be subsequently used:
102*4882a593Smuzhiyun
103*4882a593Smuzhiyun============	================================================================
104*4882a593SmuzhiyunExternal LBA	LBA as made visible to upper layers.
105*4882a593SmuzhiyunABA		Arena Block Address - Block offset/number within an arena
106*4882a593SmuzhiyunPremap ABA	The block offset into an arena, which was decided upon by range
107*4882a593Smuzhiyun		checking the External LBA
108*4882a593SmuzhiyunPostmap ABA	The block number in the "Data Blocks" area obtained after
109*4882a593Smuzhiyun		indirection from the map
110*4882a593Smuzhiyunnfree		The number of free blocks that are maintained at any given time.
111*4882a593Smuzhiyun		This is the number of concurrent writes that can happen to the
112*4882a593Smuzhiyun		arena.
113*4882a593Smuzhiyun============	================================================================
114*4882a593Smuzhiyun
115*4882a593Smuzhiyun
116*4882a593SmuzhiyunFor example, after adding a BTT, we surface a disk of 1024G. We get a read for
117*4882a593Smuzhiyunthe external LBA at 768G. This falls into the second arena, and of the 512G
118*4882a593Smuzhiyunworth of blocks that this arena contributes, this block is at 256G. Thus, the
119*4882a593Smuzhiyunpremap ABA is 256G. We now refer to the map, and find out the mapping for block
120*4882a593Smuzhiyun'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
121*4882a593Smuzhiyun
122*4882a593Smuzhiyun
123*4882a593Smuzhiyunb. The BTT Flog
124*4882a593Smuzhiyun---------------
125*4882a593Smuzhiyun
126*4882a593SmuzhiyunThe BTT provides sector atomicity by making every write an "allocating write",
127*4882a593Smuzhiyuni.e. Every write goes to a "free" block. A running list of free blocks is
128*4882a593Smuzhiyunmaintained in the form of the BTT flog. 'Flog' is a combination of the words
129*4882a593Smuzhiyun"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
130*4882a593Smuzhiyun
131*4882a593Smuzhiyun========  =====================================================================
132*4882a593Smuzhiyunlba       The premap ABA that is being written to
133*4882a593Smuzhiyunold_map   The old postmap ABA - after 'this' write completes, this will be a
134*4882a593Smuzhiyun	  free block.
135*4882a593Smuzhiyunnew_map   The new postmap ABA. The map will up updated to reflect this
136*4882a593Smuzhiyun	  lba->postmap_aba mapping, but we log it here in case we have to
137*4882a593Smuzhiyun	  recover.
138*4882a593Smuzhiyunseq	  Sequence number to mark which of the 2 sections of this flog entry is
139*4882a593Smuzhiyun	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
140*4882a593Smuzhiyun	  operation, with 00 indicating an uninitialized state.
141*4882a593Smuzhiyunlba'	  alternate lba entry
142*4882a593Smuzhiyunold_map'  alternate old postmap entry
143*4882a593Smuzhiyunnew_map'  alternate new postmap entry
144*4882a593Smuzhiyunseq'	  alternate sequence number.
145*4882a593Smuzhiyun========  =====================================================================
146*4882a593Smuzhiyun
147*4882a593SmuzhiyunEach of the above fields is 32-bit, making one entry 32 bytes. Entries are also
148*4882a593Smuzhiyunpadded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
149*4882a593Smuzhiyundone such that for any entry being written, it:
150*4882a593Smuzhiyuna. overwrites the 'old' section in the entry based on sequence numbers
151*4882a593Smuzhiyunb. writes the 'new' section such that the sequence number is written last.
152*4882a593Smuzhiyun
153*4882a593Smuzhiyun
154*4882a593Smuzhiyunc. The concept of lanes
155*4882a593Smuzhiyun-----------------------
156*4882a593Smuzhiyun
157*4882a593SmuzhiyunWhile 'nfree' describes the number of concurrent IOs an arena can process
158*4882a593Smuzhiyunconcurrently, 'nlanes' is the number of IOs the BTT device as a whole can
159*4882a593Smuzhiyunprocess::
160*4882a593Smuzhiyun
161*4882a593Smuzhiyun	nlanes = min(nfree, num_cpus)
162*4882a593Smuzhiyun
163*4882a593SmuzhiyunA lane number is obtained at the start of any IO, and is used for indexing into
164*4882a593Smuzhiyunall the on-disk and in-memory data structures for the duration of the IO. If
165*4882a593Smuzhiyunthere are more CPUs than the max number of available lanes, than lanes are
166*4882a593Smuzhiyunprotected by spinlocks.
167*4882a593Smuzhiyun
168*4882a593Smuzhiyun
169*4882a593Smuzhiyund. In-memory data structure: Read Tracking Table (RTT)
170*4882a593Smuzhiyun------------------------------------------------------
171*4882a593Smuzhiyun
172*4882a593SmuzhiyunConsider a case where we have two threads, one doing reads and the other,
173*4882a593Smuzhiyunwrites. We can hit a condition where the writer thread grabs a free block to do
174*4882a593Smuzhiyuna new IO, but the (slow) reader thread is still reading from it. In other words,
175*4882a593Smuzhiyunthe reader consulted a map entry, and started reading the corresponding block. A
176*4882a593Smuzhiyunwriter started writing to the same external LBA, and finished the write updating
177*4882a593Smuzhiyunthe map for that external LBA to point to its new postmap ABA. At this point the
178*4882a593Smuzhiyuninternal, postmap block that the reader is (still) reading has been inserted
179*4882a593Smuzhiyuninto the list of free blocks. If another write comes in for the same LBA, it can
180*4882a593Smuzhiyungrab this free block, and start writing to it, causing the reader to read
181*4882a593Smuzhiyunincorrect data. To prevent this, we introduce the RTT.
182*4882a593Smuzhiyun
183*4882a593SmuzhiyunThe RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
184*4882a593Smuzhiyuninto rtt[lane_number], the postmap ABA it is reading, and clears it after the
185*4882a593Smuzhiyunread is complete. Every writer thread, after grabbing a free block, checks the
186*4882a593SmuzhiyunRTT for its presence. If the postmap free block is in the RTT, it waits till the
187*4882a593Smuzhiyunreader clears the RTT entry, and only then starts writing to it.
188*4882a593Smuzhiyun
189*4882a593Smuzhiyun
190*4882a593Smuzhiyune. In-memory data structure: map locks
191*4882a593Smuzhiyun--------------------------------------
192*4882a593Smuzhiyun
193*4882a593SmuzhiyunConsider a case where two writer threads are writing to the same LBA. There can
194*4882a593Smuzhiyunbe a race in the following sequence of steps::
195*4882a593Smuzhiyun
196*4882a593Smuzhiyun	free[lane] = map[premap_aba]
197*4882a593Smuzhiyun	map[premap_aba] = postmap_aba
198*4882a593Smuzhiyun
199*4882a593SmuzhiyunBoth threads can update their respective free[lane] with the same old, freed
200*4882a593Smuzhiyunpostmap_aba. This has made the layout inconsistent by losing a free entry, and
201*4882a593Smuzhiyunat the same time, duplicating another free entry for two lanes.
202*4882a593Smuzhiyun
203*4882a593SmuzhiyunTo solve this, we could have a single map lock (per arena) that has to be taken
204*4882a593Smuzhiyunbefore performing the above sequence, but we feel that could be too contentious.
205*4882a593SmuzhiyunInstead we use an array of (nfree) map_locks that is indexed by
206*4882a593Smuzhiyun(premap_aba modulo nfree).
207*4882a593Smuzhiyun
208*4882a593Smuzhiyun
209*4882a593Smuzhiyunf. Reconstruction from the Flog
210*4882a593Smuzhiyun-------------------------------
211*4882a593Smuzhiyun
212*4882a593SmuzhiyunOn startup, we analyze the BTT flog to create our list of free blocks. We walk
213*4882a593Smuzhiyunthrough all the entries, and for each lane, of the set of two possible
214*4882a593Smuzhiyun'sections', we always look at the most recent one only (based on the sequence
215*4882a593Smuzhiyunnumber). The reconstruction rules/steps are simple:
216*4882a593Smuzhiyun
217*4882a593Smuzhiyun- Read map[log_entry.lba].
218*4882a593Smuzhiyun- If log_entry.new matches the map entry, then log_entry.old is free.
219*4882a593Smuzhiyun- If log_entry.new does not match the map entry, then log_entry.new is free.
220*4882a593Smuzhiyun  (This case can only be caused by power-fails/unsafe shutdowns)
221*4882a593Smuzhiyun
222*4882a593Smuzhiyun
223*4882a593Smuzhiyung. Summarizing - Read and Write flows
224*4882a593Smuzhiyun-------------------------------------
225*4882a593Smuzhiyun
226*4882a593SmuzhiyunRead:
227*4882a593Smuzhiyun
228*4882a593Smuzhiyun1.  Convert external LBA to arena number + pre-map ABA
229*4882a593Smuzhiyun2.  Get a lane (and take lane_lock)
230*4882a593Smuzhiyun3.  Read map to get the entry for this pre-map ABA
231*4882a593Smuzhiyun4.  Enter post-map ABA into RTT[lane]
232*4882a593Smuzhiyun5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
233*4882a593Smuzhiyun6.  If ERROR flag set in map, end IO with EIO (go to step 8)
234*4882a593Smuzhiyun7.  Read data from this block
235*4882a593Smuzhiyun8.  Remove post-map ABA entry from RTT[lane]
236*4882a593Smuzhiyun9.  Release lane (and lane_lock)
237*4882a593Smuzhiyun
238*4882a593SmuzhiyunWrite:
239*4882a593Smuzhiyun
240*4882a593Smuzhiyun1.  Convert external LBA to Arena number + pre-map ABA
241*4882a593Smuzhiyun2.  Get a lane (and take lane_lock)
242*4882a593Smuzhiyun3.  Use lane to index into in-memory free list and obtain a new block, next flog
243*4882a593Smuzhiyun    index, next sequence number
244*4882a593Smuzhiyun4.  Scan the RTT to check if free block is present, and spin/wait if it is.
245*4882a593Smuzhiyun5.  Write data to this free block
246*4882a593Smuzhiyun6.  Read map to get the existing post-map ABA entry for this pre-map ABA
247*4882a593Smuzhiyun7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
248*4882a593Smuzhiyun8.  Write new post-map ABA into map.
249*4882a593Smuzhiyun9.  Write old post-map entry into the free list
250*4882a593Smuzhiyun10. Calculate next sequence number and write into the free list entry
251*4882a593Smuzhiyun11. Release lane (and lane_lock)
252*4882a593Smuzhiyun
253*4882a593Smuzhiyun
254*4882a593Smuzhiyun4. Error Handling
255*4882a593Smuzhiyun=================
256*4882a593Smuzhiyun
257*4882a593SmuzhiyunAn arena would be in an error state if any of the metadata is corrupted
258*4882a593Smuzhiyunirrecoverably, either due to a bug or a media error. The following conditions
259*4882a593Smuzhiyunindicate an error:
260*4882a593Smuzhiyun
261*4882a593Smuzhiyun- Info block checksum does not match (and recovering from the copy also fails)
262*4882a593Smuzhiyun- All internal available blocks are not uniquely and entirely addressed by the
263*4882a593Smuzhiyun  sum of mapped blocks and free blocks (from the BTT flog).
264*4882a593Smuzhiyun- Rebuilding free list from the flog reveals missing/duplicate/impossible
265*4882a593Smuzhiyun  entries
266*4882a593Smuzhiyun- A map entry is out of bounds
267*4882a593Smuzhiyun
268*4882a593SmuzhiyunIf any of these error conditions are encountered, the arena is put into a read
269*4882a593Smuzhiyunonly state using a flag in the info block.
270*4882a593Smuzhiyun
271*4882a593Smuzhiyun
272*4882a593Smuzhiyun5. Usage
273*4882a593Smuzhiyun========
274*4882a593Smuzhiyun
275*4882a593SmuzhiyunThe BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
276*4882a593Smuzhiyun(pmem, or blk mode). The easiest way to set up such a namespace is using the
277*4882a593Smuzhiyun'ndctl' utility [1]:
278*4882a593Smuzhiyun
279*4882a593SmuzhiyunFor example, the ndctl command line to setup a btt with a 4k sector size is::
280*4882a593Smuzhiyun
281*4882a593Smuzhiyun    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
282*4882a593Smuzhiyun
283*4882a593SmuzhiyunSee ndctl create-namespace --help for more options.
284*4882a593Smuzhiyun
285*4882a593Smuzhiyun[1]: https://github.com/pmem/ndctl
286