=============================
BTT - Block Translation Table
=============================


1. Introduction
===============

Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.

The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
the heart of it is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.


2. Static Layout
================

The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".

Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout::


    Backing Store     +------->  Arena
  +---------------+   |   +------------------+
  |               |   |   | Arena info block |
  |    Arena 0    +---+   |       4K         |
  |     512G      |       +------------------+
  |               |       |                  |
  +---------------+       |                  |
  |               |       |                  |
  |    Arena 1    |       |   Data Blocks    |
  |     512G      |       |                  |
  |               |       |                  |
  +---------------+       |                  |
  | .             |       |                  |
  | .             |       |                  |
  | .             |       |                  |
  |               |       |                  |
  |               |       |                  |
  +---------------+       +------------------+
                          |                  |
                          |     BTT Map      |
                          |                  |
                          |                  |
                          +------------------+
                          |                  |
                          |     BTT Flog     |
                          |                  |
                          +------------------+
                          | Info block copy  |
                          |       4K         |
                          +------------------+
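
The carve-up of the backing store shown in the left column can be sketched as
below. This is a minimal user-space illustration, not the kernel's
implementation; the names (ARENA_MAX_SIZE, split_into_arenas) are invented for
this example::

  /*
   * Hypothetical sketch: carve a raw device of 'rawsize' bytes into BTT
   * arenas of at most 512 GiB each, mirroring the "Backing Store" column
   * of the diagram above.
   */
  #include <stdint.h>
  #include <stdio.h>

  #define ARENA_MAX_SIZE  (512ULL << 30)  /* 512 GiB */

  static void split_into_arenas(uint64_t rawsize)
  {
          uint64_t remaining = rawsize;
          unsigned int arena = 0;

          while (remaining > 0) {
                  uint64_t arena_size = remaining > ARENA_MAX_SIZE ?
                                        ARENA_MAX_SIZE : remaining;

                  printf("Arena %u: %llu bytes\n", arena++,
                         (unsigned long long)arena_size);
                  remaining -= arena_size;
          }
  }

  int main(void)
  {
          split_into_arenas(1024ULL << 30);  /* the 1024G disk used below */
          return 0;
  }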


3. Theory of Operation
======================


a. The BTT Map
--------------

The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining bits form the internal block number.

======== =============================================================
Bit      Description
======== =============================================================
31 - 30  Error and Zero flags - Used in the following way::

           == ==  ====================================================
           31 30  Description
           == ==  ====================================================
           0  0   Initial state. Reads return zeroes; Premap = Postmap
           0  1   Zero state: Reads return zeroes
           1  0   Error state: Reads fail; Writes clear 'E' bit
           1  1   Normal Block – has valid postmap
           == ==  ====================================================

29 - 0   Mappings to internal 'postmap' blocks
======== =============================================================


Some of the terminology that will be subsequently used:

============ ================================================================
External LBA LBA as made visible to upper layers.
ABA          Arena Block Address - Block offset/number within an arena
Premap ABA   The block offset into an arena, which was decided upon by range
             checking the External LBA
Postmap ABA  The block number in the "Data Blocks" area obtained after
             indirection from the map
nfree        The number of free blocks that are maintained at any given time.
             This is the number of concurrent writes that can happen to the
             arena.
============ ================================================================


For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find that the mapping for
block 'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
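
As a sketch of the map entry format and the premap calculation above (the names
and masks here are invented for illustration and are not the kernel's
identifiers)::

  #include <stdint.h>

  /* The four states encoded by bits 31-30, as listed in the table above */
  enum map_state {
          MAP_INITIAL = 0,  /* 00: reads return zeroes; premap == postmap */
          MAP_ZERO    = 1,  /* 01: reads return zeroes */
          MAP_ERROR   = 2,  /* 10: reads fail; a write clears the 'E' bit */
          MAP_NORMAL  = 3,  /* 11: entry holds a valid postmap ABA */
  };

  /* Split a 32-bit map entry into its state and postmap ABA */
  static enum map_state map_entry_decode(uint32_t entry, uint32_t *postmap_aba)
  {
          *postmap_aba = entry & 0x3fffffff;      /* bits 29 - 0 */
          return (enum map_state)(entry >> 30);   /* bits 31 - 30 */
  }

  /*
   * Range-check an external LBA into an arena, as in the 1024G/768G example.
   * arena_size is the number of external blocks each arena contributes.
   */
  static void lba_to_premap_aba(uint64_t external_lba, uint64_t arena_size,
                                uint64_t *arena, uint64_t *premap_aba)
  {
          *arena = external_lba / arena_size;
          *premap_aba = external_lba % arena_size;
  }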


b. The BTT Flog
---------------

The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:

======== =====================================================================
lba      The premap ABA that is being written to
old_map  The old postmap ABA - after 'this' write completes, this will be a
         free block.
new_map  The new postmap ABA. The map will be updated to reflect this
         lba->postmap_aba mapping, but we log it here in case we have to
         recover.
seq      Sequence number to mark which of the 2 sections of this flog entry is
         valid/newest. It cycles between 01->10->11->01 (binary) under normal
         operation, with 00 indicating an uninitialized state.
lba'     alternate lba entry
old_map' alternate old postmap entry
new_map' alternate new postmap entry
seq'     alternate sequence number.
======== =====================================================================

Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:

a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.
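
The flog entry described above can be pictured as the following structure. This
is an illustrative sketch only; the field names follow the table, but it is not
the on-media structure definition used by the kernel's BTT driver::

  #include <stdint.h>

  struct flog_entry {
          uint32_t lba;           /* premap ABA being written */
          uint32_t old_map;       /* old postmap ABA, freed once this write lands */
          uint32_t new_map;       /* new postmap ABA being installed in the map */
          uint32_t seq;           /* 2-bit sequence number of this section */
          uint32_t lba_alt;       /* alternate section: lba' */
          uint32_t old_map_alt;   /* old_map' */
          uint32_t new_map_alt;   /* new_map' */
          uint32_t seq_alt;       /* seq' */
          uint8_t  pad[32];       /* pad the 32-byte entry out to 64 bytes */
  };

  /* Advance a sequence number along the 01 -> 10 -> 11 -> 01 cycle
   * (00 means uninitialized) */
  static uint32_t next_seq(uint32_t seq)
  {
          return (seq == 3) ? 1 : seq + 1;
  }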


c. The concept of lanes
-----------------------

While 'nfree' describes the number of IOs an arena can process concurrently,
'nlanes' is the number of IOs the BTT device as a whole can process::

        nlanes = min(nfree, num_cpus)

A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
protected by spinlocks.


d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------

Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.

The RTT is a simple, per-arena table with 'nfree' entries. Every reader inserts
the postmap ABA it is reading into rtt[lane_number], and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.


e. In-memory data structure: map locks
--------------------------------------

Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps::

        free[lane] = map[premap_aba]
        map[premap_aba] = postmap_aba

Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.

To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).


f. Reconstruction from the Flog
-------------------------------

On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple, and are sketched in code
after this list:

- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
  (This case can only be caused by power-fails/unsafe shutdowns)
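
A minimal sketch of the rule, assuming the newest section of the flog entry has
already been selected; the helper names are invented for this example, and the
map entry's flag bits are simply stripped for brevity::

  #include <stdbool.h>
  #include <stdint.h>

  /* True if sequence number 'b' is newer than 'a' on the 01->10->11->01 cycle;
   * used to pick the newest of the two flog sections. */
  static bool seq_newer(uint32_t b, uint32_t a)
  {
          return b == ((a == 3) ? 1 : a + 1);
  }

  /*
   * Given the newest section of one flog entry and the on-media map,
   * return the internal block that should go back on the free list.
   */
  static uint32_t flog_free_block(uint32_t lba, uint32_t old_map,
                                  uint32_t new_map, const uint32_t *map)
  {
          uint32_t postmap = map[lba] & 0x3fffffff;  /* strip the Z/E flags */

          if (postmap == new_map)
                  return old_map;  /* map update completed: the old block is free */

          /* map update never landed (power-fail): the new block is free */
          return new_map;
  }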


g. Summarizing - Read and Write flows
-------------------------------------

Read:

1. Convert external LBA to arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Read map to get the entry for this pre-map ABA
4. Enter post-map ABA into RTT[lane]
5. If the Zero (TRIM) flag is set in the map entry, return zeroes, and end IO
   (go to step 8)
6. If the Error flag is set in the map entry, end IO with EIO (go to step 8)
7. Read data from this block
8. Remove post-map ABA entry from RTT[lane]
9. Release lane (and lane_lock)

Write:

1. Convert external LBA to Arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Use lane to index into in-memory free list and obtain a new block, next flog
   index, next sequence number
4. Scan the RTT to check if free block is present, and spin/wait if it is.
5. Write data to this free block
6. Read map to get the existing post-map ABA entry for this pre-map ABA
7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8. Write new post-map ABA into map.
9. Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)


4. Error Handling
=================

An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:

- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
  sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding the free list from the flog reveals missing/duplicate/impossible
  entries
- A map entry is out of bounds

If any of these error conditions are encountered, the arena is put into a
read-only state using a flag in the info block.


5. Usage
========

The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
(pmem, or blk mode). The easiest way to set up such a namespace is using the
'ndctl' utility [1].

For example, the ndctl command line to set up a BTT with a 4k sector size is::

        ndctl create-namespace -f -e namespace0.0 -m sector -l 4k

See ndctl create-namespace --help for more options.

[1]: https://github.com/pmem/ndctl