========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10 TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.
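
To put the capacity figures above in perspective, the following is a
back-of-the-envelope calculation, not part of dm-zoned itself. It assumes
that "10 TB" means 10 * 10^12 bytes and that "256 MB" means 256 MiB; both
interpretations are assumptions made for this illustration.

```python
# Rough check of the capacity overhead quoted in the text.
# Assumptions (not stated in the text): decimal terabytes for the disk,
# binary megabytes for the zone size.
DISK_BYTES = 10 * 10**12
ZONE_BYTES = 256 * 2**20
INTERNAL_ZONES = 5  # minimum number of zones used internally, per the text

nr_zones = DISK_BYTES // ZONE_BYTES            # number of full zones
internal_bytes = INTERNAL_ZONES * ZONE_BYTES   # capacity kept for internal use
overhead_pct = 100 * internal_bytes / DISK_BYTES

print(f"full zones on the disk: {nr_zones}")
print(f"internal use: {internal_bytes / 2**30:.2f} GiB "
      f"({overhead_pct:.4f}% of the disk)")
```

Under these assumptions the disk holds a little over 37,000 zones, so five
zones reserved for internal use amount to a small fraction of one percent
of the capacity.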

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata. It can also use a regular block device together with the zoned
block device; in that case the regular block device will be split logically
into zones with the same size as the zones of the zoned block device. These
zones will be placed in front of the zones from the zoned block device and
will be handled just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the amount and on-disk position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is valid either in the data zone mapping the chunk
or in the buffer zone of the chunk, never in both.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone.
If the mapping zone is a sequential zone, the write operation is
processed directly only if the write offset within the logical chunk is
equal to the write pointer offset within the sequential data zone (i.e.
the write operation is aligned on the zone write pointer). Otherwise,
write operations are processed indirectly using a buffer zone. In that
case, an unused conventional zone is allocated and assigned to the chunk
being accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk or, if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible.
To avoid this situation, a reclaim process regularly scans used
conventional zones and tries to reclaim the least recently used zones by
copying the valid blocks of the buffer zone to a free sequential zone.
Once the copy completes, the chunk mapping is updated to point to the
sequential zone and the buffer zone is freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata block
updates can be done in the primary metadata set. This ensures that one
of the sets is always consistent (all modifications committed or none at
all). Flush operations are used as a commit point. Upon reception of a
flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and the reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while a
metadata flush is being executed.
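
The staging scheme described above can be illustrated with a small
in-memory model. This is only a sketch and not the kernel implementation:
the class and function names, the ``crash_at`` test hook, the rule of
updating the primary super block last, and the recovery rule of trusting
the set with the newest generation are all assumptions made for the
example.

```python
class MetadataSet:
    """Hypothetical model of one metadata set: generation plus blocks."""
    def __init__(self, generation, blocks):
        self.generation = generation
        self.blocks = dict(blocks)


class CrashError(Exception):
    """Raised to simulate a power loss in the middle of a flush."""


def flush(primary, secondary, dirty, crash_at=None):
    """Commit `dirty` ({block_no: data}), staging it in the secondary set.

    `crash_at` is a hypothetical test hook ("staging" or "primary") that
    interrupts the flush before the second dirty block is written.
    """
    new_gen = primary.generation + 1
    # 1) Stage the modified blocks in the secondary set.
    for i, (block_no, data) in enumerate(dirty.items()):
        if crash_at == "staging" and i == 1:
            raise CrashError("crash while staging")
        secondary.blocks[block_no] = data
    # 2) Validate the staged copy by updating the secondary super block.
    secondary.generation = new_gen
    # 3) Update the primary set in place.
    for i, (block_no, data) in enumerate(dirty.items()):
        if crash_at == "primary" and i == 1:
            raise CrashError("crash while updating the primary set")
        primary.blocks[block_no] = data
    # 4) Assumption: the primary super block is updated last.
    primary.generation = new_gen


def recover(primary, secondary):
    """Assumed recovery rule: trust the set with the newest generation."""
    if secondary.generation > primary.generation:
        return secondary
    return primary


def fresh_sets():
    """Two identical, consistent sets to start each scenario from."""
    blocks = {0: "a", 1: "b"}
    return MetadataSet(1, blocks), MetadataSet(1, blocks)
```

In this model, a crash during staging leaves the primary set untouched, so
recovery returns the old metadata. A crash during the in-place update
leaves a fully staged secondary set with a newer generation, so recovery
returns the new metadata. In both cases the set that is selected contains
either all of the modifications or none of them.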

If a regular device is used in conjunction with the zoned block device,
a third set of metadata (without the zone bitmaps) is written to the
start of the zoned block device. This metadata has a generation counter of
'0' and will never be updated during normal operation; it just serves for
identification purposes. The first and second copies of the metadata
are located at the start of the regular block device.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex::

    dmzadm --format /dev/sdxx


If two drives are to be used, both devices must be specified, with the
regular block device as the first device.

Ex::

    dmzadm --format /dev/sdxx /dev/sdyy


Formatted device(s) can be started with the dmzadm utility, too:

Ex::

    dmzadm --start /dev/sdxx /dev/sdyy


Information about the internal layout and current usage of the zones can
be obtained with the 'status' callback from dmsetup:

Ex::

    dmsetup status /dev/dm-X

will return a line::

    0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number
of unmapped (i.e. free) random zones, <nr_rnd> the total number of random
zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq>
the total number of sequential zones.

Normally the reclaim process will be started once there are fewer than 50
percent free random zones. In order to start the reclaim process manually
even before reaching this threshold the 'dmsetup message' function can be
used:

Ex::

    dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to sequential
zones.
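
For monitoring scripts, the status line described above can be parsed to
check how close a device is to the reclaim threshold. The following is a
minimal sketch that assumes exactly the line format shown in this
document; the function names and the sample line are made up for
illustration and do not come from a real device.

```python
import re

# Matches the status line format documented above.
STATUS_RE = re.compile(
    r"^(?P<start>\d+) (?P<size>\d+) zoned (?P<nr_zones>\d+) zones "
    r"(?P<nr_unmap_rnd>\d+)/(?P<nr_rnd>\d+) random "
    r"(?P<nr_unmap_seq>\d+)/(?P<nr_seq>\d+) sequential$"
)


def parse_status(line):
    """Return the numeric fields of a dm-zoned status line as a dict."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized dm-zoned status line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


def below_reclaim_threshold(status, threshold_pct=50):
    """True if fewer than threshold_pct percent of random zones are free."""
    if status["nr_rnd"] == 0:
        # Arbitrary choice for this sketch: no random zones, nothing to report.
        return False
    return 100 * status["nr_unmap_rnd"] < threshold_pct * status["nr_rnd"]


# Made-up sample line, for illustration only.
sample = "0 19531825152 zoned 37252 zones 120/524 random 30000/36720 sequential"
status = parse_status(sample)
print(status["nr_unmap_rnd"], "of", status["nr_rnd"], "random zones free")
print("below reclaim threshold:", below_reclaim_threshold(status))
```

The comparison is done with integer arithmetic to avoid rounding issues
when the free ratio is close to the threshold.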