========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata. It can also use a regular block device together with the zoned
block device; in that case the regular block device will be split logically
into zones with the same size as the zones of the zoned block device. These
zones will be placed in front of the zones from the zoned block device and
will be handled just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.

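As a rough illustration of the layout above, the following Python sketch
estimates how many 4096-byte blocks each metadata region would need and how a
logical block number splits into a chunk number and an offset. The 8-byte
mapping entry size (two 32-bit zone numbers), the per-zone rounding of bitmaps
to whole blocks, and all function names are assumptions made for this sketch;
they are not stated in this document.

```python
BLOCK_SIZE = 4096  # logical block size exposed by dm-zoned, in bytes
MAP_ENTRY_SIZE = 8  # ASSUMPTION: two 32-bit zone numbers per mapping entry


def blocks_needed(nr_bytes):
    """Round a byte count up to whole 4096-byte metadata blocks."""
    return -(-nr_bytes // BLOCK_SIZE)


def metadata_layout(nr_chunks, nr_zones, zone_size):
    """Estimate the number of blocks used by each metadata region."""
    blocks_per_zone = zone_size // BLOCK_SIZE
    # One validity bit per block.
    # ASSUMPTION: each zone's bitmap is rounded up to whole blocks.
    bitmap_bytes_per_zone = -(-blocks_per_zone // 8)
    return {
        "super": 1,
        "mapping": blocks_needed(nr_chunks * MAP_ENTRY_SIZE),
        "bitmaps": nr_zones * blocks_needed(bitmap_bytes_per_zone),
    }


def locate(logical_block, zone_size):
    """Split a logical block number into (chunk number, offset in chunk)."""
    return divmod(logical_block, zone_size // BLOCK_SIZE)
```

With 256 MiB zones, each zone holds 65536 blocks, so under these assumptions
its validity bitmap takes 8 KiB, that is, two metadata blocks.
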
For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.

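The write handling described above can be sketched as a small in-memory model.
This is only an illustration: the ``Zone`` and ``Chunk`` classes, the
``write_block`` function and the ``alloc_buffer`` callback are hypothetical
names, and the model handles a single block per write rather than a BIO.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class Zone:
    conventional: bool
    write_pointer: int = 0  # next writable block offset (sequential zones)
    valid: Set[int] = field(default_factory=set)  # offsets of valid blocks


@dataclass
class Chunk:
    data_zone: Zone
    buffer_zone: Optional[Zone] = None


def write_block(chunk: Chunk, offset: int,
                alloc_buffer: Callable[[], Zone]) -> str:
    """Write one block at `offset` in the chunk and report the path taken."""
    dz = chunk.data_zone
    if dz.conventional or offset == dz.write_pointer:
        # Conventional zone, or write aligned on the write pointer: direct.
        dz.valid.add(offset)
        if not dz.conventional:
            dz.write_pointer += 1
        if chunk.buffer_zone is not None:
            # Keep each block valid in only one of the two zones.
            chunk.buffer_zone.valid.discard(offset)
        return "direct"

    # Unaligned write to a sequential zone: go through a buffer zone.
    if chunk.buffer_zone is None:
        chunk.buffer_zone = alloc_buffer()
    chunk.buffer_zone.valid.add(offset)
    dz.valid.discard(offset)  # the copy in the sequential zone is now stale
    if not dz.valid:
        # Sequential zone fully invalid: the buffer zone becomes primary.
        chunk.data_zone, chunk.buffer_zone = chunk.buffer_zone, None
    return "buffered"
```

Once the last valid block of the sequential zone has been overwritten through
the buffer, the chunk is mapped by a conventional zone only and every later
write takes the direct path.
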
Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation is terminated.

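A minimal sketch of this read resolution, one block at a time, might look as
follows. The function name and the use of Python sets in place of the on-disk
validity bitmaps are assumptions made for illustration.

```python
from typing import Optional, Set


def read_source(offset: int,
                data_valid: Optional[Set[int]],
                buffer_valid: Optional[Set[int]] = None) -> str:
    """Return where a block is read from: "buffer", "data" or "zero".

    `data_valid` is None when the chunk has no mapping; `buffer_valid`
    is None when the chunk has no buffer zone.
    """
    if data_valid is None:
        return "zero"  # unmapped chunk: the read buffer is zeroed
    if buffer_valid is not None and offset in buffer_valid:
        return "buffer"
    if offset in data_valid:
        return "data"
    return "zero"  # block never written, or discarded
```
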
After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.

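One reclaim step, as described above, could be modelled like this. The
``ConvZone`` class, its ``last_used`` timestamp and the list-based free pools
are hypothetical; the sketch only covers a conventional zone whose valid
blocks are moved to a fresh sequential zone.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ConvZone:
    zone_id: int
    chunk: int       # chunk currently stored in this zone
    last_used: int   # hypothetical access timestamp, smaller is older
    valid: Set[int] = field(default_factory=set)


def reclaim_one(used_conv: List[ConvZone], free_seq: List[int],
                free_conv: List[int], mapping: Dict[int, int]
                ) -> Optional[Tuple[int, int, List[int]]]:
    """Reclaim the least recently used conventional zone.

    Returns (freed zone id, target sequential zone id, copied block
    offsets), or None when nothing can be reclaimed.
    """
    if not used_conv or not free_seq:
        return None
    victim = min(used_conv, key=lambda z: z.last_used)
    target = free_seq.pop(0)
    # A sequential zone is written in increasing block order.
    copied = sorted(victim.valid)
    # Only after the copy: remap the chunk, then free the zone for reuse.
    mapping[victim.chunk] = target
    used_conv.remove(victim)
    free_conv.append(victim.zone_id)
    return victim.zone_id, target, copied
```
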
Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata
block updates can be done in the primary metadata set. This ensures that
one of the sets is always consistent (all modifications committed or none
at all). Flush operations are used as a commit point. Upon reception of
a flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
metadata flush is being executed.

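The ordering of the commit steps can be illustrated with a toy model. The
class below, its ``crash_after`` parameter and in particular the recovery rule
(pick the set with the highest generation, preferring the primary set on a
tie) are assumptions made for this sketch; the text above only describes the
write ordering.

```python
class MetadataSets:
    """Toy model of the primary and secondary metadata sets."""

    def __init__(self):
        self.primary = {"gen": 0, "blocks": {}}
        self.secondary = {"gen": 0, "blocks": {}}

    def flush(self, dirty, crash_after=None):
        """Commit `dirty` blocks; `crash_after` simulates a power loss."""
        # Step 1: stage the modified blocks in the secondary set.
        self.secondary["blocks"].update(dirty)
        if crash_after == "stage":
            return
        # Step 2: validate the staged data by bumping the generation
        # counter in the secondary super block.
        self.secondary["gen"] = self.primary["gen"] + 1
        if crash_after == "validate":
            return
        # Step 3: update the primary set in place.
        self.primary["blocks"].update(dirty)
        self.primary["gen"] = self.secondary["gen"]

    def recover(self):
        """ASSUMPTION: use the set with the highest generation counter."""
        newest = max(self.primary, self.secondary, key=lambda s: s["gen"])
        return dict(newest["blocks"])
```

In this model, a crash before the secondary super block is updated leaves the
old metadata in effect, and a crash after it makes the staged metadata
visible, so a reader never sees a partially applied update.
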
If a regular device is used in conjunction with the zoned block device,
a third set of metadata (without the zone bitmaps) is written to the
start of the zoned block device. This metadata has a generation counter of
'0' and will never be updated during normal operation; it just serves for
identification purposes. The first and second copy of the metadata
are located at the start of the regular block device.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex::

	dmzadm --format /dev/sdxx


If two drives are to be used, both devices must be specified, with the
regular block device as the first device.

Ex::

	dmzadm --format /dev/sdxx /dev/sdyy


Formatted device(s) can also be started with the dmzadm utility:

Ex::

	dmzadm --start /dev/sdxx /dev/sdyy


Information about the internal layout and current usage of the zones can
be obtained with the 'status' callback from dmsetup:

Ex::

	dmsetup status /dev/dm-X

will return a line::

	0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number
of unmapped (i.e. free) random zones, <nr_rnd> the total number of random
zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq>
the total number of sequential zones.

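For scripting purposes, a status line in the format shown above can be split
into named fields. The following Python sketch is one possible way to do it;
the function name is hypothetical, and the regular expression assumes exactly
the field layout documented here, so other kernel versions may need a
different pattern.

```python
import re

# Matches the status line layout documented above.
STATUS_RE = re.compile(
    r"^(?P<start>\d+)\s+(?P<size>\d+)\s+zoned\s+"
    r"(?P<nr_zones>\d+)\s+zones\s+"
    r"(?P<nr_unmap_rnd>\d+)/(?P<nr_rnd>\d+)\s+random\s+"
    r"(?P<nr_unmap_seq>\d+)/(?P<nr_seq>\d+)\s+sequential$"
)


def parse_status(line):
    """Parse one dm-zoned status line into a dict of integers."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected dm-zoned status line: %r" % line)
    return {name: int(value) for name, value in match.groupdict().items()}
```

The line would typically come from the output of ``dmsetup status``; how that
output is captured is left to the caller.
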
Normally the reclaim process will be started once fewer than 50 percent
of the random zones are free. In order to start the reclaim process
manually even before reaching this threshold the 'dmsetup message' function
can be used:

Ex::

	dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to sequential
zones.

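The threshold rule above can be checked against the random zone counters
reported by the status line. This is a small sketch with a hypothetical
function name; it only mirrors the 50 percent rule stated here and does not
reflect how the kernel actually schedules reclaim.

```python
def reclaim_expected(nr_unmap_rnd, nr_rnd, threshold_percent=50):
    """Return True when fewer than `threshold_percent` of random zones are free."""
    if nr_rnd == 0:
        return False  # no random zones, nothing to reclaim
    # Integer comparison avoids floating point rounding.
    return nr_unmap_rnd * 100 < threshold_percent * nr_rnd
```
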