.. SPDX-License-Identifier: GPL-2.0

================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================

Introduction
============

zonefs is a very simple file system exposing each zone of a zoned block device
as a file. Unlike a regular POSIX-compliant file system with native zoned block
device support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing sequential
write zones of the device must be written sequentially starting from the end
of the file (append-only writes).

As such, zonefs is in essence closer to a raw block device access interface
than to a full-featured POSIX file system. The goal of zonefs is to simplify
the implementation of zoned block device support in applications by replacing
raw block device file accesses with a richer file API, rather than relying on
direct block device file ioctls which may be more obscure to developers. One
example of this approach is the implementation of LSM (log-structured merge)
tree structures (such as those used in RocksDB and LevelDB) on zoned block
devices by allowing SSTables to be stored in a zone file, much as in a regular
file system, rather than as a range of sectors of the entire disk. The
introduction of the higher level construct "one file is one zone" can help
reduce the amount of changes needed in the application and make it easier to
support different application programming languages.

Zoned block devices
-------------------

Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.

* Conventional zones: there are no access constraints on LBAs belonging to
  conventional zones. Any read or write access can be executed, similarly to a
  regular block device.
* Sequential zones: these zones accept random reads but must be written
  sequentially. Each sequential zone has a write pointer maintained by the
  device that keeps track of the mandatory start LBA position of the next write
  to the zone. As a result of this write constraint, LBAs in a sequential zone
  cannot be overwritten. Sequential zones must first be erased using a special
  command (zone reset) before rewriting.

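The write pointer rule for sequential zones can be pictured with a small
in-memory model. This is only an illustrative sketch in Python: the
``SequentialZone`` class and its method names are hypothetical and do not
correspond to any kernel or device interface, and LBAs are counted relative to
the zone start.

```python
class SequentialZone:
    """Toy model of a sequential zone (hypothetical, for illustration only)."""

    def __init__(self, capacity_lbas):
        self.capacity = capacity_lbas
        self.write_pointer = 0  # next LBA to write, relative to the zone start

    def write(self, start_lba, nr_lbas):
        # A write must start exactly at the write pointer and fit in the zone.
        if start_lba != self.write_pointer:
            raise ValueError("unaligned write: must start at the write pointer")
        if start_lba + nr_lbas > self.capacity:
            raise ValueError("write exceeds zone capacity")
        self.write_pointer += nr_lbas

    def reset(self):
        # A zone reset rewinds the write pointer to the start of the zone.
        self.write_pointer = 0


zone = SequentialZone(capacity_lbas=1024)
zone.write(0, 8)      # accepted: starts at the write pointer
zone.write(8, 8)      # accepted: continues sequentially
try:
    zone.write(0, 8)  # rejected: would overwrite already written LBAs
except ValueError as err:
    print(err)
zone.reset()
zone.write(0, 8)      # accepted again after the reset
```
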
Zoned storage devices can be implemented using various recording and media
technologies. The most common form of zoned storage today uses the SCSI Zoned
Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
Magnetic Recording (SMR) HDDs.

Solid State Disk (SSD) storage devices can also implement a zoned interface
to, for instance, reduce internal write amplification due to garbage collection.
The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
committee aiming at adding a zoned storage interface to the NVMe protocol.

Zonefs Overview
===============

Zonefs exposes the zones of a zoned block device as files. The files
representing zones are grouped by zone type, and each zone type is represented
by a sub-directory. This file structure is built entirely using zone information
provided by the device and so does not require any complex on-disk metadata
structure.

On-disk metadata
----------------

zonefs on-disk metadata is reduced to an immutable super block which
persistently stores a magic number and optional feature flags and values. On
mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
and populates the mount point with a static file tree solely based on this
information. File sizes come from the device zone type and write pointer
position managed by the device itself.

The super block is always written on disk at sector 0. The first zone of the
device storing the super block is never exposed as a zone file by zonefs. If
the zone containing the super block is a sequential zone, the mkzonefs format
tool always "finishes" the zone, that is, it transitions the zone to a full
state to make it read-only, preventing any data write.

Zone type sub-directories
-------------------------

Files representing zones of the same type are grouped together under the same
sub-directory automatically created on mount.

For conventional zones, the sub-directory "cnv" is used. However, this
directory is created if and only if the device has usable conventional zones.
If the device only has a single conventional zone at sector 0, the zone will
not be exposed as a file as it will be used to store the zonefs super block.
For such devices, the "cnv" sub-directory will not be created.

For sequential write zones, the sub-directory "seq" is used.

These two directories are the only directories that exist in zonefs. Users
cannot create other directories and can neither rename nor delete the "cnv" and
"seq" sub-directories.

The size of a directory, as reported by the st_size field of struct stat
obtained with the stat() or fstat() system calls, indicates the number of files
existing under the directory.

Zone files
----------

Zone files are named using the number of the zone they represent within the set
of zones of a particular type. That is, both the "cnv" and "seq" directories
contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.

Read and write operations on zone files are not allowed beyond the file
maximum size, that is, beyond the zone capacity. Any access exceeding the zone
capacity fails with the -EFBIG error.

Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.

The number of blocks of a file as reported by stat() and fstat() indicates the
capacity of the zone file, or in other words, the maximum file size.

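As a minimal sketch of how an application might read these two values, the
Python snippet below derives the current size and the maximum size of a zone
file from a stat result. The helper name ``zone_file_usage`` is hypothetical,
and the stand-in object uses assumed figures for an empty 256 MiB zone so that
the snippet runs without a zonefs mount.

```python
from types import SimpleNamespace

SECTOR_SIZE = 512  # st_blocks is counted in 512-byte units


def zone_file_usage(st):
    """Return (current_size, capacity) in bytes from a zone file stat result."""
    capacity = st.st_blocks * SECTOR_SIZE  # maximum file size
    return st.st_size, capacity


# On a mounted zonefs, one would pass os.stat() of a zone file instead.
# The values below are assumed figures for an empty 256 MiB zone.
fake = SimpleNamespace(st_size=0, st_blocks=524288)
print(zone_file_usage(fake))
```
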
Conventional zone files
-----------------------

The size of conventional zone files is fixed to the size of the zone they
represent. Conventional zone files cannot be truncated.

These files can be randomly read and written using any type of I/O operation:
buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
constraints for these files beyond the file size limit mentioned above.

Sequential zone files
---------------------

The size of sequential zone files grouped in the "seq" sub-directory represents
the file's zone write pointer position relative to the zone start sector.

Sequential zone files can only be written sequentially, starting from the file
end, that is, write operations can only be append writes. Zonefs makes no
attempt at accepting random writes and will fail any write request that has a
start offset not corresponding to the end of the file, or to the end of the last
write issued and still in-flight (for asynchronous I/O operations).

Since dirty page writeback by the page cache does not guarantee a sequential
write pattern, zonefs prevents buffered writes and writeable shared mappings
on sequential files. Only direct I/O writes are accepted for these files.
zonefs relies on the sequential delivery of write I/O requests to the device
implemented by the block layer elevator. An elevator implementing the sequential
write feature for zoned block devices (ELEVATOR_F_ZBD_SEQ_WRITE elevator
feature) must be used. This type of elevator (e.g. mq-deadline) is set by
default for zoned block devices on device initialization.

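The append-only, direct I/O write rule could be exercised from user space along
the lines of the Python sketch below. The helper name ``append_direct`` and its
``extra_flags`` parameter are assumptions made for illustration; the sketch
also assumes that the data length is a multiple of the device block size, and
it uses an anonymous memory mapping only as a convenient way to obtain a
page-aligned buffer.

```python
import mmap
import os


def append_direct(path, data, extra_flags=getattr(os, "O_DIRECT", 0)):
    """Append data at the current end of a file using a page-aligned buffer.

    Returns the file size after the write. The length of data is assumed to
    be a multiple of the device block size, as direct I/O requires.
    """
    fd = os.open(path, os.O_WRONLY | extra_flags)
    try:
        end = os.fstat(fd).st_size      # the only offset a sequential file accepts
        buf = mmap.mmap(-1, len(data))  # anonymous mapping: page-aligned memory
        try:
            buf.write(data)
            written = os.pwrite(fd, buf, end)
        finally:
            buf.close()
        return end + written
    finally:
        os.close(fd)
```

On a mounted zonefs, a call such as ``append_direct("/mnt/seq/0", bytes(4096))``
would be the rough equivalent of a single 4096-byte direct write with dd.
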
There are no restrictions on the type of I/O used for read operations in
sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
all accepted.

Truncating sequential zone files is allowed only down to 0, in which case, the
zone is reset to rewind the file zone write pointer position to the start of
the zone, or up to the zone capacity, in which case the file's zone is
transitioned to the FULL state (finish zone operation).

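The two permitted truncation targets can be summarized with the short Python
sketch below. Both function names are hypothetical; the capacity is derived
from the number of blocks reported by stat(), as described in the "Zone files"
section, and the ValueError only stands in for whatever error the file system
returns.

```python
import os

SECTOR_SIZE = 512  # st_blocks is counted in 512-byte units


def truncate_action(new_size, capacity):
    """Map a requested size to the zone operation it triggers."""
    if new_size == 0:
        return "reset"   # rewind the write pointer to the zone start
    if new_size == capacity:
        return "finish"  # transition the zone to the FULL state
    raise ValueError("only 0 or the zone capacity are valid truncation sizes")


def truncate_zone_file(path, new_size):
    """Check the size against the file capacity, then truncate the file."""
    capacity = os.stat(path).st_blocks * SECTOR_SIZE
    action = truncate_action(new_size, capacity)
    os.truncate(path, new_size)
    return action
```
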
Format options
--------------

Several optional features of zonefs can be enabled at format time.

* Conventional zone aggregation: ranges of contiguous conventional zones can be
  aggregated into a single larger file instead of the default one file per zone.
* File ownership: the owner UID and GID of zone files are by default 0 (root)
  but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

IO error handling
-----------------

Zoned block devices may fail I/O requests for reasons similar to regular block
devices, e.g. due to bad sectors. However, in addition to such known I/O
failure patterns, the standards governing zoned block device behavior define
additional conditions that result in I/O errors.

* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
  While the data already written in the zone is still readable, the zone can
  no longer be written. No user action on the zone (zone management command or
  read/write access) can change the zone condition back to a normal read/write
  state. While the reasons for the device to transition a zone to read-only
  state are not defined by the standards, a typical cause for such transition
  would be a defective write head on an HDD (all zones under this head are
  changed to read-only).

* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
  An offline zone can be neither read nor written. No user action can transition
  an offline zone back to an operational good state. Similarly to zone read-only
  transitions, the reasons for a drive to transition a zone to the offline
  condition are undefined. A typical cause would be a defective read-write head
  on an HDD causing all zones on the platter under the broken head to be
  inaccessible.

* Unaligned write errors: These errors result from the host issuing write
  requests with a start sector that does not correspond to a zone write pointer
  position when the write request is executed by the device. Even though zonefs
  enforces sequential file write for sequential zones, unaligned write errors
  may still happen in the case of a partial failure of a very large direct I/O
  operation split into multiple BIOs/requests or asynchronous I/O operations.
  If one of the write requests within the set of sequential write requests
  issued to the device fails, all write requests queued after it will
  become unaligned and fail.

* Delayed write errors: similarly to regular block devices, if the device side
  write cache is enabled, write errors may occur in ranges of previously
  completed writes when the device write cache is flushed, e.g. on fsync().
  Similarly to the previous immediate unaligned write error case, delayed write
  errors can propagate through a stream of cached sequential data for a zone
  causing all data to be dropped after the sector that caused the error.

All I/O errors detected by zonefs are reported to the user as an error code
returned by the system call that triggered or detected the error. The recovery
actions taken by zonefs in response to I/O errors depend on the I/O type (read
vs write) and on the reason for the error (bad sector, unaligned writes or zone
condition change).

* For read I/O errors, zonefs does not execute any particular recovery action,
  but only if the file zone is still in a good condition and there is no
  inconsistency between the file inode size and its zone write pointer position.
  If a problem is detected, I/O error recovery is executed (see the table
  below).

* For write I/O errors, zonefs I/O error recovery is always executed.

* A zone condition change to read-only or offline also always triggers zonefs
  I/O error recovery.

Zonefs minimal I/O error recovery may change a file size and file access
permissions.

* File size changes:
  Immediate or delayed write errors in a sequential zone file may cause the file
  inode size to be inconsistent with the amount of data successfully written in
  the file zone. For instance, the partial failure of a multi-BIO large write
  operation will cause the zone write pointer to advance partially, even though
  the entire write operation will be reported as failed to the user. In such a
  case, the file inode size must be advanced to reflect the zone write pointer
  change and eventually allow the user to restart writing at the end of the
  file.
  A file size may also be reduced to reflect a delayed write error detected on
  fsync(): in this case, the amount of data effectively written in the zone may
  be less than originally indicated by the file inode size. After such an I/O
  error, zonefs always fixes the file inode size to reflect the amount of data
  persistently stored in the file zone.

* Access permission changes:
  A zone condition change to read-only is indicated with a change in the file
  access permissions to render the file read-only. This disables changes to the
  file attributes and data modification. For offline zones, all permissions
  (read and write) to the file are disabled.

Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions::

    +--------------+-----------+-----------------------------------------+
    |              |           |            Post error state             |
    | "errors=xxx" |  device   |                 access permissions      |
    |    mount     |   zone    | file         file          device zone  |
    |    option    | condition | size     read    write    read    write |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | remount-ro   | read-only | as is    yes     no       yes     no    |
    | (default)    | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | zone-ro      | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      |   0      no      no       yes     yes   |
    | zone-offline | read-only |   0      no      no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     yes      yes     yes   |
    | repair       | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+

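For readers who prefer to query the table programmatically, the same rows can
be transcribed into a lookup structure. The Python sketch below is only a
restatement of the table above: the names ``PostErrorState``,
``POST_ERROR_STATE`` and ``post_error_state`` are hypothetical and are not
part of zonefs.

```python
from collections import namedtuple

# size is "fixed" (corrected), "as is" (unchanged) or "0"
PostErrorState = namedtuple(
    "PostErrorState", "size file_read file_write zone_read zone_write")

# Rows that are identical for several mount options
_GOOD_RO = PostErrorState("fixed", True, False, True, True)
_READ_ONLY = PostErrorState("as is", True, False, True, False)
_OFFLINE = PostErrorState("0", False, False, False, False)

POST_ERROR_STATE = {
    ("remount-ro", "good"): _GOOD_RO,
    ("remount-ro", "read-only"): _READ_ONLY,
    ("remount-ro", "offline"): _OFFLINE,
    ("zone-ro", "good"): _GOOD_RO,
    ("zone-ro", "read-only"): _READ_ONLY,
    ("zone-ro", "offline"): _OFFLINE,
    ("zone-offline", "good"): PostErrorState("0", False, False, True, True),
    ("zone-offline", "read-only"): PostErrorState("0", False, False, True, False),
    ("zone-offline", "offline"): _OFFLINE,
    ("repair", "good"): PostErrorState("fixed", True, True, True, True),
    ("repair", "read-only"): _READ_ONLY,
    ("repair", "offline"): _OFFLINE,
}


def post_error_state(errors_option="remount-ro", zone_condition="good"):
    """Look up the post error state for a mount option and zone condition."""
    return POST_ERROR_STATE[(errors_option, zone_condition)]


print(post_error_state("repair", "good"))
```
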
Further notes:

* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
  error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
  permissions to read-only applies to all files. The file system is remounted
  read-only.
* Access permission and file size changes due to the device transitioning zones
  to the offline condition are permanent. Remounting or reformatting the device
  with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
  state.
* File access permission changes to read-only due to the device transitioning
  zones to the read-only condition are permanent. Remounting or reformatting
  the device will not re-enable file write access.
* File access permission changes implied by the remount-ro, zone-ro and
  zone-offline mount options are temporary for zones in a good condition.
  Unmounting and remounting the file system will restore the previous default
  (format time values) access rights to the files affected.
* The repair mount option triggers only the minimal set of I/O error recovery
  actions, that is, file size fixes for zones in a good condition. Zones
  indicated as being read-only or offline by the device still imply changes to
  the zone file access permissions as noted in the table above.

Mount options
-------------

zonefs defines the "errors=<behavior>" mount option to allow the user to
specify zonefs behavior in response to I/O errors, inode size inconsistencies
or zone condition changes. The defined behaviors are as follows:

* remount-ro (default)
* zone-ro
* zone-offline
* repair

The run-time I/O error actions defined for each behavior are detailed in the
previous section. Mount time I/O errors will cause the mount operation to fail.
The handling of read-only zones also differs between mount-time and run-time.
If a read-only zone is found at mount time, the zone is always treated in the
same manner as offline zones, that is, all accesses are disabled and the zone
file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section,
the size of the zone file is left unchanged from its last updated value.

A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
the number of zones that can be active, that is, zones that are in the
implicit open, explicit open or closed conditions. This potential limitation
translates into a risk for applications to see write I/O errors due to this
limit being exceeded if the zone of a file is not already active when a write
request is issued by the user.

To avoid these potential errors, the "explicit-open" mount option forces zones
to be made active using an open zone command when a file is opened for writing
for the first time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. Conversely, the
"explicit-open" mount option will result in a zone close command being issued
to the device on the last close() of a zone file if the zone is neither full
nor empty.

Zonefs User Space Tools
=======================

The mkzonefs tool is used to format zoned block devices for use with zonefs.
This tool is available on GitHub at:

https://github.com/damien-lemoal/zonefs-tools

zonefs-tools also includes a test suite which can be run against any zoned
block device, including a null_blk block device created with zoned mode.

Examples
--------

The following formats a 15 TB host-managed SMR HDD with 256 MB zones, with the
conventional zone aggregation feature enabled::

    # mkzonefs -o aggr_cnv /dev/sdX
    # mount -t zonefs /dev/sdX /mnt
    # ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for that type of zone. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file)::

    # ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

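The figures in this listing can be cross-checked with a little arithmetic. The
Python snippet below is only a worked example: it assumes that the 256 MB zone
size quoted above means 256 MiB (268435456 bytes) and that ls reports its
"total" line in 1 KiB units, which is the usual GNU ls default.

```python
ZONE_SIZE = 256 * 1024 * 1024  # 256 MiB zones, as assumed for this example
CNV_FILE_SIZE = 140391743488   # size of /mnt/cnv/0 reported by ls -l

# Number of conventional zones aggregated into the single "cnv" file
zones, remainder = divmod(CNV_FILE_SIZE, ZONE_SIZE)
print(zones, remainder)        # 523 0

# The same size expressed in 1 KiB units matches the "total" line
print(CNV_FILE_SIZE // 1024)   # 137101312
```

Under these assumptions, the aggregated file spans exactly 523 zones. The zone
holding the super block is not part of this count since it is never exposed as
a file.
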
This aggregated conventional zone file can be used as a regular file::

    # mkfs.ext4 /mnt/cnv/0
    # mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory, which groups the files for sequential write zones,
contains 55356 zone files in this example::

    # ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system::

    # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s

    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any further
write operation::

    # truncate -s 268435456 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and restarting
append-writes to the file::

    # truncate -s 0 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of blocks
of a file as reported by stat() and fstat() indicates the capacity of the file
zone::

    # stat /mnt/seq/0
    File: /mnt/seq/0
    Size: 0         	Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d	Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
    Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
capacity in this example. Note that the "IO Block" field always indicates the
minimum I/O size for writes and corresponds to the device physical sector size.