.. SPDX-License-Identifier: GPL-2.0

======================================
Enhanced Read-Only File System - EROFS
======================================

Overview
========

EROFS stands for Enhanced Read-Only File System. Unlike other read-only
file systems, it is designed for flexibility and scalability while
remaining simple and fast.

It is designed as a better filesystem solution for the following scenarios:

 - read-only storage media, or

 - part of a fully trusted read-only solution, which must be immutable and
   bit-for-bit identical to the official golden image of a release for
   security and other reasons, or

 - saving extra storage space with guaranteed end-to-end performance by
   using reduced metadata and transparent file compression, especially on
   embedded devices with limited memory (e.g. smartphones).

The main features of EROFS are:

 - Little endian on-disk design;

 - Currently 4KB block size (nobh) and therefore maximum 16TB address space;

 - Metadata and data can be mixed by design;

 - Two inode versions for different requirements:

   =====================  ============  =====================================
                          compact (v1)  extended (v2)
   =====================  ============  =====================================
   Inode metadata size    32 bytes      64 bytes
   Max file size          4 GB          16 EB (also limited by max. vol size)
   Max uids/gids          65536         4294967296
   File change time       no            yes (64 + 32-bit timestamp)
   Max hardlinks          65536         4294967296
   Metadata reserved      4 bytes       14 bytes
   =====================  ============  =====================================

 - Support extended attributes (xattrs) as an option;

 - Support xattr inline and tail-end data inline for all files;

 - Support POSIX.1e ACLs by using xattrs;

 - Support transparent file compression as an option:
   LZ4 algorithm with 4 KB fixed-sized output compression for high performance.
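
The 16TB address space limit in the feature list follows from block
addressing. The sketch below reproduces the arithmetic under the assumption
that block addresses are 32 bits wide; the text above states only the block
size and the resulting limit, and the macro and function names are
illustrative, not kernel identifiers.

```c
#include <stdint.h>

/* Assumption: block addresses are 32 bits wide; names are illustrative. */
#define EXAMPLE_BLOCK_SIZE   4096ULL
#define EXAMPLE_BLKADDR_BITS 32

/* Number of bytes addressable with the assumed block address width. */
static uint64_t example_max_volume_bytes(void)
{
	return (1ULL << EXAMPLE_BLKADDR_BITS) * EXAMPLE_BLOCK_SIZE;
}
```

With these values the result is 2^44 bytes, i.e. 16 TiB.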

The following git tree provides the file system user-space tools under
development (e.g. the formatting tool mkfs.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bug reports and patches are welcome; please send them to the linux-erofs
mailing list:

- linux-erofs mailing list   <linux-erofs@lists.ozlabs.org>

Mount options
=============

===================    =========================================================
(no)user_xattr         Set up extended user attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Set up POSIX access control lists. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                         disabled  In-place I/O decompression only;
                        readahead  Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the remaining
                                   compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the remaining compressed physical
                                   clusters.
                       ==========  =============================================
dax={always,never}     Use direct access (no page cache).  See
                       Documentation/filesystems/dax.rst.
dax                    A legacy option which is an alias for ``dax=always``.
===================    =========================================================
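
As an illustration of how the options in the table combine, the sketch
below builds a comma-separated option string and passes it to mount(2).
The helper names are hypothetical, the device and mount point are supplied
by the caller, and which options are accepted depends on the kernel
configuration, so treat this as a sketch rather than a reference.

```c
#include <stdio.h>
#include <sys/mount.h>

/*
 * Format an EROFS option string such as
 * "user_xattr,noacl,cache_strategy=readaround".
 * Returns the value of snprintf().
 */
static int example_erofs_opts(char *buf, size_t len, int user_xattr, int acl,
			      const char *cache_strategy)
{
	return snprintf(buf, len, "%suser_xattr,%sacl,cache_strategy=%s",
			user_xattr ? "" : "no", acl ? "" : "no",
			cache_strategy);
}

/* Mount an EROFS image read-only; dev and dir are caller-supplied paths. */
static int example_mount_erofs(const char *dev, const char *dir,
			       const char *opts)
{
	return mount(dev, dir, "erofs", MS_RDONLY, opts);
}
```

Calling example_mount_erofs() requires the usual mount privileges; the
option string alone can be checked without them.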

On-disk details
===============

Summary
-------
Unlike other read-only file systems, an EROFS volume is designed
to be as simple as possible::

                                |-> aligned with the block size
   ____________________________________________________________
  | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
  |_|__|_|_____|__________|_____|______|__________|_____|______|
  0 +1K

All data areas should be aligned with the block size, but metadata areas
need not be. All metadata can currently be observed in two different
spaces (views):

 1. Inode metadata space

    Each valid inode should be aligned with an inode slot, which has a fixed
    size (32 bytes) chosen to match the compact inode size.

    Each inode can be directly found with the following formula::

         inode offset = meta_blkaddr * block_size + 32 * nid

    ::

				    |-> aligned with 8B
					    |-> followed closely
	+ meta_blkaddr blocks                                      |-> another slot
	_____________________________________________________________________
	|  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
	|________|_______|(optional)|(optional)|__(optional)_|_____|__________
		|-> aligned with the inode slot size
		    .                   .
		    .                         .
		.                              .
		.                                    .
	    .                                         .
	    .                                              .
	.____________________________________________________|-> aligned with 4B
	| xattr_ibody_header | shared xattrs | inline xattrs |
	|____________________|_______________|_______________|
	|->    12 bytes    <-|->x * 4 bytes<-|               .
			    .                .                 .
			.                      .                   .
		.                           .                     .
	    ._______________________________.______________________.
	    | id | id | id | id |  ... | id | ent | ... | ent| ... |
	    |____|____|____|____|______|____|_____|_____|____|_____|
					    |-> aligned with 4B
							|-> aligned with 4B

    An inode is either 32 or 64 bytes; the two sizes are distinguished by
    i_format, a field common to all inode versions::

        __________________               __________________
       |     i_format     |             |     i_format     |
       |__________________|             |__________________|
       |        ...       |             |        ...       |
       |                  |             |                  |
       |__________________| 32 bytes    |                  |
                                        |                  |
                                        |__________________| 64 bytes

    Xattrs, extents and inline data follow the corresponding inode with
    proper alignment, and they are optional depending on the data mapping.
    Currently, 4 valid data mappings are supported:

    ==  ====================================================================
     0  flat file data without data inline (no extent);
     1  fixed-sized output data compression (with non-compacted indexes);
     2  flat file data with tail packing data inline (no extent);
     3  fixed-sized output data compression (with compacted indexes, v5.3+).
    ==  ====================================================================

    The size of the optional xattrs is indicated by i_xattr_count in the
    inode header. Large xattrs or xattrs shared by many different files can
    be stored in shared xattrs metadata rather than inlined right after the
    inode.

 2. Shared xattrs metadata space

    Shared xattrs space is similar to the above inode space: it starts at
    a specific block indicated by xattr_blkaddr, and entries are organized
    one by one with proper alignment.

    Each shared xattr can also be directly found by the following formula::

         xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

    ::

			    |-> aligned by  4 bytes
	+ xattr_blkaddr blocks                     |-> aligned with 4 bytes
	_________________________________________________________________________
	|  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
	|________|_____________|_____________|_____|______________|_______________
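
The two offset formulas above translate directly into code. The following
is a minimal sketch; the function and macro names are illustrative and are
not the kernel's own helpers, and the parameter widths are assumptions.

```c
#include <stdint.h>

#define EXAMPLE_INODE_SLOT_SIZE 32u	/* inode slot size stated above */
#define EXAMPLE_XATTR_UNIT	4u	/* shared xattr unit stated above */

/* inode offset = meta_blkaddr * block_size + 32 * nid */
static uint64_t example_inode_offset(uint32_t meta_blkaddr,
				     uint32_t block_size, uint64_t nid)
{
	return (uint64_t)meta_blkaddr * block_size +
	       EXAMPLE_INODE_SLOT_SIZE * nid;
}

/* xattr offset = xattr_blkaddr * block_size + 4 * xattr_id */
static uint64_t example_shared_xattr_offset(uint32_t xattr_blkaddr,
					    uint32_t block_size,
					    uint32_t xattr_id)
{
	return (uint64_t)xattr_blkaddr * block_size +
	       (uint64_t)EXAMPLE_XATTR_UNIT * xattr_id;
}
```

For example, with meta_blkaddr = 1 and a 4096-byte block size, nid 36 maps
to byte offset 4096 + 32 * 36 = 5248. Since a 64-byte extended inode spans
two 32-byte slots, valid nids are not necessarily consecutive.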

Directories
-----------
All directories are organized in a compact on-disk format. Each directory
block is divided into an index area and a name area in order to support
random file lookup, and all directory entries are *strictly* recorded in
alphabetical order so that an improved prefix binary search algorithm can
be used (refer to the related source code for details).

::

		    ___________________________
		    /                           |
		/              ______________|________________
		/              /              | nameoff1       | nameoffN-1
    ____________.______________._______________v________________v__________
    | dirent | dirent | ... | dirent | filename | filename | ... | filename |
    |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
	\                           ^
	\                          |                           * could have
	\                         |                             trailing '\0'
	    \________________________| nameoff0

				Directory block

Note that nameoff0 gives not only the offset of the first filename but also
the total number of directory entries in this block, so no additional
on-disk field is needed.
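
The layout above can be walked with a few lines of code. The sketch below
assumes a 12-byte directory entry (a little-endian 64-bit nid, a
little-endian 16-bit nameoff at byte 8, then a type byte and a reserved
byte); this entry layout is not stated in this document and should be
checked against erofs_fs.h. It uses a plain binary search over byte-wise
ordered names, not the prefix-optimized search the kernel implements, and
all names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed entry size: le64 nid, le16 nameoff, u8 type, u8 reserved. */
#define EXAMPLE_DIRENT_SIZE 12u

/* nameoff of the i-th entry, read as a little-endian 16-bit value. */
static uint16_t example_nameoff(const uint8_t *blk, unsigned int i)
{
	const uint8_t *p = blk + i * EXAMPLE_DIRENT_SIZE + 8;

	return (uint16_t)(p[0] | (p[1] << 8));
}

/* The index area ends where the first name begins, so nameoff0 gives N. */
static unsigned int example_dirent_count(const uint8_t *blk)
{
	return example_nameoff(blk, 0) / EXAMPLE_DIRENT_SIZE;
}

/* Length of the i-th name; the last name may carry trailing '\0' bytes. */
static size_t example_name_len(const uint8_t *blk, size_t blksz,
			       unsigned int i, unsigned int n)
{
	size_t start = example_nameoff(blk, i);
	size_t end = (i + 1 < n) ? example_nameoff(blk, i + 1) : blksz;
	size_t len = end - start;
	const uint8_t *nul;

	if (i + 1 == n) {
		nul = memchr(blk + start, 0, len);
		if (nul)
			len = (size_t)(nul - (blk + start));
	}
	return len;
}

/* Returns the index of the entry named 'name', or -1 if it is absent. */
static int example_dir_lookup(const uint8_t *blk, size_t blksz,
			      const char *name)
{
	unsigned int n = example_dirent_count(blk);
	size_t want = strlen(name);
	int lo = 0, hi = (int)n - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		size_t len = example_name_len(blk, blksz, (unsigned int)mid, n);
		size_t cmplen = len < want ? len : want;
		int c = memcmp(name,
			       blk + example_nameoff(blk, (unsigned int)mid),
			       cmplen);

		if (c == 0 && len == want)
			return mid;
		if (c < 0 || (c == 0 && want < len))
			hi = mid - 1;	/* name sorts before entry mid */
		else
			lo = mid + 1;	/* name sorts after entry mid */
	}
	return -1;
}
```

The sketch does not validate the block; real code would have to reject
nameoff values that point outside the block or are not increasing.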

Compression
-----------
Currently, EROFS supports 4KB fixed-sized output transparent file compression,
as illustrated below::

	    |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
	    clusterofs                      clusterofs            clusterofs
	    |                               |                     |   logical data
    _________v_______________________________v_____________________v_______________
    ... |    .        |             |        .    |             |  .          | ...
    ____|____.________|_____________|________.____|_____________|__.__________|____
	|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
	    size          size          size          size          size
	    .                             .                .                   .
	    .                       .               .                  .
		.                  .              .                .
	_______._____________._____________._____________._____________________
	    ... |             |             |             | ... physical data
	_______|_____________|_____________|_____________|_____________________
		|-> cluster <-|-> cluster <-|-> cluster <-|
		    size          size          size

Currently, each on-disk physical cluster can contain at most 4KB of
(un)compressed data. For each logical cluster, there is a corresponding
on-disk index that describes its cluster type, physical cluster address, etc.

See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
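
Because there is one index per logical cluster, finding the index that
covers a given file offset is a matter of division. The sketch below
assumes a 4KB logical cluster size and that indexes are numbered in logical
order; it does not model the on-disk index structure itself, and the names
are illustrative.

```c
#include <stdint.h>

#define EXAMPLE_CLUSTER_SIZE 4096u	/* assumed 4KB logical cluster size */

struct example_lcluster_pos {
	uint64_t lcn;		/* logical cluster number (index to consult) */
	uint32_t offset;	/* byte offset inside that logical cluster */
};

/* Split a logical file offset into cluster number and in-cluster offset. */
static struct example_lcluster_pos example_locate(uint64_t file_offset)
{
	struct example_lcluster_pos pos = {
		.lcn = file_offset / EXAMPLE_CLUSTER_SIZE,
		.offset = (uint32_t)(file_offset % EXAMPLE_CLUSTER_SIZE),
	};

	return pos;
}
```

For instance, file offset 10000 falls in logical cluster 2 at in-cluster
offset 1808.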