.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories.  Inodes in the system are very small and all blocks are packed to
minimise data overhead.  Block sizes greater than 4 KiB are supported, up to a
maximum of 1 MiB (the default block size is 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net

Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

============================== ========= ==========
Feature                        Squashfs  Cramfs
============================== ========= ==========
Max filesystem size            2^64      256 MiB
Max file size                  ~ 2 TiB   16 MiB
Max files                      unlimited unlimited
Max directories                unlimited unlimited
Max entries per directory      unlimited unlimited
Max block size                 1 MiB     4 KiB
Metadata compression           yes       no
Directory indexes              yes       no
Sparse file support            yes       no
Tail-end packing (fragments)   yes       no
Exportable (NFS etc.)          yes       no
Hard link support              yes       no
"." and ".." in readdir        yes       no
Real inode numbers             yes       no
32-bit uids/gids               yes       no
File creation time             yes       no
Xattr support                  yes       no
ACL support                    no        no
============================== ========= ==========

Squashfs compresses data, inodes and directories.  In addition, inode and
directory data are highly compacted, and packed on byte boundaries.  Each
compressed inode is on average 8 bytes in length (the exact length varies with
the file type: regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems.  This and other squashfs utilities,
along with usage instructions, can be obtained from http://www.squashfs.org.

The squashfs-tools development tree is now located on kernel.org::

        git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on
byte boundaries::

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |     table     |
        |---------------|
        |    export     |
        |     table     |
        |---------------|
        |    uid/gid    |
        | lookup table  |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates.  Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size).  If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) is compressed in 8 KiB blocks.  Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed.  A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.

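As a minimal sketch of the header described above, the two-byte length could
be decoded as follows.  The function name is hypothetical, and little-endian
byte order is an assumption that this document does not state.

```python
import struct

# Top bit of the two-byte length: set when the block is stored uncompressed.
METADATA_UNCOMPRESSED_BIT = 0x8000


def parse_metadata_header(raw):
    """Return (on_disk_length, is_compressed) for a metadata block header."""
    # Assumption: the length is stored little-endian.
    (header,) = struct.unpack("<H", raw[:2])
    length = header & (METADATA_UNCOMPRESSED_BIT - 1)  # low 15 bits
    is_compressed = not (header & METADATA_UNCOMPRESSED_BIT)
    return length, is_compressed
```
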
Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode may span two compressed blocks.  Inodes are identified
by a 48-bit number which encodes the location of the compressed metadata block
containing the inode, and the byte offset into that block where the inode is
placed (<block, offset>).

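The <block, offset> pair can be illustrated with a short sketch.  It assumes
the 48 bits are split into a 32-bit block location in the upper bits and a
16-bit offset in the lower bits; the split and the helper names are
assumptions made for illustration only.

```python
OFFSET_BITS = 16                  # assumed width of the offset field
OFFSET_MASK = (1 << OFFSET_BITS) - 1
BLOCK_MASK = (1 << 32) - 1        # assumed width of the block field


def make_inode_ref(block, offset):
    """Pack a metadata block location and a byte offset into one number."""
    return ((block & BLOCK_MASK) << OFFSET_BITS) | (offset & OFFSET_MASK)


def split_inode_ref(ref):
    """Unpack an inode reference into (block, offset)."""
    return (ref >> OFFSET_BITS) & BLOCK_MASK, ref & OFFSET_MASK
```
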
To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table.  Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names.  The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block.  A new directory header
is written whenever the inode start block changes.  The directory
header/directory entry list is repeated as many times as necessary.

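The two-level list can be sketched as a grouping of consecutive entries that
share an inode start block.  This is a simplified illustration only: the
tuple layout and the function name are hypothetical, and any other condition
that might force a new header is ignored.

```python
from itertools import groupby


def group_directory_entries(entries):
    """Group name-sorted (name, inode_block, inode_offset) tuples into runs.

    Each run is (start_block, [(name, inode_offset), ...]) and stands for one
    directory header followed by the entries that share its start block.
    """
    runs = []
    # groupby only merges *consecutive* equal keys, so a new run (header)
    # begins every time the inode start block changes.
    for start_block, run in groupby(entries, key=lambda entry: entry[1]):
        runs.append((start_block, [(name, off) for name, _, off in run]))
    return runs
```
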
Directories are sorted, and can contain a directory index to speed up
file lookup.  Directory indexes store one entry per metablock, each entry
storing the index/filename mapping to the first directory header
in each metadata block.  Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up.  At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

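A minimal sketch of the linear index scan, assuming a hypothetical in-memory
index of ``(first_filename, block_location)`` pairs with one pair per metadata
block, including the first.  The on-disk index layout is not modelled, and
Python string ordering stands in for the filesystem's name comparison.

```python
def find_metadata_block(index, target):
    """Return the location of the metadata block that could hold `target`.

    `index` is a name-sorted list of (first_filename, block_location) pairs.
    Returns None when `target` sorts before every indexed name.
    """
    found = None
    for first_filename, block_location in index:
        if first_filename > target:
            # First name larger than the target: the previous block is the
            # only one that can contain it.
            break
        found = block_location
    return found
```
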
3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block).  The compressed size
of each datablock is stored in a block list contained within the
file inode.

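Because only compressed sizes are stored, finding datablock ``n`` means summing
the sizes of the blocks before it.  The sketch below illustrates this with
hypothetical names; it treats each block list entry as a plain byte count and
ignores any flag bits an entry may carry.

```python
def datablock_location(start, compressed_sizes, n):
    """Return (disk_offset, compressed_size) of datablock `n`.

    `start` is the disk location of the file's first datablock and
    `compressed_sizes` is the block list from the inode.
    """
    # Blocks are contiguous, so block n starts after blocks 0..n-1.
    return start + sum(compressed_sizes[:n]), compressed_sizes[n]
```

The cost of this walk grows with ``n``, which is why caching the mapping from
block index to disk location helps for large files.
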
To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk.  The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

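The figures above are consistent with each other, as this small arithmetic
check shows.  It uses only the numbers quoted in this section; the even split
of cache memory across slots is an assumption.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

BLOCK_SIZE = 128 * KIB        # block size quoted in the text
SLOTS = 8                     # number of cache slots
BYTES_PER_SLOT = 224 * GIB    # file size covered by one slot
CACHE_BYTES = 16 * KIB        # default size of the index cache

# Datablocks whose locations one slot has to cover.
blocks_per_slot = BYTES_PER_SLOT // BLOCK_SIZE

# Largest file covered when all slots are used: 8 * 224 GiB = 1.75 TiB.
max_cached_file = SLOTS * BYTES_PER_SLOT

# Assumption: the cache memory is divided evenly between the slots.
memory_per_slot = CACHE_BYTES // SLOTS
```
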
3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table.  This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these.  This second index table,
for speed of access (and because it is small), is read at mount time and
cached in memory.

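The two-level lookup can be sketched as follows.  The 16-byte entry size and
the names used here are assumptions for illustration; ``index_table`` stands
for the second index table held in memory after mount.

```python
METADATA_SIZE = 8192        # uncompressed size of a metadata block
FRAGMENT_ENTRY_SIZE = 16    # assumed size of one fragment table entry


def locate_fragment_entry(fragment_index, index_table):
    """Return (metadata_block_location, byte_offset) for a fragment entry.

    `index_table` lists the disk location of each metadata block that holds
    part of the fragment lookup table.
    """
    byte_position = fragment_index * FRAGMENT_ENTRY_SIZE
    # Which metadata block holds the entry, and where inside it.
    block_number, offset = divmod(byte_position, METADATA_SIZE)
    return index_table[block_number], offset
```
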
3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table.  This table is
stored compressed into metadata blocks.  A second index table is used to
locate these.  This second index table, for speed of access (and because it
is small), is read at mount time and cached in memory.

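The space saving comes from storing each distinct id once and referring to it
by a small index.  The sketch below shows the idea with hypothetical names; it
assumes that uids and gids share a single table.

```python
def build_id_table(owners):
    """Replace (uid, gid) pairs with indexes into a table of unique ids.

    Returns (id_table, indexed_owners).  Assumption: uids and gids are kept
    in one shared table.
    """
    id_table = []
    slot_of = {}

    def index_of(ident):
        # Add the id on first sight, then reuse its slot.
        if ident not in slot_of:
            slot_of[ident] = len(id_table)
            id_table.append(ident)
        return slot_of[ident]

    indexed = [(index_of(uid), index_of(gid)) for uid, gid in owners]
    return id_table, indexed
```

Converting back is a plain table lookup, for example ``id_table[uid_index]``.
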
3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally (disabled with the -no-exports Mksquashfs option) contain
an inode number to inode disk location lookup table.  This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks.  A second index table is
used to locate these.  This second index table, for speed of access (and
because it is small), is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode.  The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field.  The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted.  Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).  This allows large values
to be stored out of line, improving scanning and lookup performance, and it
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored.  This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches: one for metadata
and one for fragments.

The cache is not used for file datablocks; these are decompressed and cached in
the page-cache in the normal way.  The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access.  Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it.  Because of locality of reference, these may be
read in the near future.  Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.

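The benefit can be illustrated with a small cache of decompressed blocks keyed
by disk location.  This is a sketch only: least-recently-used eviction is
chosen for simplicity and is not claimed to be the policy Squashfs uses, and
``read_and_decompress`` is a hypothetical callback.

```python
from collections import OrderedDict


class BlockCache:
    """Keep the most recently used decompressed blocks in memory."""

    def __init__(self, capacity, read_and_decompress):
        self.capacity = capacity
        self.read_and_decompress = read_and_decompress
        self.entries = OrderedDict()

    def get(self, location):
        if location in self.entries:
            # Hit: no read or decompress needed; mark as recently used.
            self.entries.move_to_end(location)
            return self.entries[location]
        data = self.read_and_decompress(location)
        self.entries[location] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return data
```
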
In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.  Because the page cache operates on page-sized
units this may introduce additional complexity in terms of locking and
associated race conditions.