.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories.  Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4 KiB are supported up to a
maximum of 1 MiB (default block size 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

============================== ========= ==========
                               Squashfs  Cramfs
============================== ========= ==========
Max filesystem size            2^64      256 MiB
Max file size                  ~ 2 TiB   16 MiB
Max files                      unlimited unlimited
Max directories                unlimited unlimited
Max entries per directory      unlimited unlimited
Max block size                 1 MiB     4 KiB
Metadata compression           yes       no
Directory indexes              yes       no
Sparse file support            yes       no
Tail-end packing (fragments)   yes       no
Exportable (NFS etc.)          yes       no
Hard link support              yes       no
"." and ".." in readdir        yes       no
Real inode numbers             yes       no
32-bit uids/gids               yes       no
File creation time             yes       no
Xattr support                  yes       no
ACL support                    no        no
============================== ========= ==========

Squashfs compresses data, inodes and directories.  In addition, inode and
directory data are highly compacted, and packed on byte boundaries.  Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems.  This and other squashfs utilities
can be obtained from http://www.squashfs.org.  Usage instructions are also
available from this site.

The squashfs-tools development tree is now located on kernel.org
	git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

	 ---------------
	|  superblock   |
	|---------------|
	|  compression  |
	|    options    |
	|---------------|
	|  datablocks   |
	|  & fragments  |
	|---------------|
	|  inode table  |
	|---------------|
	|   directory   |
	|     table     |
	|---------------|
	|   fragment    |
	|     table     |
	|---------------|
	|    export     |
	|     table     |
	|---------------|
	|    uid/gid    |
	| lookup table  |
	|---------------|
	|     xattr     |
	|     table     |
	 ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for
duplicates.  Once all file data has been
written the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression specific options (e.g.
dictionary size).  If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) is compressed in 8 KiB blocks.  Each
compressed block is prefixed by a two byte length; the top bit is set if the
block is uncompressed.  A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode can straddle two compressed blocks.  Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.
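
The two encodings described above (the two byte metadata block length and the
48-bit inode reference) can be illustrated with a short Python sketch.  It is
not the kernel implementation.  It assumes little-endian byte order, and it
assumes the inode reference keeps the byte offset in its low 16 bits and the
block location in the remaining upper bits; neither detail is stated in this
document.  The function names are hypothetical.

```python
import struct
from typing import Tuple

UNCOMPRESSED_BIT = 1 << 15  # top bit of the two byte length prefix


def parse_metadata_header(raw: bytes) -> Tuple[int, bool]:
    """Return (stored_length, is_compressed) for a metadata block prefix.

    Assumption: the two byte length is stored little-endian.
    """
    (header,) = struct.unpack("<H", raw[:2])
    is_compressed = not (header & UNCOMPRESSED_BIT)
    length = header & (UNCOMPRESSED_BIT - 1)  # low 15 bits carry the length
    return length, is_compressed


def make_inode_ref(block: int, offset: int) -> int:
    """Pack <block, offset> into one integer (assumed layout, see above)."""
    return (block << 16) | offset


def split_inode_ref(ref: int) -> Tuple[int, int]:
    """Split a 48-bit inode reference back into (block, offset)."""
    if not 0 <= ref < (1 << 48):
        raise ValueError("inode reference must fit in 48 bits")
    return ref >> 16, ref & 0xFFFF
```

A 16-bit offset is enough to address any byte of a decompressed 8 KiB metadata
block, which is why the remaining bits of the reference can be used for the
block location.
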

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table.  Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names.  The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block.  A new directory header
is written whenever the inode start block changes.  The directory
header/directory entry list is repeated as many times as necessary.

Directories are sorted, and can contain a directory index to speed up
file lookup.  Directory indexes store one entry per metablock, each entry
storing the index/filename mapping to the first directory header
in each metadata block.
Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up.  At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block).  The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 Mbytes or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk.  The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
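
Because the block list stores only compressed sizes, finding where datablock N
starts means summing the sizes of all earlier blocks.  The Python sketch below
shows that calculation and a simplified stand-in for an index: it records the
disk position every ``step`` blocks so a lookup only sums a short tail.  The
names ``block_offset``, ``BlockIndex`` and ``step`` are hypothetical, the sizes
are plain integers rather than on-disk block list entries, and the real index
cache is organised differently (fixed slots, filled on demand).

```python
from typing import List


def block_offset(start: int, sizes: List[int], index: int) -> int:
    """Disk offset of datablock `index`: start plus all earlier sizes."""
    return start + sum(sizes[:index])


class BlockIndex:
    """Remember the disk offset of every `step`-th datablock."""

    def __init__(self, start: int, sizes: List[int], step: int) -> None:
        self.sizes = sizes
        self.step = step
        self.checkpoints = [start]  # checkpoints[k] = offset of block k*step
        pos = start
        for i, size in enumerate(sizes, 1):
            pos += size
            if i % step == 0:
                self.checkpoints.append(pos)

    def offset(self, index: int) -> int:
        """Same result as block_offset(), summing at most `step` - 1 sizes."""
        slot = index // self.step
        first = slot * self.step
        return self.checkpoints[slot] + sum(self.sizes[first:index])
```

The slot figures quoted above are consistent with each other: eight slots of
224 GiB cover 8 x 224 GiB = 1792 GiB, which is 1.75 TiB.
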

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table.  This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these.  For speed of access (and
because it is small), this second index table is read at mount time and cached
in memory.

3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table.  This table is
stored compressed into metadata blocks.  A second index table is used to
locate these.  For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
can optionally (disabled with the -no-exports Mksquashfs option) contain
an inode number to inode disk location lookup table.  This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks.
A second index table is
used to locate these.  For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode.  The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field.  The type field encodes the xattr prefix
("user.", "trusted." etc) and it also encodes how the name/value fields
should be interpreted.  Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).  This allows large values
to be stored out of line, improving scanning and lookup performance, and it
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored.  This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
recently accessed data Squashfs uses two small caches: a metadata cache and
a fragment cache.

The cache is not used for file datablocks; these are decompressed and cached in
the page-cache in the normal way.  The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access.  Because metadata and fragments
are packed together into blocks (to gain greater compression) the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it.  Because of locality of reference, these may be
read in the near future.  Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.  Because the page cache operates on page sized
units this may introduce additional complexity in terms of locking and
associated race conditions.
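
The benefit of the internal cache described in section 4.2 can be illustrated
with a small Python sketch: a decompressed block is kept under its disk
location, so a second access to anything packed in the same block costs no
extra read or decompress.  This is only an illustration.  It uses a plain
least-recently-used policy, which is an assumption and not necessarily how the
kernel cache chooses entries to replace, and ``BlockCache`` and ``load`` are
hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class BlockCache:
    """Keep the most recently used decompressed blocks, keyed by location."""

    def __init__(self, capacity: int, load: Callable[[int], bytes]) -> None:
        self.capacity = capacity
        self.load = load  # hypothetical "read and decompress" callback
        self.entries: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0

    def get(self, block_start: int) -> bytes:
        if block_start in self.entries:
            # Hit: mark as most recently used, no read or decompress needed.
            self.entries.move_to_end(block_start)
            return self.entries[block_start]
        # Miss: read and decompress once, then remember the result.
        self.misses += 1
        data = self.load(block_start)
        self.entries[block_start] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used
        return data
```

With a capacity of two, reading blocks 0, 0, 8192 calls ``load`` only twice;
a later read of a third block evicts block 0, which must then be loaded again
if it is needed.
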