.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories.  Inodes in the system are very small and all blocks are packed to
minimise data overhead.  Block sizes greater than 4 KiB are supported, up to a
maximum of 1 MiB (the default block size is 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

==============================  ==========              ==========
                                Squashfs                Cramfs
==============================  ==========              ==========
Max filesystem size             2^64 bytes              256 MiB
Max file size                   ~ 2 TiB                 16 MiB
Max files                       unlimited               unlimited
Max directories                 unlimited               unlimited
Max entries per directory       unlimited               unlimited
Max block size                  1 MiB                   4 KiB
Metadata compression            yes                     no
Directory indexes               yes                     no
Sparse file support             yes                     no
Tail-end packing (fragments)    yes                     no
Exportable (NFS etc.)           yes                     no
Hard link support               yes                     no
"." and ".." in readdir         yes                     no
Real inode numbers              yes                     no
32-bit uids/gids                yes                     no
File creation time              yes                     no
Xattr support                   yes                     no
ACL support                     no                      no
==============================  ==========              ==========

Squashfs compresses data, inodes and directories.  In addition, inode and
directory data are highly compacted, and packed on byte boundaries.  Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems.  This and other squashfs utilities
can be obtained from http://www.squashfs.org.  Usage instructions are also
available from this site.

The squashfs-tools development tree is now located on kernel.org
	git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |    table      |
        |---------------|
        |    export     |
        |    table      |
        |---------------|
        |    uid/gid    |
        |  lookup table |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates.  Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size).  If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) is compressed in 8 KiB blocks.  Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed.  A block will be stored uncompressed if the -noI option
is set, or if the compressed block was larger than the uncompressed block.
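
The two-byte length prefix described above can be decoded as in the following
minimal sketch.  The function name and the constant name are illustrative
only, and little-endian byte order is an assumption that the text above does
not state.

```python
import struct

# Top bit of the two-byte length: set when the block is stored uncompressed.
UNCOMPRESSED_BIT = 0x8000  # illustrative constant name


def parse_metadata_header(raw):
    """Return (stored_length, is_compressed) from a metadata block prefix."""
    # "<H" = little-endian unsigned 16-bit; the byte order is an assumption.
    (word,) = struct.unpack("<H", raw[:2])
    stored_length = word & 0x7FFF          # lower 15 bits hold the length
    is_compressed = (word & UNCOMPRESSED_BIT) == 0
    return stored_length, is_compressed
```

For example, the bytes ``34 12`` would describe a compressed block of 0x1234
bytes, while ``00 A0`` would describe an uncompressed block of 8192 bytes.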

Inodes are packed into the metadata blocks and are not aligned to block
boundaries, so an inode may span two compressed blocks.  Inodes are identified
by a 48-bit number which encodes the location of the compressed metadata block
containing the inode, and the byte offset into that block where the inode is
placed (<block, offset>).
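
One way to picture the 48-bit <block, offset> number is sketched here.  The
split into a 32-bit block location and a 16-bit offset is an assumption made
for illustration (the text above only gives the 48-bit total), and the helper
names are hypothetical.

```python
OFFSET_BITS = 16  # assumed width of the offset field
BLOCK_BITS = 32   # assumed width of the block-location field


def make_inode_ref(block, offset):
    """Pack <block, offset> into a single 48-bit integer."""
    if not (0 <= block < (1 << BLOCK_BITS) and 0 <= offset < (1 << OFFSET_BITS)):
        raise ValueError("block or offset out of range")
    return (block << OFFSET_BITS) | offset


def split_inode_ref(ref):
    """Unpack a 48-bit inode reference into (block, offset)."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)
```

Since metadata blocks are 8 KiB, a 16-bit field is more than wide enough for
any byte offset within a block.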

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table.  Directories are accessed using the start address of
the metadata block containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names.  The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised as a two-level list: a directory
header containing the shared start block value, followed by a sequence of
directory entries, each of which uses that start block.  A new directory
header is written whenever the inode start block changes.  The directory
header/directory entry list is repeated as many times as necessary.
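
The two-level organisation can be sketched as a grouping of consecutive
entries that share an inode start block.  This is a simplified model, not the
on-disk layout: the tuple shapes and the function name are hypothetical, and
real directory headers carry additional fields that are not modelled here.

```python
from itertools import groupby


def group_directory(entries):
    """Group sorted directory entries into (header, entries) runs.

    entries: list of (name, inode_start_block, inode_offset), sorted by name.
    Returns a list of (start_block, [(name, inode_offset), ...]).
    """
    runs = []
    # groupby only merges *consecutive* items, so a new "header" starts
    # every time the start block changes, even if the value was seen before.
    for start_block, run in groupby(entries, key=lambda e: e[1]):
        runs.append((start_block, [(name, off) for name, _, off in run]))
    return runs
```

With entries ``a`` and ``b`` in block 0, ``c`` in block 96 and ``d`` back in
block 0, this yields three runs, since a header is emitted on every change
of start block.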

Directories are sorted, and can contain a directory index to speed up
file lookup.  Directory indexes store one entry per metadata block, each entry
storing the index/filename mapping to the first directory header
in each metadata block.  Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up.  At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure that only one metadata block needs
to be decompressed to do a lookup, irrespective of the length of the directory.
This scheme has the advantage that it requires no extra memory overhead
and little extra storage on disk.
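
The linear index scan can be sketched as follows.  The index is modelled as a
list of (first filename, metadata block location) pairs, and the function
name and arguments are hypothetical.  Python's default string ordering stands
in for the alphabetical order used on disk, which is an assumption.

```python
def find_directory_block(dir_start_block, index, name):
    """Return the metadata block that should contain `name`.

    dir_start_block: block where the directory listing begins.
    index: list of (first_name_in_block, block), sorted by first_name_in_block.
    """
    block = dir_start_block
    for first_name, candidate in index:
        if first_name > name:
            # First index entry larger than the target: the target, if it
            # exists, lives in the block found so far.
            break
        block = candidate
    return block
```

Only the block returned by the scan then needs to be decompressed and
searched for the directory entry.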

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block).  The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.
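
Because the block list stores only compressed sizes, finding the location of
block N means summing the sizes of all preceding blocks.  The sketch below
shows that cost and how cached checkpoints shorten the walk.  It is a
simplified illustration with hypothetical names and a hypothetical ``stride``
parameter, not the kernel's actual index cache layout.

```python
def block_location(first_block_start, block_sizes, index):
    """Disk location of data block `index` by summing all earlier sizes."""
    return first_block_start + sum(block_sizes[:index])


def build_checkpoints(first_block_start, block_sizes, stride):
    """Record the disk location of every `stride`-th block."""
    checkpoints = []
    pos = first_block_start
    for i, size in enumerate(block_sizes):
        if i % stride == 0:
            checkpoints.append(pos)
        pos += size
    return checkpoints


def block_location_cached(checkpoints, block_sizes, stride, index):
    """Same result as block_location(), walking at most stride - 1 sizes."""
    base = index // stride
    pos = checkpoints[base]
    for size in block_sizes[base * stride:index]:
        pos += size
    return pos
```

The uncached walk grows with the block index, whereas the cached walk is
bounded by the checkpoint spacing.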

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk.  The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
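
The figures quoted above are consistent with each other, as this short
arithmetic check shows.  It uses only numbers taken from the text, and the
variable names are illustrative.

```python
GIB = 1 << 30
TIB = 1 << 40

SLOTS = 8                      # number of cache slots
BYTES_PER_SLOT = 224 * GIB     # file size one slot can cover
BLOCK_SIZE = 128 * 1024        # 128 KiB data blocks

# Eight slots of 224 GiB each cover 1792 GiB, i.e. 1.75 TiB.
total_bytes = SLOTS * BYTES_PER_SLOT
total_tib = total_bytes / TIB

# Number of 128 KiB data blocks whose locations one slot must cover.
blocks_per_slot = BYTES_PER_SLOT // BLOCK_SIZE

print(total_tib, blocks_per_slot)
```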

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table.  This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these.  For speed of access (and
because it is small), this second index table is read at mount time and cached
in memory.
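
The two-level lookup can be sketched as below: the in-memory index table
selects a metadata block, and the remainder gives the byte offset of the
entry inside that block once decompressed.  The 16-byte entry size is an
assumption made for illustration, and the function name is hypothetical.

```python
METADATA_SIZE = 8192  # uncompressed metadata block size (8 KiB)
ENTRY_SIZE = 16       # assumed size of one fragment table entry, in bytes


def locate_fragment_entry(index_table, fragment_index):
    """Return (metadata_block_location, offset_in_block) for an entry.

    index_table: list of on-disk locations of the metadata blocks holding
    the fragment lookup table (the "second index table", cached in memory).
    """
    byte_pos = fragment_index * ENTRY_SIZE
    metadata_block = index_table[byte_pos // METADATA_SIZE]
    offset = byte_pos % METADATA_SIZE
    return metadata_block, offset
```

Under these assumptions each metadata block holds 512 entries, so the cached
index table needs only one location per 512 fragments.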

3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table.  This table is
stored compressed into metadata blocks.  A second index table is used to
locate these.  For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.
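
The space saving comes from storing each distinct id once and referring to it
by a small index.  The sketch below illustrates the idea with a hypothetical
helper.  It does not describe how mksquashfs actually builds the table.

```python
def build_id_table(ids):
    """Deduplicate ids into a table and return (table, per-file indexes)."""
    table = []
    index_of = {}
    for value in ids:
        if value not in index_of:
            index_of[value] = len(table)
            table.append(value)
    return table, [index_of[value] for value in ids]


def lookup_id(table, index):
    """Convert a stored index back into the full 32-bit uid or gid."""
    return table[index]
```

A filesystem with thousands of files but only a handful of distinct owners
therefore needs only a handful of table entries.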

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally contain an inode number to inode disk location lookup table
(disabled with the -no-exports mksquashfs option).  This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks.  A second index table is
used to locate these.  For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode.  The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field.  The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted.  Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).  This allows large values
to be stored out of line, improving scanning and lookup performance, and it
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored.  This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata and
one for fragments.

These caches are not used for file datablocks, which are decompressed and
cached in the page cache in the normal way.  The caches temporarily hold
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access.  Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it.  Because of locality of reference, these may be
read in the near future.  Temporarily caching them ensures they are available
for near-future access without requiring an additional read and decompress.
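
The benefit of a small cache of decompressed blocks can be sketched as
follows.  The class, its callback and the least-recently-used eviction policy
are illustrative assumptions.  The eviction policy actually used by the
kernel is not described above.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny cache of decompressed blocks, keyed by block start address."""

    def __init__(self, capacity, read_and_decompress):
        self.capacity = capacity
        self.read_and_decompress = read_and_decompress  # hypothetical callback
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, block_start):
        if block_start in self.entries:
            # Hit: no read or decompression needed; mark as recently used.
            self.entries.move_to_end(block_start)
            return self.entries[block_start]
        self.misses += 1
        data = self.read_and_decompress(block_start)
        self.entries[block_start] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return data
```

Reading several inodes that sit in the same metadata block then costs one
decompression rather than one per inode.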

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.  Because the page cache operates on page sized
units this may introduce additional complexity in terms of locking and
associated race conditions.