.. SPDX-License-Identifier: GPL-2.0

Written by: Neil Brown
Please see MAINTAINERS file for where to send questions.

Overlay Filesystem
==================

This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of the other.


Overlay objects
---------------

The overlay filesystem approach is 'hybrid', because the objects that
appear in the filesystem do not always appear to belong to that filesystem.
In many cases, an object accessed in the union will be indistinguishable
from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem,
non-directory objects may report an st_dev from the lower filesystem or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.
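
Tools that do depend on these values usually treat the (st_dev, st_ino)
pair as the identity of an object. The following is a minimal Python
sketch of such a check. The helper names are illustrative only and are not
part of any overlayfs interface.

```python
import os


def object_identity(path):
    """Return the (st_dev, st_ino) pair reported for the object at path."""
    st = os.lstat(path)  # lstat: do not follow a trailing symlink
    return (st.st_dev, st.st_ino)


def same_object(path_a, path_b):
    """True if both paths currently report the same (st_dev, st_ino) pair."""
    return object_identity(path_a) == object_identity(path_b)
```

Because both values can change over the lifetime of a non-directory
object, an identity recorded earlier on an overlay mount may no longer
match a later call for the same name.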

In the special case of all overlay layers on the same underlying
filesystem, all objects will report an st_dev from the overlay
filesystem and st_ino from the underlying filesystem. This will
make the overlay mount more compliant with filesystem scanners and
overlay objects will be distinguishable from the corresponding
objects in the original filesystem.

On 64bit systems, even if the overlay layers are not all on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
identifier from the real object st_ino and an underlying fsid index.

If all underlying filesystems support NFS file handles and export file
handles with 32bit inode number encoding (e.g. ext4), overlay filesystem
will use the high inode number bits for fsid. Even when the underlying
filesystem uses 64bit inode numbers, users can still enable the "xino"
feature with the "-o xino=on" overlay mount option. That is useful for the
case of underlying filesystems like xfs and tmpfs, which use 64bit inode
numbers, but are very unlikely to use the high inode number bits. In case
the underlying inode number does overflow into the high xino bits, overlay
filesystem will fall back to the non xino behavior for that inode.

The following table summarizes what can be expected in different overlay
configurations.

Inode properties
````````````````

+--------------+------------+------------+-----------------+----------------+
|Configuration | Persistent | Uniform    | st_ino == d_ino | d_ino == i_ino |
|              | st_ino     | st_dev     |                 | [*]            |
+==============+=====+======+=====+======+========+========+========+=======+
|              | dir | !dir | dir | !dir |  dir   |  !dir  |  dir   | !dir  |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| All layers   |  Y  |  Y   |  Y  |  Y   |  Y     |   Y    |  Y     |  Y    |
| on same fs   |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| Layers not   |  N  |  Y   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
| on same fs,  |     |      |     |      |        |        |        |       |
| xino=off     |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| xino=on/auto |  Y  |  Y   |  Y  |  Y   |  Y     |   Y    |  Y     |  Y    |
|              |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| xino=on/auto,|  N  |  Y   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
| ino overflow |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+

[*] nfsd v3 readdirplus verifies d_ino == i_ino. i_ino is exposed via several
/proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify
file descriptor.
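
The "xino" composition described before the table can be pictured with a
toy model. The sketch below assumes that the fsid index simply occupies
the top ``xino_bits`` bits of a 64-bit inode number. That layout, the
function name and its parameters are assumptions made for illustration
and do not describe the kernel's actual encoding.

```python
INO_BITS = 64  # width of an inode number on 64bit systems


def compose_xino(real_ino, fsid, xino_bits):
    """Toy model: store fsid in the top xino_bits bits of the inode number.

    Returns None when real_ino already uses those high bits, modelling the
    fall back to non xino behavior for that inode.
    """
    if not 0 <= fsid < (1 << xino_bits):
        raise ValueError("fsid does not fit in xino_bits")
    shift = INO_BITS - xino_bits
    if real_ino >> shift:
        return None  # underlying inode number overflows into the xino bits
    return (fsid << shift) | real_ino
```

In this model two layers that both contain inode number 5 yield different
composed values as long as their fsid differs, which is what makes st_ino
unique across layers.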


Upper and Lower
---------------

An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem. When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.

The lower filesystem can be any filesystem supported by Linux and does
not need to be writable. The lower filesystem can even be another
overlayfs. The upper filesystem will normally be writable and if it
is it must support the creation of trusted.* extended attributes, and
must provide valid d_type in readdir responses, so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any
filesystem type.

Directories
-----------

Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.
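
The visibility rule stated so far can be written down as a small decision
function. This is a simplified Python model with hypothetical names. It
only covers the upper/lower rule above and deliberately ignores whiteouts
and opaque directories.

```python
def resolve_name(upper_kind, lower_kind):
    """Decide what a name refers to in the overlay.

    Each argument is None (name absent in that layer), "dir", or "other"
    (any non-directory). Returns "upper", "lower", "merged" or None.
    """
    if upper_kind is None and lower_kind is None:
        return None  # name does not exist at all
    if upper_kind is None:
        return "lower"  # only the lower object exists
    if lower_kind is None:
        return "upper"  # only the upper object exists
    if upper_kind == "dir" and lower_kind == "dir":
        return "merged"  # two directories are merged
    return "upper"  # a non-directory on either side hides the lower object
```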

Where both upper and lower objects are directories, a merged directory
is formed.

At mount time, the two directories given as mount options "lowerdir" and
"upperdir" are combined into a merged directory:

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
  workdir=/work /merged

The "workdir" needs to be an empty directory on the same filesystem
as upperdir.

Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem. If both
actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.

Only the lists of names from directories are merged. Other content
such as metadata and extended attributes is reported for the upper
directory only. These attributes of the lower directory are hidden.

credentials
-----------

By default, all access to the upper, lower and work directories is
performed with the recorded mounter's MAC and DAC credentials. The
incoming accesses are checked against the caller's credentials.

In the case where the caller's and the mounter's MAC or DAC credentials
do not overlap, a use case available in older versions of the driver, the
override_creds mount flag can be turned off. This helps when the caller
has legitimate credentials that the mounter does not. Several unintended
side effects will occur though. A caller that lacks certain key
capabilities, or has lower privilege, will not always be able to delete
files or directories, create nodes, or search some restricted
directories. The ability to search and read a directory entry is spotty
because the cache mechanism does not retest the credentials: a privileged
caller can fill the cache, and a lower privileged caller can then read
the directory cache. This uneven security model, where cache, upperdir
and workdir are opened at privilege but accessed without creating a form
of privilege escalation, should only be used with strict understanding
of the side effects and of the security policies.

whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).

A whiteout is created as a character device with 0/0 device number.
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.

readdir
-------

When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added). This merged name list is cached in the
'struct file' and so remains as long as the file is kept open. If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.

This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many
programs.

Seek offsets are assigned sequentially when the directories are read.
Thus if a program were to

  - read part of a directory
  - remember an offset, and close the directory
  - re-open the directory some time later
  - seek to the remembered offset

there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.

Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).

renaming directories
--------------------

When renaming a directory that is on the lower layer or merged (i.e. the
directory was not created on the upper layer to start with) overlayfs can
handle it in two different ways:

1. return EXDEV error: this error is returned by rename(2) when trying to
   move a file or directory across filesystem boundaries. Hence
   applications are usually prepared to handle this error (mv(1) for example
   recursively copies the directory tree). This is the default behavior.

2. If the "redirect_dir" feature is enabled, then the directory will be
   copied up (but not the contents). Then the "trusted.overlay.redirect"
   extended attribute is set to the path of the original location from the
   root of the overlay. Finally the directory is moved to the new
   location.

There are several ways to tune the "redirect_dir" feature.

Kernel config options:

- OVERLAY_FS_REDIRECT_DIR:
    If this is enabled, then redirect_dir is turned on by default.
- OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW:
    If this is enabled, then redirects are always followed by default. Enabling
    this results in a less secure configuration. Enable this option only when
    worried about backward compatibility with kernels that have the redirect_dir
    feature and follow redirects even if turned off.

Module options (can also be changed through /sys/module/overlay/parameters/):

- "redirect_dir=BOOL":
    See OVERLAY_FS_REDIRECT_DIR kernel config option above.
- "redirect_always_follow=BOOL":
    See OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW kernel config option above.
- "redirect_max=NUM":
    The maximum number of bytes in an absolute redirect (default is 256).

Mount options:

- "redirect_dir=on":
    Redirects are enabled.
- "redirect_dir=follow":
    Redirects are not created, but followed.
- "redirect_dir=off":
    Redirects are not created and only followed if "redirect_always_follow"
    feature is enabled in the kernel/module config.
- "redirect_dir=nofollow":
    Redirects are not created and not followed (equivalent to "redirect_dir=off"
    if "redirect_always_follow" feature is not enabled).
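
The interaction between the "redirect_dir" mount option and the
"redirect_always_follow" setting can be summarized as a lookup. The Python
sketch below is only a restatement of the list above, with a hypothetical
function name. It reads "redirects are enabled" for "redirect_dir=on" as
both created and followed, which is an interpretation rather than
something the list spells out.

```python
def redirect_policy(redirect_dir, redirect_always_follow=False):
    """Return (create_redirects, follow_redirects) for a mount option value."""
    if redirect_dir == "on":
        return (True, True)  # assumed: enabled means created and followed
    if redirect_dir == "follow":
        return (False, True)
    if redirect_dir == "off":
        # followed only if the kernel/module config says to always follow
        return (False, redirect_always_follow)
    if redirect_dir == "nofollow":
        return (False, False)
    raise ValueError("unknown redirect_dir value: %r" % (redirect_dir,))
```

With ``redirect_always_follow`` left disabled, "off" and "nofollow" give
the same result, matching the equivalence noted in the last list item.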

When the NFS export feature is enabled, every copied up directory is
indexed by the file handle of the lower inode and a file handle of the
upper directory is stored in a "trusted.overlay.upper" extended attribute
on the index entry. On lookup of a merged directory, if the upper
directory does not match the file handle stored in the index, that is an
indication that multiple upper directories may be redirected to the same
lower directory. In that case, lookup returns an error and warns about
a possible inconsistency.

Because lower layer redirects cannot be verified with the index, enabling
NFS export support on an overlay filesystem with no upper layer requires
turning off redirect follow (e.g. "redirect_dir=nofollow").


Non-directories
---------------

Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.

The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.
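
As an illustration of the open case only, the following Python sketch
classifies open(2) flags by whether they request write access. It is a
simplified model under the assumption that the access mode alone decides.
It is not the kernel's check, and it does not cover metadata changes or
hard-link creation, which also require copy_up.

```python
import os


def open_requests_write(flags):
    """True if the open flags ask for write access (O_WRONLY or O_RDWR)."""
    access_mode = flags & os.O_ACCMODE
    return access_mode in (os.O_WRONLY, os.O_RDWR)
```

An ``O_RDWR`` open is reported as requesting write access even if the
caller never writes, which is the situation where the copy_up turns out
to be unnecessary.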

The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem. Finally any
extended attributes are copied up.

Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).


Permission model
----------------

Permission checking in the overlay filesystem follows these principles:

 1) permission check SHOULD return the same result before and after copy up

 2) task creating the overlay mount MUST NOT gain additional privileges

 3) non-mounting task MAY gain additional privileges through the overlay,
    compared to direct access on underlying lower or upper filesystems

This is achieved by performing two permission checks on each access:

 a) check if current task is allowed access based on local DAC (owner,
    group, mode and posix acl), as well as MAC checks

 b) check if mounting task would be allowed real operation on lower or
    upper layer based on underlying filesystem permissions, again including
    MAC checks

Check (a) ensures consistency (1) since owner, group, mode and posix acls
are copied up. On the other hand it can result in server enforced
permissions (used by NFS, for example) being ignored (3).

Check (b) ensures that no task gains permissions to underlying layers that
the mounting task does not have (2). This also means that it is possible
to create setups where the consistency rule (1) does not hold; normally,
however, the mounting task will have sufficient privileges to perform all
operations.

Another way to demonstrate this model is drawing parallels between

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged

and

  cp -a /lower /upper
  mount --bind /upper /merged

The resulting access permissions should be the same. The difference is in
the time of copy (on-demand vs. up-front).


Multiple lower layers
---------------------

Multiple lower layers can now be given using the colon (":") as a
separator character between the directory names.
For example:

  mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

As the example shows, "upperdir=" and "workdir=" may be omitted. In
that case the overlay will be read-only.

The specified lower directories will be stacked beginning from the
rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.


Metadata only copy up
---------------------

When the metadata only copy up feature is enabled, overlayfs will only
copy up metadata (as opposed to the whole file) when a metadata specific
operation like chown/chmod is performed. The full file will be copied up
later, when the file is opened for a WRITE operation.

In other words, this is a delayed data copy up operation and data is copied
up when there is a need to actually modify data.

There are multiple ways to enable/disable this feature. A config option
CONFIG_OVERLAY_FS_METACOPY can be set/unset to enable/disable this feature
by default. Or one can enable/disable it at module load time with module
parameter metacopy=on/off. Lastly, there is also a per mount option
metacopy=on/off to enable/disable this feature per mount.

Do not use metacopy=on with untrusted upper/lower directories.
Otherwise
it is possible that an attacker can create a handcrafted file with
appropriate REDIRECT and METACOPY xattrs, and gain access to a file on
lower pointed to by REDIRECT. This should not be possible on a local
system, as setting "trusted." xattrs will require CAP_SYS_ADMIN. But it
should be possible for untrusted layers, such as those from a pen drive.

Note: redirect_dir={off|nofollow|follow[*]} and nfs_export=on mount options
conflict with metacopy=on, and will result in an error.

[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.

Sharing and copying layers
--------------------------

Lower layers may be shared among several overlay mounts and that is indeed
a very common practice. An overlay mount may use the same lower layer
path as another overlay mount and it may use a lower layer path that is
beneath or above the path of another overlay lower layer path.

Using an upper layer path and/or a workdir path that are already used by
another overlay mount is not allowed and may fail with EBUSY. Using
partially overlapping paths is not allowed and may fail with EBUSY.
If files are accessed from two overlayfs mounts which share or overlap the
upper layer and/or workdir path the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.
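
A script that sets up several overlay mounts may want to check for such
conflicts before calling mount. The Python sketch below treats two paths
as overlapping when they are equal or when one lies beneath the other.
That reading of "partially overlapping", the function name, and the
purely textual comparison (absolute paths, no symlink resolution) are
assumptions of this example and not the kernel's own in-use check.

```python
import os.path


def paths_overlap(path_a, path_b):
    """True if the two absolute paths are equal or one contains the other."""
    a = os.path.normpath(path_a)
    b = os.path.normpath(path_b)
    # The common ancestor equals one of the paths exactly when that path
    # is the other path itself or one of its parent directories.
    return os.path.commonpath([a, b]) in (a, b)
```

For example, ``/data/upper`` and ``/data/upper/sub`` overlap, while
``/data/upper`` and ``/data/upper2`` do not.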

Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
different lower layer path, is allowed, unless the "inodes index" feature
or "metadata only copy up" feature is enabled.

With the "inodes index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.

For the "metadata only copy up" feature there is no verification mechanism
at mount time. So if the same upper is mounted with a different set of
lower layers, the mount will probably succeed, but expect the unexpected
later on. So don't do it.

It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
to a different machine.
With the "inodes index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.


Non-standard behavior
---------------------

The current version of overlayfs can act as a mostly POSIX compliant
filesystem.

This is the list of cases that overlayfs doesn't currently handle:

a) POSIX mandates updating st_atime for reads. This is currently not
done in the case when the file resides on a lower layer.

b) If a file residing on a lower layer is opened for read-only and then
memory mapped with MAP_SHARED, then subsequent changes to the file are not
reflected in the memory mapping.

The following options allow overlayfs to act more like a standards
compliant filesystem:

1) "redirect_dir"

Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.

If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").

2) "inode index"

Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.

If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.

3) "xino"

Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
CONFIG_OVERLAY_FS_XINO_AUTO=y. Also implicitly enabled by using the same
underlying filesystem for all layers making up the overlay.

If this feature is disabled or the underlying filesystem doesn't have
enough free bits in the inode number, then overlayfs will not be able to
guarantee that the values of st_ino and st_dev returned by stat(2) and the
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
overlay filesystem and the value of st_ino for directory objects may not be
persistent and could change even while the overlay filesystem is mounted, as
summarized in the `Inode properties`_ table above.


Changes to underlying filesystems
---------------------------------

Offline changes, when the overlay is not mounted, are allowed to either
the upper or the lower trees.

Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.
If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.

When the overlay NFS export feature is enabled, the overlay filesystem's
behavior on offline changes of the underlying lower layer is different
from the behavior when NFS export is disabled.

On every copy_up, an NFS file handle of the lower inode, along with the
UUID of the lower filesystem, are encoded and stored in an extended
attribute "trusted.overlay.origin" on the upper inode.

When the NFS export feature is enabled, a lookup of a merged directory,
that found a lower directory at the lookup path or at the path pointed
to by the "trusted.overlay.redirect" extended attribute, will verify
that the found lower directory file handle and lower filesystem UUID
match the origin file handle that was stored at copy_up time. If a
found lower directory does not match the stored origin, that directory
will not be merged with the upper directory.



NFS export
----------

When the underlying filesystems support NFS export and the "nfs_export"
feature is enabled, an overlay filesystem may be exported to NFS.

With the "nfs_export" feature, on copy_up of any lower object, an index
entry is created under the index directory. The index entry name is the
hexadecimal representation of the copy up origin file handle.
For a non-directory object, the index entry is a hard link to the upper
inode.  For a directory object, the index entry has an extended attribute
"trusted.overlay.upper" with an encoded file handle of the upper
directory inode.

When encoding a file handle from an overlay filesystem object, the
following rules apply:

1. For a non-upper object, encode a lower file handle from lower inode
2. For an indexed object, encode a lower file handle from copy_up origin
3. For a pure-upper object and for an existing non-indexed upper object,
   encode an upper file handle from upper inode

The encoded overlay file handle includes:

- Header including path type information (e.g. lower/upper)
- UUID of the underlying filesystem
- Underlying filesystem encoding of underlying inode

This encoding format is identical to the encoding format of file handles
that are stored in extended attribute "trusted.overlay.origin".

When decoding an overlay file handle, the following steps are followed:

1. Find underlying layer by UUID and path type information.
2. Decode the underlying filesystem file handle to underlying dentry.
3. For a lower file handle, lookup the handle in index directory by name.
4. If a whiteout is found in index, return ESTALE. This represents an
   overlay object that was deleted after its file handle was encoded.
5. For a non-directory, instantiate a disconnected overlay dentry from the
   decoded underlying dentry, the path type and index inode, if found.
6. For a directory, use the connected underlying decoded dentry, path type
   and index, to lookup a connected overlay dentry.

Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
no upper alias.

When overlay filesystem has multiple lower layers, a middle layer
directory may have a "redirect" to lower directory.  Because middle layer
"redirects" are not indexed, a lower file handle that was encoded from the
"redirect" origin directory, cannot be used to find the middle or upper
layer directory.  Similarly, a lower file handle that was encoded from a
descendant of the "redirect" origin directory, cannot be used to
reconstruct a connected overlay path.  To mitigate the cases of
directories that cannot be decoded from a lower file handle, these
directories are copied up on encode and encoded as an upper file handle.
On an overlay filesystem with no upper layer this mitigation cannot be
used.  NFS export in this setup requires turning off redirect follow (e.g.
"redirect_dir=nofollow").

The overlay filesystem does not support non-directory connectable file
handles, so exporting with the 'subtree_check' exportfs configuration will
cause failures to lookup files over NFS.
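
As an illustration only, an overlay intended for NFS export might be
mounted and exported roughly as sketched here.  The paths /lower, /upper,
/work and /merged, the wildcard client and the fsid value are assumptions
made for this example and are not prescribed by overlayfs.
"no_subtree_check" is used because of the 'subtree_check' limitation
described above::

  # Hypothetical paths; run as root.  index=on is given explicitly
  # because index=off conflicts with nfs_export=on on a read-write mount.
  mount -t overlay overlay \
        -o lowerdir=/lower,upperdir=/upper,workdir=/work,index=on,nfs_export=on \
        /merged

  # The '*' client and fsid=1 are placeholders for this sketch; an explicit
  # fsid is assumed here because the overlay has no backing block device
  # of its own.
  exportfs -o rw,no_subtree_check,fsid=1 '*:/merged'

Whether an explicit fsid is required depends on the NFS server setup, so
treat that option as an assumption to check against exports(5).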

When the NFS export feature is enabled, all directory index entries are
verified at mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.

Note: the mount options index=off,nfs_export=on are conflicting for a
read-write mount and will result in an error.


Volatile mount
--------------

This is enabled with the "volatile" mount option.  Volatile mounts are not
guaranteed to survive a crash.  It is strongly recommended that volatile
mounts are only used if data written to the overlay can be recreated
without significant effort.

The advantage of mounting with the "volatile" option is that all forms of
sync calls to the upper filesystem are omitted.

In order to avoid giving a false sense of safety, the syncfs (and fsync)
semantics of volatile mounts are slightly different from those of the rest
of VFS.  If any writeback error occurs on the upperdir's filesystem after a
volatile mount takes place, all sync functions will return an error.  Once
this condition is reached, the filesystem will not recover, and every
subsequent sync call will return an error, even if the upperdir has not
experienced a new error since the last sync call.

When overlay is mounted with the "volatile" option, the directory
"$workdir/work/incompat/volatile" is created.
During the next mount, overlay
checks for this directory and refuses to mount if it is present.  This is a
strong indicator that the user should throw away the upper and work
directories and create fresh ones.  In very limited cases where the user
knows that the system has not crashed and the contents of upperdir are
intact, the "volatile" directory can be removed.

Testsuite
---------

There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:

  https://github.com/amir73il/unionmount-testsuite.git

Run as root:

  # cd unionmount-testsuite
  # ./run --ov --verify