.. SPDX-License-Identifier: GPL-2.0

=========================================
Overview of the Linux Virtual File System
=========================================

Original author: Richard Gooch <rgooch@atnf.csiro.au>

- Copyright (C) 1999 Richard Gooch
- Copyright (C) 2005 Pekka Enberg


Introduction
============

The Virtual File System (also known as the Virtual Filesystem Switch) is
the software layer in the kernel that provides the filesystem interface
to userspace programs.  It also provides an abstraction within the
kernel which allows different filesystem implementations to coexist.

VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
are called from a process context.  Filesystem locking is described in
the document Documentation/filesystems/locking.rst.


Directory Entry Cache (dcache)
------------------------------

The VFS implements the open(2), stat(2), chmod(2), and similar system
calls.  The pathname argument that is passed to them is used by the VFS
to search through the directory entry cache (also known as the dentry
cache or dcache).  This provides a very fast look-up mechanism to
translate a pathname (filename) into a specific dentry.  Dentries live
in RAM and are never saved to disc: they exist only for performance.


The dentry cache is meant to be a view into your entire filespace.  As
most computers cannot fit all dentries in RAM at the same time, some
bits of the cache are missing.  In order to resolve your pathname into a
dentry, the VFS may have to resort to creating dentries along the way,
and then loading the inode.  This is done by looking up the inode.


The Inode Object
----------------

An individual dentry usually has a pointer to an inode.  Inodes are
filesystem objects such as regular files, directories, FIFOs and other
beasts.  They live either on the disc (for block device filesystems) or
in memory (for pseudo filesystems).  Inodes that live on the disc
are copied into memory when required and changes to the inode are
written back to disc.  A single inode can be pointed to by multiple
dentries (hard links, for example, do this).

Looking up an inode requires that the VFS call the lookup() method of
the parent directory inode.  This method is installed by the specific
filesystem implementation that the inode lives in.  Once the VFS has the
required dentry (and hence the inode), we can do all those boring things
like open(2) the file, or stat(2) it to peek at the inode data.  The
stat(2) operation is fairly simple: once the VFS has the dentry, it
peeks at the inode data and passes some of it back to userspace.

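
The interplay between the dentry cache and the lookup() method can be
modelled in a few lines of userspace C.  The following is only a sketch
under stated assumptions, not kernel code: every ``toy_*`` name is
hypothetical, the cache is a plain linked list rather than a hash table,
negative dentries are not cached, and no locking or reference counting
is attempted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for the kernel objects; all names are hypothetical. */
struct toy_inode {
    unsigned long ino;
    /* Installed by the "filesystem"; NULL means "not a directory". */
    struct toy_inode *(*lookup)(struct toy_inode *dir, const char *name);
};

struct toy_dentry {
    char name[32];
    struct toy_dentry *parent;
    struct toy_inode *inode;
    struct toy_dentry *next;   /* linked list standing in for the dcache */
};

static struct toy_inode *toy_fs_lookup(struct toy_inode *dir, const char *name);

/* A three-object filesystem: "/", "/etc" and "/etc/passwd". */
static struct toy_inode toy_root   = { 1, toy_fs_lookup };
static struct toy_inode toy_etc    = { 2, toy_fs_lookup };
static struct toy_inode toy_passwd = { 3, NULL };

static struct toy_dentry toy_root_dentry = { "/", NULL, &toy_root, NULL };
static struct toy_dentry *toy_dcache;    /* cached dentries */
static unsigned int toy_lookup_calls;    /* how often the filesystem was asked */

static struct toy_inode *toy_fs_lookup(struct toy_inode *dir, const char *name)
{
    toy_lookup_calls++;
    if (dir == &toy_root && strcmp(name, "etc") == 0)
        return &toy_etc;
    if (dir == &toy_etc && strcmp(name, "passwd") == 0)
        return &toy_passwd;
    return NULL;
}

/* Fast path: search the cache for a child of @parent called @name. */
static struct toy_dentry *toy_d_lookup(struct toy_dentry *parent, const char *name)
{
    struct toy_dentry *d;

    for (d = toy_dcache; d; d = d->next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;
    return NULL;
}

/* Resolve an absolute path one component at a time; NULL if it does not exist. */
static struct toy_dentry *toy_path_walk(const char *path)
{
    struct toy_dentry *cur = &toy_root_dentry;
    const char *p = path;

    assert(path != NULL);
    for (;;) {
        char name[sizeof(cur->name)];
        struct toy_dentry *child;
        size_t len;

        while (*p == '/')
            p++;
        if (*p == '\0')
            return cur;
        len = strcspn(p, "/");
        if (len >= sizeof(name))
            return NULL;
        memcpy(name, p, len);
        name[len] = '\0';
        p += len;

        if (!cur->inode || !cur->inode->lookup)
            return NULL;                 /* cannot descend into a non-directory */
        child = toy_d_lookup(cur, name);
        if (!child) {
            /* Cache miss: ask the parent directory's lookup() method. */
            struct toy_inode *inode = cur->inode->lookup(cur->inode, name);

            if (!inode || !(child = calloc(1, sizeof(*child))))
                return NULL;
            memcpy(child->name, name, len + 1);
            child->parent = cur;
            child->inode = inode;
            child->next = toy_dcache;
            toy_dcache = child;
        }
        cur = child;
    }
}

/* stat(2)-like helper: resolve the path, then peek at the inode data. */
static int toy_stat(const char *path, unsigned long *ino)
{
    struct toy_dentry *d = toy_path_walk(path);

    if (!d)
        return -ENOENT;
    *ino = d->inode->ino;
    return 0;
}
```

Resolving ``/etc/passwd`` for the first time calls the toy lookup()
method twice, once per component; a second resolution of the same path
is served entirely from the cache.
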


The File Object
---------------

Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file descriptors).
The freshly allocated file structure is initialized with a pointer to
the dentry and a set of file operation member functions.  These are
taken from the inode data.  The open() file method is then called so the
specific filesystem implementation can do its work.  You can see that
this is another switch performed by the VFS.  The file structure is
placed into the file descriptor table for the process.

Reading, writing and closing files (and other assorted VFS operations)
are done by using the userspace file descriptor to grab the appropriate
file structure, and then calling the required file structure method to
do whatever is required.  For as long as the file is open, it keeps the
dentry in use, which in turn means that the VFS inode is still in use.


Registering and Mounting a Filesystem
=====================================

To register and unregister a filesystem, use the following API
functions:

.. code-block:: c

    #include <linux/fs.h>

    extern int register_filesystem(struct file_system_type *);
    extern int unregister_filesystem(struct file_system_type *);

The passed struct file_system_type describes your filesystem.  When a
request is made to mount a filesystem onto a directory in your
namespace, the VFS will call the appropriate mount() method for the
specific filesystem.  A new vfsmount referring to the tree returned by
->mount() will be attached to the mountpoint, so that when pathname
resolution reaches the mountpoint it will jump into the root of that
vfsmount.

You can see all filesystems that are registered with the kernel in the
file /proc/filesystems.


struct file_system_type
-----------------------

This describes the filesystem.  As of kernel 2.6.39, the following
members are defined:

.. code-block:: c

    struct file_system_type {
        const char *name;
        int fs_flags;
        struct dentry *(*mount) (struct file_system_type *, int,
                                 const char *, void *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type * next;
        struct list_head fs_supers;
        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
    };

``name``
    the name of the filesystem type, such as "ext2", "iso9660",
    "msdos" and so on

``fs_flags``
    various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)

``mount``
    the method to call when a new instance of this filesystem should
    be mounted

``kill_sb``
    the method to call when an instance of this filesystem should be
    shut down

``owner``
    for internal VFS use: you should initialize this to THIS_MODULE
    in most cases.


``next``
    for internal VFS use: you should initialize this to NULL

``s_lock_key``, ``s_umount_key``
    lockdep-specific

The mount() method has the following arguments:

``struct file_system_type *fs_type``
    describes the filesystem, partly initialized by the specific
    filesystem code

``int flags``
    mount flags

``const char *dev_name``
    the device name we are mounting.

``void *data``
    arbitrary mount options, usually comes as an ASCII string (see
    "Mount Options" section)

The mount() method must return the root dentry of the tree requested by
the caller.  An active reference to its superblock must be grabbed and
the superblock must be locked.  On failure it should return
ERR_PTR(error).

The arguments match those of mount(2) and their interpretation depends
on filesystem type.  E.g. for block filesystems, dev_name is interpreted
as a block device name; that device is opened and, if it contains a
suitable filesystem image, the method creates and initializes struct
super_block accordingly, returning its root dentry to the caller.

->mount() may choose to return a subtree of an existing filesystem - it
doesn't have to create a new one.
The main result from the caller's
point of view is a reference to the dentry at the root of the (sub)tree
to be attached; creation of a new superblock is a common side effect.

The most interesting member of the superblock structure that the mount()
method fills in is the "s_op" field.  This is a pointer to a "struct
super_operations" which describes the next level of the filesystem
implementation.

Usually, a filesystem uses one of the generic mount() implementations
and provides a fill_super() callback instead.  The generic variants are:

``mount_bdev``
    mount a filesystem residing on a block device

``mount_nodev``
    mount a filesystem that is not backed by a device

``mount_single``
    mount a filesystem which shares the instance between all mounts

A fill_super() callback implementation has the following arguments:

``struct super_block *sb``
    the superblock structure.  The callback must initialize this
    properly.

``void *data``
    arbitrary mount options, usually comes as an ASCII string (see
    "Mount Options" section)

``int silent``
    whether or not to be silent on error


The Superblock Object
=====================

A superblock object represents a mounted filesystem.

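
How registration, the mount() switch and a fill_super() callback fit
together to produce a filled-in superblock can be modelled in userspace.
The sketch below is a simplification, not the kernel API: all ``toy_*``
and ``ramtoy_*`` names are hypothetical, the toy mount() returns an int
and fills a caller-supplied superblock instead of returning a root
dentry, and the error codes are choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced superblock. */
struct toy_super_block {
    const char *s_id;        /* name of the owning filesystem type */
    unsigned long s_magic;
    int s_root_ino;
};

struct toy_fs_type {
    const char *name;
    int (*mount)(struct toy_fs_type *fs, int flags, const char *dev_name,
                 void *data, struct toy_super_block *sb);
    struct toy_fs_type *next;   /* managed by the registry */
};

typedef int (*toy_fill_super_t)(struct toy_super_block *sb, void *data, int silent);

static struct toy_fs_type *toy_file_systems;   /* registered types */

static int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs && fs->name && fs->mount);
    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -EBUSY;          /* a type with this name already exists */
    fs->next = NULL;
    *p = fs;
    return 0;
}

static int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    return -EINVAL;                 /* was never registered */
}

/* Generic helper for device-less filesystems: set up sb, then call back. */
static int toy_mount_nodev(struct toy_fs_type *fs, int flags, void *data,
                           toy_fill_super_t fill_super, struct toy_super_block *sb)
{
    (void)flags;
    memset(sb, 0, sizeof(*sb));
    sb->s_id = fs->name;
    return fill_super(sb, data, 0);
}

/* Filesystem-specific part: only the fill_super() callback does real work. */
static int ramtoy_fill_super(struct toy_super_block *sb, void *data, int silent)
{
    (void)silent;
    /* @data is the option string; this toy accepts NULL or "ro" only. */
    if (data && strcmp((const char *)data, "ro") != 0)
        return -EINVAL;
    sb->s_magic = 0x746f79UL;       /* arbitrary value for the example */
    sb->s_root_ino = 1;
    return 0;
}

static int ramtoy_mount(struct toy_fs_type *fs, int flags, const char *dev_name,
                        void *data, struct toy_super_block *sb)
{
    (void)dev_name;                 /* no backing device */
    return toy_mount_nodev(fs, flags, data, ramtoy_fill_super, sb);
}

static struct toy_fs_type ramtoy_fs_type = { "ramtoy", ramtoy_mount, NULL };

/* The "switch": find the type by name and call its mount() method. */
static int toy_do_mount(const char *type, const char *dev_name, int flags,
                        void *data, struct toy_super_block *sb)
{
    struct toy_fs_type *fs;

    for (fs = toy_file_systems; fs; fs = fs->next)
        if (strcmp(fs->name, type) == 0)
            return fs->mount(fs, flags, dev_name, data, sb);
    return -ENODEV;
}
```

Mounting "ramtoy" fails with -ENODEV until the type has been registered;
afterwards the generic helper calls the filesystem's fill_super()
callback, which is the only place that knows how to initialize the
superblock.
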


struct super_operations
-----------------------

This describes how the VFS can manipulate the superblock of your
filesystem.  As of kernel 2.6.22, the following members are defined:

.. code-block:: c

    struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);

        void (*dirty_inode) (struct inode *, int flags);
        int (*write_inode) (struct inode *, int);
        void (*drop_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        int (*freeze_fs) (struct super_block *);
        int (*unfreeze_fs) (struct super_block *);
        int (*statfs) (struct dentry *, struct kstatfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);

        int (*show_options)(struct seq_file *, struct dentry *);

        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
        int (*nr_cached_objects)(struct super_block *);
        void (*free_cached_objects)(struct super_block *, int);
    };

All methods are called without any locks being held, unless otherwise
noted.  This means that most methods can block safely.  All methods are
only called from a process context (i.e. not from an interrupt handler
or bottom half).

``alloc_inode``
    this method is called by alloc_inode() to allocate memory for
    struct inode and initialize it.  If this function is not
    defined, a simple 'struct inode' is allocated.  Normally
    alloc_inode will be used to allocate a larger structure which
    contains a 'struct inode' embedded within it.

``destroy_inode``
    this method is called by destroy_inode() to release resources
    allocated for struct inode.  It is only required if
    ->alloc_inode was defined and simply undoes anything done by
    ->alloc_inode.

``dirty_inode``
    this method is called by the VFS to mark an inode dirty.

``write_inode``
    this method is called when the VFS needs to write an inode to
    disc.  The second parameter indicates whether the write should
    be synchronous or not; not all filesystems check this flag.

``drop_inode``
    called when the last access to the inode is dropped, with the
    inode->i_lock spinlock held.

    This method should be either NULL (normal UNIX filesystem
    semantics) or "generic_delete_inode" (for filesystems that do
    not want to cache inodes - causing "delete_inode" to always be
    called regardless of the value of i_nlink)

    The "generic_delete_inode()" behavior is equivalent to the old
    practice of using "force_delete" in the put_inode() case, but
    does not have the races that the "force_delete()" approach had.

``delete_inode``
    called when the VFS wants to delete an inode

``put_super``
    called when the VFS wishes to free the superblock
    (i.e. unmount).  This is called with the superblock lock held

``sync_fs``
    called when VFS is writing out all dirty data associated with a
    superblock.  The second parameter indicates whether the method
    should wait until the write out has been completed.  Optional.

``freeze_fs``
    called when VFS is locking a filesystem and forcing it into a
    consistent state.  This method is currently used by the Logical
    Volume Manager (LVM).

``unfreeze_fs``
    called when VFS is unlocking a filesystem and making it writable
    again.

``statfs``
    called when the VFS needs to get filesystem statistics.

``remount_fs``
    called when the filesystem is remounted.  This is called with
    the kernel lock held

``clear_inode``
    called when the VFS clears the inode.  Optional

``umount_begin``
    called when the VFS is unmounting a filesystem.

``show_options``
    called by the VFS to show mount options for /proc/<pid>/mounts.
    (see "Mount Options" section)

``quota_read``
    called by the VFS to read from filesystem quota file.

``quota_write``
    called by the VFS to write to filesystem quota file.

``nr_cached_objects``
    called by the sb cache shrinking function for the filesystem to
    return the number of freeable cached objects it contains.
    Optional.

``free_cached_objects``
    called by the sb cache shrinking function for the filesystem to
    scan the number of objects indicated to try to free them.
    Optional, but any filesystem implementing this method needs to
    also implement ->nr_cached_objects for it to be called
    correctly.

    We can't do anything with any errors that the filesystem might
    have encountered, hence the void return type.  This will never be
    called if the VM is trying to reclaim under GFP_NOFS conditions,
    hence this method does not need to handle that situation itself.

    Implementations must include conditional reschedule calls inside
    any scanning loop that is done.  This allows the VFS to
    determine appropriate scan batch sizes without having to worry
    about whether implementations will cause holdoff problems due to
    large scan batch sizes.

Whoever sets up the inode is responsible for filling in the "i_op"
field.  This is a pointer to a "struct inode_operations" which describes
the methods that can be performed on individual inodes.


struct xattr_handlers
---------------------

On filesystems that support extended attributes (xattrs), the s_xattr
superblock field points to a NULL-terminated array of xattr handlers.
Extended attributes are name:value pairs.

``name``
    Indicates that the handler matches attributes with the specified
    name (such as "system.posix_acl_access"); the prefix field must
    be NULL.

``prefix``
    Indicates that the handler matches all attributes with the
    specified name prefix (such as "user."); the name field must be
    NULL.

``list``
    Determine if attributes matching this xattr handler should be
    listed for a particular dentry.  Used by some listxattr
    implementations like generic_listxattr.


``get``
    Called by the VFS to get the value of a particular extended
    attribute.  This method is called by the getxattr(2) system
    call.

``set``
    Called by the VFS to set the value of a particular extended
    attribute.  When the new value is NULL, called to remove a
    particular extended attribute.  This method is called by the
    setxattr(2) and removexattr(2) system calls.

When none of the xattr handlers of a filesystem match the specified
attribute name or when a filesystem doesn't support extended attributes,
the various ``*xattr(2)`` system calls return -EOPNOTSUPP.


The Inode Object
================

An inode object represents an object within the filesystem.


struct inode_operations
-----------------------

This describes how the VFS can manipulate an inode in your filesystem.
As of kernel 2.6.22, the following members are defined:

.. code-block:: c

    struct inode_operations {
        int (*create) (struct inode *,struct dentry *, umode_t, bool);
        struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,umode_t);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                       struct inode *, struct dentry *, unsigned int);
        int (*readlink) (struct dentry *, char __user *,int);
        const char *(*get_link) (struct dentry *, struct inode *,
                                 struct delayed_call *);
        int (*permission) (struct inode *, int);
        int (*get_acl)(struct inode *, int);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        void (*update_time)(struct inode *, struct timespec *, int);
        int (*atomic_open)(struct inode *, struct dentry *, struct file *,
                           unsigned open_flag, umode_t create_mode);
        int (*tmpfile) (struct inode *, struct dentry *, umode_t);
    };

Again, all methods are called without any locks being held, unless
otherwise noted.

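
Because the operations table is a plain struct of function pointers, a
filesystem fills in only the methods it supports and leaves the rest
NULL.  The userspace sketch below illustrates that pattern with a
reduced, hypothetical table: the ``toy_*`` and ``flat_*`` names are
invented for this example, and returning -EPERM for a missing method is
an assumption of the sketch rather than a statement about every VFS
entry point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode;

struct toy_dentry {
    const char *name;
    struct toy_inode *inode;    /* NULL: negative dentry, nothing there yet */
};

/* Reduced, hypothetical operations table. */
struct toy_inode_operations {
    int (*create)(struct toy_inode *dir, struct toy_dentry *dentry, unsigned int mode);
    int (*mkdir)(struct toy_inode *dir, struct toy_dentry *dentry, unsigned int mode);
    int (*unlink)(struct toy_inode *dir, struct toy_dentry *dentry);
};

struct toy_inode {
    unsigned long ino;
    unsigned int mode;
    const struct toy_inode_operations *i_op;
};

/* The only inode this toy filesystem can hand out. */
static struct toy_inode flat_file = { 2, 0, NULL };

static int flat_create(struct toy_inode *dir, struct toy_dentry *dentry, unsigned int mode)
{
    (void)dir;
    flat_file.mode = mode;
    dentry->inode = &flat_file;     /* stands in for d_instantiate() */
    return 0;
}

static int flat_unlink(struct toy_inode *dir, struct toy_dentry *dentry)
{
    (void)dir;
    dentry->inode = NULL;
    return 0;
}

/* A "flat" filesystem: regular files only, so .mkdir stays NULL. */
static const struct toy_inode_operations flat_dir_iops = {
    .create = flat_create,
    .unlink = flat_unlink,
};

static struct toy_inode flat_root = { 1, 040755, &flat_dir_iops };

/* VFS-side wrappers: generic checks first, then the filesystem method. */
static int toy_vfs_create(struct toy_inode *dir, struct toy_dentry *dentry, unsigned int mode)
{
    if (dentry->inode)
        return -EEXIST;             /* create() expects a negative dentry */
    if (!dir->i_op || !dir->i_op->create)
        return -EPERM;              /* assumption: missing method means "not permitted" */
    return dir->i_op->create(dir, dentry, mode);
}

static int toy_vfs_mkdir(struct toy_inode *dir, struct toy_dentry *dentry, unsigned int mode)
{
    if (dentry->inode)
        return -EEXIST;
    if (!dir->i_op || !dir->i_op->mkdir)
        return -EPERM;
    return dir->i_op->mkdir(dir, dentry, mode);
}

static int toy_vfs_unlink(struct toy_inode *dir, struct toy_dentry *dentry)
{
    if (!dentry->inode)
        return -ENOENT;
    if (!dir->i_op || !dir->i_op->unlink)
        return -EPERM;
    return dir->i_op->unlink(dir, dentry);
}
```

With this table, creating and unlinking a file in ``flat_root`` works,
while an attempt to create a subdirectory is rejected by the wrapper
before any filesystem code runs.
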

``create``
    called by the open(2) and creat(2) system calls.  Only required
    if you want to support regular files.  The dentry you get should
    not have an inode (i.e. it should be a negative dentry).  Here
    you will probably call d_instantiate() with the dentry and the
    newly created inode

``lookup``
    called when the VFS needs to look up an inode in a parent
    directory.  The name to look for is found in the dentry.  This
    method must call d_add() to insert the found inode into the
    dentry.  The "i_count" field in the inode structure should be
    incremented.  If the named inode does not exist a NULL inode
    should be inserted into the dentry (this is called a negative
    dentry).  Returning an error code from this routine must only be
    done on a real error, otherwise creating inodes with system
    calls like creat(2), mknod(2), mkdir(2) and so on will fail.
    If you wish to overload the dentry methods then you should
    initialise the "d_op" field in the dentry; this is a pointer to
    a struct "dentry_operations".  This method is called with the
    directory inode semaphore held

``link``
    called by the link(2) system call.  Only required if you want to
    support hard links.  You will probably need to call
    d_instantiate() just as you would in the create() method

``unlink``
    called by the unlink(2) system call.  Only required if you want
    to support deleting inodes

``symlink``
    called by the symlink(2) system call.  Only required if you want
    to support symlinks.  You will probably need to call
    d_instantiate() just as you would in the create() method

``mkdir``
    called by the mkdir(2) system call.  Only required if you want
    to support creating subdirectories.  You will probably need to
    call d_instantiate() just as you would in the create() method

``rmdir``
    called by the rmdir(2) system call.  Only required if you want
    to support deleting subdirectories

``mknod``
    called by the mknod(2) system call to create a device (char,
    block) inode or a named pipe (FIFO) or socket.  Only required if
    you want to support creating these types of inodes.  You will
    probably need to call d_instantiate() just as you would in the
    create() method

``rename``
    called by the rename(2) system call to rename the object to have
    the parent and name given by the second inode and dentry.

    The filesystem must return -EINVAL for any unsupported or
    unknown flags.  Currently the following flags are implemented:

    (1) RENAME_NOREPLACE: this flag indicates that if the target of
    the rename exists the rename should fail with -EEXIST instead of
    replacing the target.  The VFS already checks for existence, so
    for local filesystems the RENAME_NOREPLACE implementation is
    equivalent to plain rename.

    (2) RENAME_EXCHANGE: exchange source and target.  Both must
    exist; this is checked by the VFS.  Unlike plain rename, source
    and target may be of different type.

``get_link``
    called by the VFS to follow a symbolic link to the inode it
    points to.  Only required if you want to support symbolic links.
    This method returns the symlink body to traverse (and possibly
    resets the current position with nd_jump_link()).  If the body
    won't go away until the inode is gone, nothing else is needed;
    if it needs to be otherwise pinned, arrange for its release by
    having get_link(..., ..., done) do set_delayed_call(done,
    destructor, argument).  In that case destructor(argument) will
    be called once VFS is done with the body you've returned.  May
    be called in RCU mode; that is indicated by NULL dentry
    argument.  If request can't be handled without leaving RCU mode,
    have it return ERR_PTR(-ECHILD).

    If the filesystem stores the symlink target in ->i_link, the
    VFS may use it directly without calling ->get_link(); however,
    ->get_link() must still be provided.  ->i_link must not be
    freed until after an RCU grace period.  Writing to ->i_link
    post-iget() time requires a 'release' memory barrier.


``readlink``
    this is now just an override for use by readlink(2) for the
    cases when ->get_link uses nd_jump_link() or object is not in
    fact a symlink.  Normally filesystems should only implement
    ->get_link for symlinks and readlink(2) will automatically use
    that.

``permission``
    called by the VFS to check for access rights on a POSIX-like
    filesystem.

    May be called in rcu-walk mode (mask & MAY_NOT_BLOCK).  If in
    rcu-walk mode, the filesystem must check the permission without
    blocking or storing to the inode.

    If a situation is encountered that rcu-walk cannot handle,
    return -ECHILD and it will be called again in ref-walk mode.

``setattr``
    called by the VFS to set attributes for a file.  This method is
    called by chmod(2) and related system calls.

``getattr``
    called by the VFS to get attributes of a file.  This method is
    called by stat(2) and related system calls.

``listxattr``
    called by the VFS to list all extended attributes for a given
    file.  This method is called by the listxattr(2) system call.

``update_time``
    called by the VFS to update a specific time or the i_version of
    an inode.  If this is not defined the VFS will update the inode
    itself and call mark_inode_dirty_sync.


``atomic_open``
        called on the last component of an open. Using this optional
        method the filesystem can look up, possibly create and open the
        file in one atomic operation. If it wants to leave actual
        opening to the caller (e.g. if the file turned out to be a
        symlink, device, or just something the filesystem won't do
        atomic open for), it may signal this by returning
        finish_no_open(file, dentry). This method is only called if the
        last component is negative or needs lookup. Cached positive
        dentries are still handled by f_op->open(). If the file was
        created, the FMODE_CREATED flag should be set in file->f_mode.
        In case of O_EXCL the method must only succeed if the file
        didn't exist and hence FMODE_CREATED shall always be set on
        success.

``tmpfile``
        called at the end of O_TMPFILE open(). Optional, equivalent to
        atomically creating, opening and unlinking a file in a given
        directory.


The Address Space Object
========================

The address space object is used to group and manage pages in the page
cache. It can be used to keep track of the pages in a file (or anything
else) and also track the mapping of sections of the file into process
address spaces.

There are a number of distinct yet related services that an
address-space can provide.
These include communicating memory pressure,
page lookup by address, and keeping track of pages tagged as Dirty or
Writeback.

The first can be used independently of the others. The VM can try to
either write dirty pages in order to clean them, or release clean pages
in order to reuse them. To do this it can call the ->writepage method
on dirty pages, and ->releasepage on clean pages with PagePrivate set.
Clean pages without PagePrivate and with no external references will be
released without notice being given to the address_space.

To achieve this functionality, pages need to be placed on an LRU with
lru_cache_add and mark_page_active needs to be called whenever the page
is used.

Pages are normally kept in a radix tree indexed by ->index. This tree
maintains information about the PG_Dirty and PG_Writeback status of each
page, so that pages with either of these flags can be found quickly.

The Dirty tag is primarily used by mpage_writepages - the default
->writepages method. It uses the tag to find dirty pages to call
->writepage on. If mpage_writepages is not used (i.e. the address_space
provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is almost
unused. write_inode_now and sync_inode do use it (through
__sync_single_inode) to check if ->writepages has been successful in
writing out the whole address_space.
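
The way a writeback pass uses the Dirty tag to find its work can be
illustrated with a small userspace model. It is a sketch only: a fixed
array stands in for the kernel's tree, and ``toy_mapping``,
``toy_set_tag``, ``toy_clear_tag``, ``toy_find_tagged`` and
``toy_writepages`` are hypothetical names, not kernel interfaces.

.. code-block:: c

        #include <assert.h>
        #include <stddef.h>

        #define TOY_NR_PAGES 8

        enum toy_tag {
            TOY_TAG_DIRTY = 1 << 0,
            TOY_TAG_WRITEBACK = 1 << 1,
        };

        /* One tag byte per page, indexed by the page's ->index. */
        struct toy_mapping {
            unsigned char tags[TOY_NR_PAGES];
        };

        void toy_set_tag(struct toy_mapping *m, size_t index, enum toy_tag tag)
        {
            if (index < TOY_NR_PAGES)
                m->tags[index] |= (unsigned char)tag;
        }

        void toy_clear_tag(struct toy_mapping *m, size_t index, enum toy_tag tag)
        {
            if (index < TOY_NR_PAGES)
                m->tags[index] &= (unsigned char)~tag;
        }

        /* Store up to max tagged indices in out[]; return how many were found. */
        size_t toy_find_tagged(const struct toy_mapping *m, enum toy_tag tag,
                               size_t *out, size_t max)
        {
            size_t i, n = 0;

            for (i = 0; i < TOY_NR_PAGES && n < max; i++)
                if (m->tags[i] & tag)
                    out[n++] = i;
            return n;
        }

        /* Model of a writepages pass: every dirty page moves to writeback. */
        size_t toy_writepages(struct toy_mapping *m)
        {
            size_t batch[TOY_NR_PAGES];
            size_t i, n = toy_find_tagged(m, TOY_TAG_DIRTY, batch, TOY_NR_PAGES);

            for (i = 0; i < n; i++) {
                toy_clear_tag(m, batch[i], TOY_TAG_DIRTY);
                toy_set_tag(m, batch[i], TOY_TAG_WRITEBACK);
            }
            return n;
        }

After a pass, a lookup by the Dirty tag finds nothing and a lookup by
the Writeback tag finds the pages that a waiter would have to wait for.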

The Writeback tag is used by filemap*wait* and sync_page* functions, via
filemap_fdatawait_range, to wait for all writeback to complete.

An address_space handler may attach extra information to a page,
typically using the 'private' field in the 'struct page'. If such
information is attached, the PG_Private flag should be set. This will
cause various VM routines to make extra calls into the address_space
handler to deal with that data.

An address space acts as an intermediary between storage and
application. Data is read into the address space a whole page at a
time, and provided to the application either by copying of the page, or
by memory-mapping the page. Data is written into the address space by
the application, and then written back to storage typically in whole
pages; however, the address_space has finer control of write sizes.

The read process essentially only requires 'readpage'. The write
process is more complicated and uses write_begin/write_end or
set_page_dirty to write data into the address_space, and writepage and
writepages to write data back to storage.

Adding and removing pages to/from an address_space is protected by the
inode's i_mutex.

When data is written to a page, the PG_Dirty flag should be set. It
typically remains set until writepage asks for it to be written. This
should clear PG_Dirty and set PG_Writeback.
It can actually be written
at any point after PG_Dirty is clear. Once it is known to be safe,
PG_Writeback is cleared.

Writeback makes use of a writeback_control structure to direct the
operations. This gives the writepage and writepages operations some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
return information back to the caller about the result of a writepage or
writepages request.


Handling errors during writeback
--------------------------------

Most applications that do buffered I/O will periodically call a file
synchronization call (fsync, fdatasync, msync or sync_file_range) to
ensure that data written has made it to the backing store. When there
is an error during writeback, they expect that error to be reported when
a file sync request is made. After an error has been reported on one
request, subsequent requests on the same file descriptor should return
0, unless further writeback errors have occurred since the previous file
synchronization.

Ideally, the kernel would report errors only on file descriptions on
which writes were done that subsequently failed to be written back. The
generic pagecache infrastructure does not track the file descriptions
that have dirtied each individual page however, so determining which
file descriptors should get back an error is not possible.

Instead, the generic writeback error tracking infrastructure in the
kernel settles for reporting errors to fsync on all file descriptions
that were open at the time that the error occurred. In a situation with
multiple writers, all of them will get back an error on a subsequent
fsync, even if all of the writes done through that particular file
descriptor succeeded (or even if there were no writes on that file
descriptor at all).

Filesystems that wish to use this infrastructure should call
mapping_set_error to record the error in the address_space when it
occurs. Then, after writing back data from the pagecache in their
file->fsync operation, they should call file_check_and_advance_wb_err to
ensure that the struct file's error cursor has advanced to the correct
point in the stream of errors emitted by the backing device(s).


struct address_space_operations
-------------------------------

This describes how the VFS can manipulate mapping of a file to page
cache in your filesystem. The following members are defined:

.. code-block:: c

        struct address_space_operations {
                int (*writepage)(struct page *page, struct writeback_control *wbc);
                int (*readpage)(struct file *, struct page *);
                int (*writepages)(struct address_space *, struct writeback_control *);
                int (*set_page_dirty)(struct page *page);
                void (*readahead)(struct readahead_control *);
                int (*readpages)(struct file *filp, struct address_space *mapping,
                                 struct list_head *pages, unsigned nr_pages);
                int (*write_begin)(struct file *, struct address_space *mapping,
                                   loff_t pos, unsigned len, unsigned flags,
                                   struct page **pagep, void **fsdata);
                int (*write_end)(struct file *, struct address_space *mapping,
                                 loff_t pos, unsigned len, unsigned copied,
                                 struct page *page, void *fsdata);
                sector_t (*bmap)(struct address_space *, sector_t);
                void (*invalidatepage) (struct page *, unsigned int, unsigned int);
                int (*releasepage) (struct page *, int);
                void (*freepage)(struct page *);
                ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
                /* isolate a page for migration */
                bool (*isolate_page) (struct page *, isolate_mode_t);
                /* migrate the contents of a page to the specified target */
                int (*migratepage) (struct page *, struct page *);
                /* put migration-failed page back to right list */
                void (*putback_page) (struct page *);
                int (*launder_page) (struct page *);

                int (*is_partially_uptodate) (struct page *, unsigned long,
                                              unsigned long);
                void (*is_dirty_writeback) (struct page *, bool *, bool *);
                int (*error_remove_page) (struct address_space *mapping, struct page *page);
                int (*swap_activate)(struct file *);
                int (*swap_deactivate)(struct file *);
        };

``writepage``
        called by the VM to write a dirty page to backing store. This
        may happen for data integrity reasons (i.e. 'sync'), or to free
        up memory (flush). The difference can be seen in
        wbc->sync_mode. The PG_Dirty flag has been cleared and
        PageLocked is true. writepage should start writeout, should set
        PG_Writeback, and should make sure the page is unlocked, either
        synchronously or asynchronously when the write operation
        completes.

        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
        try too hard if there are problems, and may choose to write out
        other pages from the mapping if that is easier (e.g. due to
        internal dependencies). If it chooses not to start writeout, it
        should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
        keep calling ->writepage on that page.

        See Documentation/filesystems/locking.rst for more details.

``readpage``
        called by the VM to read a page from backing store. The page
        will be Locked when readpage is called, and should be unlocked
        and marked uptodate once the read completes.
        If ->readpage
        discovers that it needs to unlock the page for some reason, it
        can do so, and then return AOP_TRUNCATED_PAGE. In this case,
        the page will be looked up again, relocked and if that all
        succeeds, ->readpage will be called again.

``writepages``
        called by the VM to write out pages associated with the
        address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
        the writeback_control will specify a range of pages that must be
        written out. If it is WB_SYNC_NONE, then a nr_to_write is
        given and that many pages should be written if possible. If no
        ->writepages is given, then mpage_writepages is used instead.
        This will choose pages from the address space that are tagged as
        DIRTY and will pass them to ->writepage.

``set_page_dirty``
        called by the VM to set a page dirty. This is particularly
        needed if an address space attaches private data to a page, and
        that data needs to be updated when a page is dirtied. This is
        called, for example, when a memory mapped page gets modified.
        If defined, it should set the PageDirty flag, and the
        PAGECACHE_TAG_DIRTY tag in the radix tree.

``readahead``
        Called by the VM to read pages associated with the address_space
        object. The pages are consecutive in the page cache and are
        locked. The implementation should decrement the page refcount
        after starting I/O on each page.
        Usually the page will be
        unlocked by the I/O completion handler. If the filesystem decides
        to stop attempting I/O before reaching the end of the readahead
        window, it can simply return. The caller will decrement the page
        refcount and unlock the remaining pages for you. Set PageUptodate
        if the I/O completes successfully. Setting PageError on any page
        will be ignored; simply unlock the page if an I/O error occurs.

``readpages``
        called by the VM to read pages associated with the address_space
        object. This is essentially just a vector version of readpage.
        Instead of just one page, several pages are requested.
        readpages is only used for read-ahead, so read errors are
        ignored. If anything goes wrong, feel free to give up.
        This interface is deprecated and will be removed by the end of
        2020; implement readahead instead.

``write_begin``
        Called by the generic buffered write code to ask the filesystem
        to prepare to write len bytes at the given offset in the file.
        The address_space should check that the write will be able to
        complete, by allocating space if necessary and doing any other
        internal housekeeping. If the write will update parts of any
        basic-blocks on storage, then those blocks should be pre-read
        (if they haven't been read already) so that the updated blocks
        can be written out properly.

        The filesystem must return the locked pagecache page for the
        specified offset, in ``*pagep``, for the caller to write into.

        It must be able to cope with short writes (where the length
        passed to write_begin is greater than the number of bytes copied
        into the page).

        flags is a field for AOP_FLAG_xxx flags, described in
        include/linux/fs.h.

        A void * may be returned in fsdata, which then gets passed into
        write_end.

        Returns 0 on success; < 0 on failure (which is the error code),
        in which case write_end is not called.

``write_end``
        After a successful write_begin, and data copy, write_end must be
        called. len is the original len passed to write_begin, and
        copied is the amount that was able to be copied.

        The filesystem must take care of unlocking the page and
        releasing its refcount, and updating i_size.

        Returns < 0 on failure, otherwise the number of bytes (<=
        'copied') that were able to be copied into pagecache.

``bmap``
        called by the VFS to map a logical block offset within object to
        physical block number. This method is used by the FIBMAP ioctl
        and for working with swap-files. To be able to swap to a file,
        the file must have a stable mapping to a block device.
        The swap
        system does not go through the filesystem but instead uses bmap
        to find out where the blocks in the file are and uses those
        addresses directly.

``invalidatepage``
        If a page has PagePrivate set, then invalidatepage will be
        called when part or all of the page is to be removed from the
        address space. This generally corresponds to either a
        truncation, punch hole or a complete invalidation of the address
        space (in the latter case 'offset' will always be 0 and 'length'
        will be PAGE_SIZE). Any private data associated with the page
        should be updated to reflect this truncation. If offset is 0
        and length is PAGE_SIZE, then the private data should be
        released, because the page must be able to be completely
        discarded. This may be done by calling the ->releasepage
        function, but in this case the release MUST succeed.

``releasepage``
        releasepage is called on PagePrivate pages to indicate that the
        page should be freed if possible. ->releasepage should remove
        any private data from the page and clear the PagePrivate flag.
        If releasepage() fails for some reason, it must indicate failure
        with a 0 return value. releasepage() is used in two distinct
        though related cases. The first is when the VM finds a clean
        page with no active users and wants to make it a free page. If
        ->releasepage succeeds, the page will be removed from the
        address_space and become free.

        The second case is when a request has been made to invalidate
        some or all pages in an address_space. This can happen through
        the fadvise(POSIX_FADV_DONTNEED) system call or by the
        filesystem explicitly requesting it as nfs and 9fs do (when they
        believe the cache may be out of date with storage) by calling
        invalidate_inode_pages2(). If the filesystem makes such a call,
        and needs to be certain that all pages are invalidated, then its
        releasepage will need to ensure this. Possibly it can clear the
        PageUptodate bit if it cannot free private data yet.

``freepage``
        freepage is called once the page is no longer visible in the
        page cache in order to allow the cleanup of any private data.
        Since it may be called by the memory reclaimer, it should not
        assume that the original address_space mapping still exists, and
        it should not block.

``direct_IO``
        called by the generic read/write routines to perform direct_IO -
        that is IO requests which bypass the page cache and transfer
        data directly between the storage and the application's address
        space.

``isolate_page``
        Called by the VM when isolating a movable non-lru page. If page
        is successfully isolated, VM marks the page as PG_isolated via
        __SetPageIsolated.

``migratepage``
        This is used to compact the physical memory usage. If the VM
        wants to relocate a page (maybe off a memory card that is
        signalling imminent failure) it will pass a new page and an old
        page to this function. migratepage should transfer any private
        data across and update any references that it has to the page.

``putback_page``
        Called by the VM when isolated page's migration fails.

``launder_page``
        Called before freeing a page - it writes back the dirty page.
        To prevent redirtying the page, it is kept locked during the
        whole operation.

``is_partially_uptodate``
        Called by the VM when reading a file through the pagecache when
        the underlying blocksize != pagesize. If the required block is
        up to date then the read can complete without needing the IO to
        bring the whole page up to date.

``is_dirty_writeback``
        Called by the VM when attempting to reclaim a page. The VM uses
        dirty and writeback information to determine if it needs to
        stall to allow flushers a chance to complete some IO.
        Ordinarily it can use PageDirty and PageWriteback but some
        filesystems have more complex state (unstable pages in NFS
        prevent reclaim) or do not set those flags due to locking
        problems. This callback allows a filesystem to indicate to the
        VM if a page should be treated as dirty or writeback for the
        purposes of stalling.

``error_remove_page``
        normally set to generic_error_remove_page if truncation is ok
        for this address space. Used for memory failure handling.
        Setting this implies you deal with pages going away under you,
        unless you have them locked or reference counts increased.

``swap_activate``
        Called when swapon is used on a file to allocate space if
        necessary and pin the block lookup information in memory. A
        return value of zero indicates success, in which case this file
        can be used to back swapspace.

``swap_deactivate``
        Called during swapoff on files where swap_activate was
        successful.


The File Object
===============

A file object represents a file opened by a process. This is also known
as an "open file description" in POSIX parlance.


struct file_operations
----------------------

This describes how the VFS can manipulate an open file. As of kernel
4.18, the following members are defined:

.. code-block:: c

        struct file_operations {
                struct module *owner;
                loff_t (*llseek) (struct file *, loff_t, int);
                ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
                ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
                ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
                ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
                int (*iopoll)(struct kiocb *kiocb, bool spin);
                int (*iterate) (struct file *, struct dir_context *);
                int (*iterate_shared) (struct file *, struct dir_context *);
                __poll_t (*poll) (struct file *, struct poll_table_struct *);
                long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
                long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
                int (*mmap) (struct file *, struct vm_area_struct *);
                int (*open) (struct inode *, struct file *);
                int (*flush) (struct file *, fl_owner_t id);
                int (*release) (struct inode *, struct file *);
                int (*fsync) (struct file *, loff_t, loff_t, int datasync);
                int (*fasync) (int, struct file *, int);
                int (*lock) (struct file *, int, struct file_lock *);
                ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
                unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
                int (*check_flags)(int);
                int (*flock) (struct file *, int, struct file_lock *);
                ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
                ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
                int (*setlease)(struct file *, long, struct file_lock **, void **);
                long (*fallocate)(struct file *file, int mode, loff_t offset,
                                  loff_t len);
                void (*show_fdinfo)(struct seq_file *m, struct file *f);
        #ifndef CONFIG_MMU
                unsigned (*mmap_capabilities)(struct file *);
        #endif
                ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
                loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                                           struct file *file_out, loff_t pos_out,
                                           loff_t len, unsigned int remap_flags);
                int (*fadvise)(struct file *, loff_t, loff_t, int);
        };

Again, all methods are called without any locks being held, unless
otherwise noted.

``llseek``
        called when the VFS needs to move the file position index

``read``
        called by read(2) and related system calls

``read_iter``
        possibly asynchronous read with iov_iter as destination

``write``
        called by write(2) and related system calls

``write_iter``
        possibly asynchronous write with iov_iter as source

``iopoll``
        called when aio wants to poll for completions on HIPRI iocbs

``iterate``
        called when the VFS needs to read the directory contents

``iterate_shared``
        called when the VFS needs to read the directory contents when
        filesystem supports concurrent dir iterators

``poll``
        called by the VFS when a process wants to check if there is
        activity on this file and (optionally) go to sleep until there
        is activity. Called by the select(2) and poll(2) system calls

``unlocked_ioctl``
        called by the ioctl(2) system call.

``compat_ioctl``
        called by the ioctl(2) system call when 32 bit system calls are
        used on 64 bit kernels.

``mmap``
        called by the mmap(2) system call

``open``
        called by the VFS when an inode should be opened. When the VFS
        opens a file, it creates a new "struct file". It then calls the
        open method for the newly allocated file structure. You might
        think that the open method really belongs in "struct
        inode_operations", and you may be right. I think it's done the
        way it is because it makes filesystems simpler to implement.
        The open() method is a good place to initialize the
        "private_data" member in the file structure if you want to point
        to a device structure

``flush``
        called by the close(2) system call to flush a file

``release``
        called when the last reference to an open file is closed

``fsync``
        called by the fsync(2) system call. Also see the section above
        entitled "Handling errors during writeback".
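
        The reporting rule from that section (an error is returned once
        per open file, to every file that was open when it occurred)
        can be modelled with a per-mapping sequence counter and a
        per-file cursor. This is a deliberately simplified userspace
        sketch, not the kernel's implementation: ``toy_mapping``,
        ``toy_file``, ``toy_open``, ``toy_mapping_set_error`` and
        ``toy_check_and_advance`` are hypothetical names, and the
        kernel's own error tracking differs in detail.

        .. code-block:: c

            #include <assert.h>
            #include <errno.h>

            struct toy_mapping {
                unsigned int wb_seq;    /* bumped on each recorded error */
                int wb_err;             /* most recent error, negative errno */
            };

            struct toy_file {
                struct toy_mapping *mapping;
                unsigned int wb_cursor; /* wb_seq value this file last saw */
            };

            /* Opening samples the current position in the error stream. */
            void toy_open(struct toy_file *f, struct toy_mapping *m)
            {
                f->mapping = m;
                f->wb_cursor = m->wb_seq;
            }

            /* Called where writeback fails: record the error in the mapping. */
            void toy_mapping_set_error(struct toy_mapping *m, int err)
            {
                m->wb_err = err;
                m->wb_seq++;
            }

            /* Called from fsync: report an unseen error once, then advance. */
            int toy_check_and_advance(struct toy_file *f)
            {
                if (f->wb_cursor == f->mapping->wb_seq)
                    return 0;
                f->wb_cursor = f->mapping->wb_seq;
                return f->mapping->wb_err;
            }

        In this model two files open at the time of a failure each see
        the error on their next check and 0 afterwards, while a file
        opened after the failure sees nothing.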

``fasync``
    called by the fcntl(2) system call when asynchronous
    (non-blocking) mode is enabled for a file

``lock``
    called by the fcntl(2) system call for F_GETLK, F_SETLK, and
    F_SETLKW commands

``get_unmapped_area``
    called by the mmap(2) system call

``check_flags``
    called by the fcntl(2) system call for F_SETFL command

``flock``
    called by the flock(2) system call

``splice_write``
    called by the VFS to splice data from a pipe to a file. This
    method is used by the splice(2) system call

``splice_read``
    called by the VFS to splice data from a file to a pipe. This
    method is used by the splice(2) system call

``setlease``
    called by the VFS to set or release a file lock lease. setlease
    implementations should call generic_setlease to record or remove
    the lease in the inode after setting it.

``fallocate``
    called by the VFS to preallocate blocks or punch a hole.

``copy_file_range``
    called by the copy_file_range(2) system call.

``remap_file_range``
    called by the ioctl(2) system call for FICLONERANGE, FICLONE
    and FIDEDUPERANGE commands to remap file ranges. An
    implementation should remap len bytes at pos_in of the source
    file into the dest file at pos_out. Implementations must handle
    callers passing in len == 0; this means "remap to the end of the
    source file". The return value should be the number of bytes
    remapped, or the usual negative error code if errors occurred
    before any bytes were remapped. The remap_flags parameter
    accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the
    implementation must only remap if the requested file ranges have
    identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is
    ok with the implementation shortening the request length to
    satisfy alignment or EOF requirements (or any other reason).

``fadvise``
    possibly called by the fadvise64() system call.

Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
support routines in the VFS which will locate the required device
driver information. These support routines replace the filesystem file
operations with those for the device driver, and then proceed to call
the new open() method for the file. This is how opening a device file
in the filesystem eventually ends up calling the device driver open()
method.
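
The hand-off from filesystem file operations to device driver file
operations can be modelled in plain userspace C. The following is only
a sketch under stated assumptions: every ``model_*`` name is
hypothetical and is not a kernel symbol, and the driver lookup that the
real VFS support routines perform is reduced to a single fixed table.

.. code-block:: c

    #include <stddef.h>

    /* Userspace model only: these types imitate the shape of the
     * kernel structures but are hypothetical stand-ins. */
    struct model_file;

    struct model_file_operations {
        int (*open)(struct model_file *file);
    };

    struct model_file {
        const struct model_file_operations *f_op;
        void *private_data;
    };

    /* Pretend per-device state that a driver attaches to the file. */
    static int model_device_state = 42;

    /* The "device driver" open(): a natural place to set private_data. */
    static int model_driver_open(struct model_file *file)
    {
        file->private_data = &model_device_state;
        return 0;
    }

    static const struct model_file_operations model_driver_fops = {
        .open = model_driver_open,
    };

    /* Stand-in for the support routine a filesystem installs on device
     * nodes: swap in the driver's operations, then call the new open(). */
    static int model_special_open(struct model_file *file)
    {
        file->f_op = &model_driver_fops;  /* driver lookup assumed done */
        if (file->f_op->open)
            return file->f_op->open(file);
        return 0;
    }

    static const struct model_file_operations model_special_fops = {
        .open = model_special_open,
    };

    /* Model of opening a device node: the new file starts with the
     * operations chosen by the filesystem, and open() is called on it. */
    int model_vfs_open(struct model_file *file)
    {
        file->f_op = &model_special_fops;
        file->private_data = NULL;
        return file->f_op->open ? file->f_op->open(file) : 0;
    }

After ``model_vfs_open()`` returns, the file's operations table is the
driver's, so every later call made through ``f_op`` in this model goes
to the driver rather than to the filesystem.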


Directory Entry Cache (dcache)
==============================


struct dentry_operations
------------------------

This describes how a filesystem can overload the standard dentry
operations. Dentries and the dcache are the domain of the VFS and the
individual filesystem implementations. Device drivers have no business
here. These methods may be set to NULL, as they are either optional or
the VFS uses a default. As of kernel 2.6.22, the following members are
defined:

.. code-block:: c

    struct dentry_operations {
        int (*d_revalidate)(struct dentry *, unsigned int);
        int (*d_weak_revalidate)(struct dentry *, unsigned int);
        int (*d_hash)(const struct dentry *, struct qstr *);
        int (*d_compare)(const struct dentry *,
                         unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *);
        int (*d_init)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
        int (*d_manage)(const struct path *, bool);
        struct dentry *(*d_real)(struct dentry *, const struct inode *);
    };

``d_revalidate``
    called when the VFS needs to revalidate a dentry. This is
    called whenever a name look-up finds a dentry in the dcache.
    Most local filesystems leave this as NULL, because all their
    dentries in the dcache are valid. Network filesystems are
    different since things can change on the server without the
    client necessarily being aware of it.

    This function should return a positive value if the dentry is
    still valid, and zero or a negative error code if it isn't.

    d_revalidate may be called in rcu-walk mode (flags &
    LOOKUP_RCU). If in rcu-walk mode, the filesystem must
    revalidate the dentry without blocking or storing to the dentry.
    d_parent and d_inode should not be used without care (because
    they can change and, in the d_inode case, even become NULL under
    us).

    If a situation is encountered that rcu-walk cannot handle,
    return -ECHILD and it will be called again in ref-walk mode.

``d_weak_revalidate``
    called when the VFS needs to revalidate a "jumped" dentry. This
    is called when a path-walk ends at a dentry that was not acquired
    by doing a lookup in the parent directory. This includes "/",
    "." and "..", as well as procfs-style symlinks and mountpoint
    traversal.

    In this case, we are less concerned with whether the dentry is
    still fully correct, but rather that the inode is still valid.
    As with d_revalidate, most local filesystems will set this to
    NULL since their dcache entries are always valid.

    This function has the same return code semantics as
    d_revalidate.

    d_weak_revalidate is only called after leaving rcu-walk mode.

``d_hash``
    called when the VFS adds a dentry to the hash table. The
    dentry passed to d_hash is the parent directory that the name is
    to be hashed into.

    Same locking and synchronisation rules as d_compare regarding
    what is safe to dereference etc.

``d_compare``
    called to compare a dentry name with a given name. The dentry
    passed in is the child dentry to be compared. len and the name
    string are properties of the dentry to be compared. qstr is the
    name to compare it with.

    Must be constant and idempotent, should not take locks if
    possible, and should not store into the dentry. Should not
    dereference pointers outside the dentry without lots of care
    (e.g. d_parent, d_inode, d_name should not be used).

    However, our vfsmount is pinned, and RCU held, so the dentries
    and inodes won't disappear, neither will our sb or filesystem
    module. ->d_sb may be used.

    It is a tricky calling convention because it needs to be called
    under "rcu-walk", i.e. without any locks or references on things.

``d_delete``
    called when the last reference to a dentry is dropped and the
    dcache is deciding whether or not to cache it. Return 1 to
    delete immediately, or 0 to cache the dentry. Default is NULL
    which means to always cache a reachable dentry. d_delete must
    be constant and idempotent.

``d_init``
    called when a dentry is allocated

``d_release``
    called when a dentry is really deallocated

``d_iput``
    called when a dentry loses its inode (just prior to its being
    deallocated). The default when this is NULL is that the VFS
    calls iput(). If you define this method, you must call iput()
    yourself

``d_dname``
    called when the pathname of a dentry should be generated.
    Useful for some pseudo filesystems (sockfs, pipefs, ...) to
    delay pathname generation. (Instead of doing it when the dentry
    is created, it's done only when the path is needed.) Real
    filesystems probably don't want to use it, because their dentries
    are present in the global dcache hash, so their hash should be an
    invariant. As no lock is held, d_dname() should not try to
    modify the dentry itself, unless appropriate SMP safety is used.
    CAUTION: d_path() logic is quite tricky. The correct way to
    return for example "Hello" is to put it at the end of the
    buffer, and return a pointer to the first char.
    The dynamic_dname() helper function is provided to take care of
    this.

    Example:

.. code-block:: c

    static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
    {
        return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
                             dentry->d_inode->i_ino);
    }

``d_automount``
    called when an automount dentry is to be traversed (optional).
    This should create a new VFS mount record and return the record
    to the caller. The caller is supplied with a path parameter
    giving the automount directory to describe the automount target
    and the parent VFS mount record to provide inheritable mount
    parameters. NULL should be returned if someone else managed to
    make the automount first. If the vfsmount creation failed, then
    an error code should be returned. If -EISDIR is returned, then
    the directory will be treated as an ordinary directory and
    returned to pathwalk to continue walking.

    If a vfsmount is returned, the caller will attempt to mount it
    on the mountpoint and will remove the vfsmount from its
    expiration list in the case of failure. The vfsmount should be
    returned with 2 refs on it to prevent automatic expiration - the
    caller will clean up the additional ref.

    This function is only used if DCACHE_NEED_AUTOMOUNT is set on
    the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is
    set on the inode being added.

``d_manage``
    called to allow the filesystem to manage the transition from a
    dentry (optional). This allows autofs, for example, to hold up
    clients waiting to explore behind a 'mountpoint' while letting
    the daemon go past and construct the subtree there. 0 should be
    returned to let the calling process continue. -EISDIR can be
    returned to tell pathwalk to use this directory as an ordinary
    directory and to ignore anything mounted on it and not to check
    the automount flag. Any other error code will abort pathwalk
    completely.

    If the 'rcu_walk' parameter is true, then the caller is doing a
    pathwalk in RCU-walk mode. Sleeping is not permitted in this
    mode, and the caller can be asked to leave it and call again by
    returning -ECHILD. -EISDIR may also be returned to tell
    pathwalk to ignore d_automount or any mounts.

    This function is only used if DCACHE_MANAGE_TRANSIT is set on
    the dentry being transited from.

``d_real``
    overlay/union type filesystems implement this method to return
    one of the underlying dentries hidden by the overlay. It is
    used in two different modes:

    Called from file_dentry() it returns the real dentry matching
    the inode argument. The real dentry may be from a lower layer
    already copied up, but still referenced from the file. This
    mode is selected with a non-NULL inode argument.

    With NULL inode the topmost real underlying dentry is returned.

Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
directory.


Directory Entry Cache API
--------------------------

There are a number of functions defined which permit a filesystem to
manipulate dentries:

``dget``
    open a new handle for an existing dentry (this just increments
    the usage count)

``dput``
    close a handle for a dentry (decrements the usage count). If
    the usage count drops to 0, and the dentry is still in its
    parent's hash, the "d_delete" method is called to check whether
    it should be cached. If it should not be cached, or if the
    dentry is not hashed, it is deleted. Otherwise cached dentries
    are put into an LRU list to be reclaimed on memory shortage.

``d_drop``
    this unhashes a dentry from its parent's hash list. A subsequent
    call to dput() will deallocate the dentry if its usage count
    drops to 0

``d_delete``
    delete a dentry. If there are no other open references to the
    dentry then the dentry is turned into a negative dentry (the
    d_iput() method is called). If there are other references, then
    d_drop() is called instead

``d_add``
    add a dentry to its parent's hash list and then call
    d_instantiate()

``d_instantiate``
    add a dentry to the alias hash list for the inode and update
    the "d_inode" member. The "i_count" member in the inode
    structure should be set/incremented. If the inode pointer is
    NULL, the dentry is called a "negative dentry". This function
    is commonly called when an inode is created for an existing
    negative dentry

``d_lookup``
    look up a dentry given its parent and path name component. It
    looks up the child of that given name from the dcache hash
    table. If it is found, the reference count is incremented and
    the dentry is returned. The caller must use dput() to free the
    dentry when it finishes using it.
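
The decision that dput() makes when the usage count reaches 0 can be
illustrated with a small userspace model. This is a sketch of the rules
stated above and nothing more: the ``model_*`` names and the ``state``
field are hypothetical, and the real dcache additionally deals with
locking, RCU and the parent's reference, none of which is modelled here.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Where the modelled dentry ended up (hypothetical, for illustration). */
    enum model_state { MODEL_IN_USE, MODEL_ON_LRU, MODEL_FREED };

    struct model_dentry {
        int count;      /* usage count */
        bool hashed;    /* still on the parent's hash list? */
        /* NULL means "always cache a reachable dentry" */
        int (*d_delete)(const struct model_dentry *);
        enum model_state state;
    };

    /* dget: open a new handle, i.e. increment the usage count. */
    struct model_dentry *model_dget(struct model_dentry *d)
    {
        d->count++;
        d->state = MODEL_IN_USE;
        return d;
    }

    /* d_drop: unhash the dentry; the final dput will then free it. */
    void model_d_drop(struct model_dentry *d)
    {
        d->hashed = false;
    }

    /* dput: drop a handle and decide the dentry's fate on the last one. */
    void model_dput(struct model_dentry *d)
    {
        if (--d->count > 0)
            return;                     /* other handles remain */
        if (!d->hashed) {
            d->state = MODEL_FREED;     /* unhashed: delete */
            return;
        }
        if (d->d_delete && d->d_delete(d)) {
            d->state = MODEL_FREED;     /* filesystem asked not to cache */
            return;
        }
        d->state = MODEL_ON_LRU;        /* keep until memory is short */
    }

    /* Example d_delete that never wants the dentry cached. */
    int model_always_delete(const struct model_dentry *d)
    {
        (void)d;
        return 1;
    }

In this model a hashed dentry with a NULL ``d_delete`` ends up on the
LRU after its last ``model_dput()``, while one that was passed to
``model_d_drop()`` first, or whose ``d_delete`` returns 1, is freed.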


Mount Options
=============


Parsing options
---------------

On mount and remount the filesystem is passed a string containing a
comma separated list of mount options. The options can have either of
these forms::

  option
  option=value

The <linux/parser.h> header defines an API that helps parse these
options. There are plenty of examples on how to use it in existing
filesystems.


Showing options
---------------

If a filesystem accepts mount options, it must define show_options() to
show all the currently active options. The rules are:

  - options MUST be shown which are not default or their values differ
    from the default

  - options MAY be shown which are enabled by default or have their
    default value

Options used only internally between a mount helper and the kernel (such
as file descriptors), or which only have an effect during the mounting
(such as ones controlling the creation of a journal) are exempt from the
above rules.

The underlying reason for the above rules is to make sure that a mount
can be accurately replicated (e.g. unmounting and mounting again) based
on the information found in /proc/mounts.


Resources
=========

(Note some of these resources are not up-to-date with the latest kernel
version.)

Creating Linux virtual filesystems. 2002
    <https://lwn.net/Articles/13325/>

The Linux Virtual File-system Layer by Neil Brown. 1999
    <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>

A tour of the Linux VFS by Michael K. Johnson. 1996
    <https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>

A small trail through the Linux kernel by Andries Brouwer. 2001
    <https://www.win.tue.nl/~aeb/linux/vfs/trail.html>