.. SPDX-License-Identifier: GPL-2.0

==========================================
WHAT IS Flash-Friendly File System (F2FS)?
==========================================

NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
been equipped on a variety of systems ranging from mobile to server systems.
Since they are known to have different characteristics from conventional
rotating disks, a file system, an upper layer to the storage device, should
adapt to the changes from scratch at the design level.

F2FS is a file system exploiting NAND flash memory-based storage devices, which
is based on Log-structured File System (LFS). The design has been focused on
addressing the fundamental issues in LFS, which are the snowball effect of the
wandering tree and high cleaning overhead.

Since a NAND flash memory-based storage device shows different characteristics
according to its internal geometry or flash memory management scheme, namely FTL,
F2FS and its tools support various parameters not only for configuring on-disk
layout, but also for selecting allocation and cleaning algorithms.

The following git tree provides the file system formatting tool (mkfs.f2fs),
a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).

- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git

For reporting bugs and sending patches, please use the following mailing list:

- linux-f2fs-devel@lists.sourceforge.net

Background and Design issues
============================

Log-structured File System (LFS)
--------------------------------
"A log-structured file system writes all modifications to disk sequentially in
a log-like structure, thereby speeding up both file writing and crash recovery.
The log is the only structure on disk; it contains indexing information so that
files can be read back from the log efficiently. In order to maintain large free
areas on disk for fast writing, we divide the log into segments and use a
segment cleaner to compress the live information from heavily fragmented
segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
implementation of a log-structured file system", ACM Trans. Computer Systems
10, 1, 26–52.

Wandering Tree Problem
----------------------
In LFS, when file data is updated and written to the end of the log, its direct
pointer block is updated due to the changed location. Then the indirect pointer
block is also updated due to the direct pointer block update. In this manner,
the upper index structures such as inode, inode map, and checkpoint block are
also updated recursively.
This problem is called the wandering tree problem [1], and in order to enhance
performance, a file system should eliminate or relax the update propagation as
much as possible.

[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/

Cleaning Overhead
-----------------
Since LFS is based on out-of-place writes, it produces many obsolete blocks
scattered across the whole storage. In order to serve new empty log space, it
needs to reclaim these obsolete blocks seamlessly to users. This job is called
a cleaning process.

The process consists of four operations as follows.

1. A victim segment is selected by referencing the segment usage table.
2. It loads parent index structures of all the data in the victim identified by
   segment summary blocks.
3. It checks the cross-reference between the data and its parent index structure.
4. It moves valid data selectively.

This cleaning job may cause unexpectedly long delays, so the most important goal
is to hide the latencies from users. It should also reduce the amount of valid
data to be moved, and move it quickly.
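The cleaning steps listed above can be sketched as a toy model. This is only an
illustration, not F2FS code: the function names ``select_victim`` and
``clean_segment``, the dictionary layouts, and the greedy policy (pick the
segment with the fewest valid blocks) are assumptions made for this example.

```python
def select_victim(valid_counts):
    """Step 1: greedy choice, the segment with the fewest valid blocks.

    valid_counts plays the role of a segment usage table: segno -> valid blocks.
    """
    return min(valid_counts, key=lambda segno: (valid_counts[segno], segno))


def clean_segment(victim, summaries, index, log):
    """Steps 2-4 for one victim segment (all structures are hypothetical).

    summaries[segno][offset] -> (owner, slot): owner recorded when the block was written
    index[owner][slot]       -> (segno, offset): where the owner's data lives now
    log                      -> list that receives relocated (owner, slot) entries
    """
    moved = 0
    for offset, (owner, slot) in sorted(summaries[victim].items()):
        parent = index.get(owner, {})            # step 2: load the parent index
        if parent.get(slot) != (victim, offset):
            continue                             # step 3: cross-reference fails, block is obsolete
        log.append((owner, slot))                # step 4: move the valid block to the log
        parent[slot] = ("log", len(log) - 1)
        moved += 1
    return moved


valid_counts = {0: 3, 1: 1, 2: 2}
summaries = {1: {0: ("ino7", 0), 1: ("ino7", 1), 2: ("ino9", 0)}}
index = {"ino7": {0: (1, 0), 1: (5, 4)}, "ino9": {0: (3, 3)}}
log = []

victim = select_victim(valid_counts)
moved = clean_segment(victim, summaries, index, log)
print(victim, moved, log)  # prints: 1 1 [('ino7', 0)]
```

Only one of the three blocks in the victim still matches its parent index, so
only that block is copied; the other two are obsolete and are simply dropped.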

Key Features
============

Flash Awareness
---------------
- Enlarge the random write area for better performance, but provide high
  spatial locality
- Align FS data structures to the operational units in FTL on a best-effort basis

Wandering Tree Problem
----------------------
- Use a term, “node”, that represents inodes as well as various pointer blocks
- Introduce Node Address Table (NAT) containing the locations of all the “node”
  blocks; this will cut off the update propagation.

Cleaning Overhead
-----------------
- Support a background cleaning process
- Support greedy and cost-benefit algorithms for victim selection policies
- Support multi-head logs for static/dynamic hot and cold data separation
- Introduce adaptive logging for efficient block allocation

Mount Options
=============


======================== ============================================================
background_gc=%s         Turn on/off cleaning operations, namely garbage
                         collection, triggered in background when I/O subsystem is
                         idle. If background_gc=on, it will turn on the garbage
                         collection and if background_gc=off, garbage collection
                         will be turned off. If background_gc=sync, it will turn
                         on synchronous garbage collection running in background.
                         Default value for this option is on.
gc_merge                 When background_gc is on, this option can be enabled to
                         let the background GC thread handle foreground GC
                         requests. It can eliminate the sluggishness caused by
                         slow foreground GC operation when GC is triggered from a
                         process with limited I/O and CPU resources.
nogc_merge               Disable GC merge feature.
disable_roll_forward     Disable the roll-forward recovery routine
norecovery               Disable the roll-forward recovery routine, mounted
                         read-only (i.e., -o ro,disable_roll_forward)
discard/nodiscard        Enable/disable real-time discard in f2fs. If discard is
                         enabled, f2fs will issue discard/TRIM commands when a
                         segment is cleaned.
no_heap                  Disable heap-style segment allocation which finds free
                         segments for data from the beginning of main area, while
                         for node from the end of main area.
nouser_xattr             Disable Extended User Attributes. Note: xattr is enabled
                         by default if CONFIG_F2FS_FS_XATTR is selected.
noacl                    Disable POSIX Access Control List. Note: acl is enabled
                         by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
active_logs=%u           Support configuring the number of active logs. In the
                         current design, f2fs supports only 2, 4, and 6 logs.
                         Default number is 6.
disable_ext_identify     Disable the extension list configured by mkfs, so f2fs
                         is not aware of cold files such as media files.
inline_xattr             Enable the inline xattrs feature.
noinline_xattr           Disable the inline xattrs feature.
inline_xattr_size=%u     Support configuring inline xattr size; it depends on the
                         flexible inline xattr feature.
inline_data              Enable the inline data feature: Newly created small (<~3.4k)
                         files can be written into inode block.
inline_dentry            Enable the inline dir feature: data in newly created
                         directory entries can be written into inode block. The
                         space of inode block which is used to store inline
                         dentries is limited to ~3.4k.
noinline_dentry          Disable the inline dentry feature.
flush_merge              Merge concurrent cache_flush commands as much as possible
                         to eliminate redundant command issues. If the underlying
                         device handles the cache_flush command relatively slowly,
                         enabling this option is recommended.
nobarrier                This option can be used if the underlying storage guarantees
                         that its cached data will be written to the nonvolatile
                         area. If this option is set, no cache_flush commands are
                         issued but f2fs still guarantees the write ordering of all
                         the data writes.
fastboot                 This option is used when a system wants to reduce mount
                         time as much as possible, even though normal performance
                         can be sacrificed.
extent_cache             Enable an extent cache based on rb-tree. It can cache
                         as many extents as possible, each mapping between
                         contiguous logical and physical addresses per inode,
                         resulting in an increased cache hit ratio. Set by default.
noextent_cache           Disable an extent cache based on rb-tree explicitly, see
                         the above extent_cache mount option.
noinline_data            Disable the inline data feature; the inline data feature
                         is enabled by default.
data_flush               Enable data flushing before checkpoint in order to
                         persist data of regular files and symlinks.
reserve_root=%d          Support configuring reserved space which is used for
                         allocation from a privileged user with specified uid or
                         gid, unit: 4KB, the default limit is 0.2% of user blocks.
resuid=%d                The user ID which may use the reserved blocks.
resgid=%d                The group ID which may use the reserved blocks.
fault_injection=%d       Enable fault injection in all supported types with
                         specified injection rate.
fault_type=%d            Support configuring fault injection type. It should be
                         enabled with the fault_injection option. Fault type
                         values are shown below; single or combined types are
                         supported.

                         ===================      ===========
                         Type_Name                Type_Value
                         ===================      ===========
                         FAULT_KMALLOC            0x000000001
                         FAULT_KVMALLOC           0x000000002
                         FAULT_PAGE_ALLOC         0x000000004
                         FAULT_PAGE_GET           0x000000008
                         FAULT_ALLOC_NID          0x000000020
                         FAULT_ORPHAN             0x000000040
                         FAULT_BLOCK              0x000000080
                         FAULT_DIR_DEPTH          0x000000100
                         FAULT_EVICT_INODE        0x000000200
                         FAULT_TRUNCATE           0x000000400
                         FAULT_READ_IO            0x000000800
                         FAULT_CHECKPOINT         0x000001000
                         FAULT_DISCARD            0x000002000
                         FAULT_WRITE_IO           0x000004000
                         ===================      ===========
mode=%s                  Control block allocation mode which supports "adaptive"
                         and "lfs". In "lfs" mode, there should be no random
                         writes towards main area.
io_bits=%u               Set the bit size of write IO requests. It should be set
                         with "mode=lfs".
usrquota                 Enable plain user disk quota accounting.
grpquota                 Enable plain group disk quota accounting.
prjquota                 Enable plain project quota accounting.
usrjquota=<file>         Appoint specified file and type during mount, so that quota
grpjquota=<file>         information can be properly updated during recovery flow,
prjjquota=<file>         <quota file>: must be in root directory;
jqfmt=<quota type>       <quota type>: [vfsold,vfsv0,vfsv1].
offusrjquota             Turn off user journalled quota.
offgrpjquota             Turn off group journalled quota.
offprjjquota             Turn off project journalled quota.
quota                    Enable plain user disk quota accounting.
noquota                  Disable all plain disk quota options.
whint_mode=%s            Control which write hints are passed down to block
                         layer. This supports "off", "user-based", and
                         "fs-based". In "off" mode (default), f2fs does not pass
                         down hints. In "user-based" mode, f2fs tries to pass
                         down hints given by users. And in "fs-based" mode, f2fs
                         passes down hints with its policy.
alloc_mode=%s            Adjust block allocation policy, which supports "reuse"
                         and "default".
fsync_mode=%s            Control the policy of fsync. Currently supports "posix",
                         "strict", and "nobarrier". In "posix" mode, which is
                         default, fsync will follow POSIX semantics and does a
                         light operation to improve the filesystem performance.
                         In "strict" mode, fsync will be heavy and behaves in line
                         with xfs, ext4 and btrfs, where xfstest generic/342 will
                         pass, but the performance will regress. "nobarrier" is
                         based on "posix", but doesn't issue flush command for
                         non-atomic files, like the "nobarrier" mount option.
test_dummy_encryption
test_dummy_encryption=%s
                         Enable dummy encryption, which provides a fake fscrypt
                         context. The fake fscrypt context is used by xfstests.
                         The argument may be either "v1" or "v2", in order to
                         select the corresponding fscrypt policy version.
checkpoint=%s[:%u[%]]    Set to "disable" to turn off checkpointing.
                         Set to "enable" to reenable checkpointing. Checkpointing
                         is enabled by default. While disabled, any unmounting or
                         unexpected shutdowns will cause the filesystem contents
                         to appear as they did when the filesystem was mounted
                         with that option.
                         While mounting with checkpoint=disable, the filesystem must
                         run garbage collection to ensure that all available space can
                         be used. If this takes too much time, the mount may return
                         EAGAIN. You may optionally add a value to indicate how much
                         of the disk you would be willing to temporarily give up to
                         avoid additional garbage collection. This can be given as a
                         number of blocks, or as a percent. For instance, mounting
                         with checkpoint=disable:100% would always succeed, but it may
                         hide up to all remaining free space. The actual space that
                         would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable.
                         This space is reclaimed once checkpoint=enable is set.
checkpoint_merge         When checkpoint is enabled, this can be used to create a kernel
                         daemon and make it merge concurrent checkpoint requests as
                         much as possible to eliminate redundant checkpoint issues. Plus,
                         we can eliminate the sluggishness caused by slow checkpoint
                         operation when the checkpoint is done in a process context in
                         a cgroup having low i/o budget and cpu shares. To make this
                         do better, we set the default i/o priority of the kernel daemon
                         to "3", to give one higher priority than other kernel threads.
                         This is the same way the I/O priority is given to the jbd2
                         journaling thread of the ext4 filesystem.
nocheckpoint_merge       Disable checkpoint merge feature.
compress_algorithm=%s    Control compress algorithm, currently f2fs supports "lzo",
                         "lz4", "zstd" and "lzo-rle" algorithm.
compress_algorithm=%s:%d Control compress algorithm and its compress level, now, only
                         "lz4" and "zstd" support compress level config.

                         =========      ===========
                         algorithm      level range
                         =========      ===========
                         lz4            3 - 16
                         zstd           1 - 22
                         =========      ===========
compress_log_size=%u     Support configuring compress cluster size. The size will
                         be 4KB * (1 << %u). The minimum size is 16KB, which is
                         also the default size.
compress_extension=%s    Support adding specified extension, so that f2fs can enable
                         compression on those corresponding files, e.g. if all files
                         with '.ext' have a high compression rate, we can set '.ext'
                         on the compression extension list and enable compression on
                         these files by default rather than enabling it via ioctl.
                         For other files, we can still enable compression via ioctl.
                         Note that there is one reserved special extension '*'; it
                         can be set to enable compression for all files.
compress_chksum          Support verifying chksum of raw data in compressed cluster.
compress_mode=%s         Control file compression mode. This supports "fs" and "user"
                         modes. In "fs" mode (default), f2fs does automatic compression
                         on the compression enabled files.
                         In "user" mode, f2fs disables the automatic compression
                         and gives the user discretion of choosing the target file
                         and the timing. The user can do manual
                         compression/decompression on the compression enabled
                         files using ioctls.
compress_cache           Support using the address space of a filesystem managed
                         inode to cache compressed blocks, in order to improve the
                         cache hit ratio of random reads.
inlinecrypt              When possible, encrypt/decrypt the contents of encrypted
                         files using the blk-crypto framework rather than
                         filesystem-layer encryption. This allows the use of
                         inline encryption hardware. The on-disk format is
                         unaffected. For more details, see
                         Documentation/block/inline-encryption.rst.
atgc                     Enable age-threshold garbage collection; it provides high
                         effectiveness and efficiency on background GC.
memory=%s                Control memory mode. This supports "normal" and "low" modes.
                         "low" mode is introduced to support low memory devices.
                         Because of the nature of low memory devices, in this mode, f2fs
                         will try to save memory sometimes by sacrificing performance.
                         "normal" mode is the default mode and same as before.
age_extent_cache         Enable an age extent cache based on rb-tree. It records
                         data block update frequency of the extent per inode, in
                         order to provide better temperature hints for data block
                         allocation.
======================== ============================================================

Debugfs Entries
===============

/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
f2fs. Each file shows the whole f2fs information.

/sys/kernel/debug/f2fs/status includes:

 - major file system information managed by f2fs currently
 - average SIT information about whole segments
 - current memory footprint consumed by f2fs.

Sysfs Entries
=============

Information about mounted f2fs file systems can be found in
/sys/fs/f2fs. Each mounted filesystem will have a directory in
/sys/fs/f2fs based on its device name (e.g., /sys/fs/f2fs/sda).
The files in each per-device directory, /sys/fs/f2fs/<devname>, are
described in Documentation/ABI/testing/sysfs-fs-f2fs.

Usage
=====

1. Download userland tools and compile them.

2. Skip, if f2fs was compiled statically inside kernel.
   Otherwise, insert the f2fs.ko module::

        # insmod f2fs.ko

3. Create a directory to use when mounting::

        # mkdir /mnt/f2fs

4. Format the block device, and then mount as f2fs::

        # mkfs.f2fs -l label /dev/block_device
        # mount -t f2fs /dev/block_device /mnt/f2fs

mkfs.f2fs
---------
The mkfs.f2fs tool formats a partition as the f2fs filesystem, building a
basic on-disk layout.

The quick options consist of:

===============    ===========================================================
``-l [label]``     Give a volume label, up to 512 unicode characters.
``-a [0 or 1]``    Split start location of each area for heap-based allocation.

                   1 is set by default, which performs this.
``-o [int]``       Set overprovision ratio in percent over volume size.

                   5 is set by default.
``-s [int]``       Set the number of segments per section.

                   1 is set by default.
``-z [int]``       Set the number of sections per zone.

                   1 is set by default.
``-e [str]``       Set basic extension list. e.g. "mp3,gif,mov"
``-t [0 or 1]``    Disable discard command or not.

                   1 is set by default, which conducts discard.
===============    ===========================================================

Note: please refer to the manpage of mkfs.f2fs(8) to get full option list.
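The format and mount steps above can be scripted. The following is a minimal
sketch that only assembles the command lines and does not execute them. The
helper names ``mkfs_argv`` and ``mount_argv`` are hypothetical, and only
options listed in this document are used; check mkfs.f2fs(8) before relying on
any of them.

```python
def mkfs_argv(device, label=None, overprovision=None,
              segs_per_sec=None, secs_per_zone=None, extensions=()):
    """Build (but do not run) an mkfs.f2fs command line from the quick options."""
    argv = ["mkfs.f2fs"]
    if label is not None:
        argv += ["-l", label]
    if overprovision is not None:
        argv += ["-o", str(overprovision)]
    if segs_per_sec is not None:
        argv += ["-s", str(segs_per_sec)]
    if secs_per_zone is not None:
        argv += ["-z", str(secs_per_zone)]
    if extensions:
        argv += ["-e", ",".join(extensions)]
    return argv + [device]


def mount_argv(device, mountpoint, options=()):
    """Build (but do not run) a mount command line for an f2fs volume."""
    argv = ["mount", "-t", "f2fs"]
    if options:
        argv += ["-o", ",".join(options)]
    return argv + [device, mountpoint]


print(" ".join(mkfs_argv("/dev/block_device", label="label", segs_per_sec=2)))
# prints: mkfs.f2fs -l label -s 2 /dev/block_device
print(" ".join(mount_argv("/dev/block_device", "/mnt/f2fs",
                          ["background_gc=on", "active_logs=6"])))
# prints: mount -t f2fs -o background_gc=on,active_logs=6 /dev/block_device /mnt/f2fs
```

Returning an argument list rather than a shell string means a caller could
pass it to ``subprocess.run`` without shell quoting concerns.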

fsck.f2fs
---------
The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not.
Note that the initial version of the tool does not fix any inconsistency.

The quick options consist of::

  -d debug level [default:0]

Note: please refer to the manpage of fsck.f2fs(8) to get full option list.

dump.f2fs
---------
The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem.
It shows on-disk inode information recognized by a given inode number, and is
able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
./dump_sit respectively.

The options consist of::

  -d debug level [default:0]
  -i inode no (hex)
  -s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
  -a [SSA dump segno from #1~#2 (decimal), for all 0~-1]

Examples::

    # dump.f2fs -i [ino] /dev/sdx
    # dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
    # dump.f2fs -a 0~-1 /dev/sdx (SSA dump)

Note: please refer to the manpage of dump.f2fs(8) to get full option list.

sload.f2fs
----------
The sload.f2fs gives a way to insert files and directories into an existing disk
image. This tool is useful when building f2fs images given compiled files.

Note: please refer to the manpage of sload.f2fs(8) to get full option list.

resize.f2fs
-----------
The resize.f2fs lets a user resize the f2fs-formatted disk image, while preserving
all the files and directories stored in the image.

Note: please refer to the manpage of resize.f2fs(8) to get full option list.

defrag.f2fs
-----------
The defrag.f2fs can be used to defragment scattered written data as well as
filesystem metadata across the disk. This can improve the write speed by giving
more free consecutive space.

Note: please refer to the manpage of defrag.f2fs(8) to get full option list.

f2fs_io
-------
The f2fs_io is a simple tool to issue various filesystem APIs as well as
f2fs-specific ones, which is very useful for QA tests.

Note: please refer to the manpage of f2fs_io(8) to get full option list.

Design
======

On-disk Layout
--------------

F2FS divides the whole volume into a number of segments, each of which is fixed
to 2MB in size. A section is composed of consecutive segments, and a zone
consists of a set of sections. By default, section and zone sizes are both set
to one segment size, but users can easily modify the sizes with mkfs.

F2FS splits the entire volume into six areas, and all the areas except superblock
consist of multiple segments as described below::

                                            align with the zone size <-|
                 |-> align with the segment size
     _________________________________________________________________________
    |            |            |   Segment   |    Node     |   Segment  |      |
    | Superblock | Checkpoint |    Info.    |   Address   |   Summary  | Main |
    |    (SB)    |   (CP)     | Table (SIT) | Table (NAT) | Area (SSA) |      |
    |____________|_____2______|______N______|______N______|______N_____|__N___|
                                                                       .      .
                                                           .                  .
                                               .                              .
                                    ._________________________________________.
                                    |_Segment_|_..._|_Segment_|_..._|_Segment_|
                                    .           .
                                    ._________._________
                                    |_section_|__...__|_
                                    .            .
                                    .________.
                                    |__zone__|

- Superblock (SB)
   It is located at the beginning of the partition, and two copies exist
   to avoid file system crash. It contains basic partition information and some
   default parameters of f2fs.

- Checkpoint (CP)
   It contains file system information, bitmaps for valid NAT/SIT sets, orphan
   inode lists, and summary entries of current active segments.

- Segment Information Table (SIT)
   It contains segment information such as valid block count and bitmap for the
   validity of all the blocks.

- Node Address Table (NAT)
   It is composed of a block address table for all the node blocks stored in
   Main area.

- Segment Summary Area (SSA)
   It contains summary entries which contain the owner information of all the
   data and node blocks stored in Main area.

- Main Area
   It contains file and directory data including their indices.

In order to avoid misalignment between file system and flash-based storage, F2FS
aligns the start block address of CP with the segment size. Also, it aligns the
start block address of Main area with the zone size by reserving some segments
in SSA area.

Refer to the following survey for additional technical details.
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey

File System Metadata Structure
------------------------------

F2FS adopts the checkpointing scheme to maintain file system consistency. At
mount time, F2FS first tries to find the last valid checkpoint data by scanning
the CP area. In order to reduce the scanning time, F2FS uses only two copies of
CP. One of them always indicates the last valid data; this is called the shadow
copy mechanism. In addition to CP, NAT and SIT also adopt the shadow copy
mechanism.

For file system consistency, each CP points to which NAT and SIT copies are
valid, as shown below::

                                           +--------+----------+---------+
                                           |   CP   |    SIT   |   NAT   |
                                           +--------+----------+---------+
                                           .         .          .          .
                                .            .              .              .
                     .               .                 .                 .
    +-------+-------+--------+--------+--------+--------+
    | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 |
    +-------+-------+--------+--------+--------+--------+
       |             ^                          ^
       |             |                          |
       `----------------------------------------'

Index Structure
---------------

The key data structure to manage the data locations is a "node". Similar to
traditional file structures, F2FS has three types of node: inode, direct node,
and indirect node.
F2FS assigns 4KB to an inode block which contains 923 data block
indices, two direct node pointers, two indirect node pointers, and one double
indirect node pointer as described below. One direct node block contains 1018
data block indices, and one indirect node block likewise contains 1018 node
block indices. Thus, one inode block (i.e., a file) covers::

  4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.

   Inode block (4KB)
     |- data (923)
     |- direct node (2)
     |          `- data (1018)
     |- indirect node (2)
     |            `- direct node (1018)
     |                       `- data (1018)
     `- double indirect node (1)
                         `- indirect node (1018)
                                      `- direct node (1018)
                                                 `- data (1018)

Note that all the node blocks are mapped by NAT, which means the location of
each node is translated by the NAT table. This addresses the wandering tree
problem: F2FS is able to cut off the propagation of node updates caused by
leaf data writes.

Directory Structure
-------------------

A directory entry occupies 11 bytes and consists of the following attributes.

- hash          hash value of the file name
- ino           inode number
- len           the length of file name
- type          file type such as directory, symlink, etc.

A dentry block consists of 214 dentry slots and file names.
A bitmap in the block
records whether each dentry is valid or not. A dentry block occupies
4KB with the following composition.

::

  Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
                      dentries(11 * 214 bytes) + file name (8 * 214 bytes)

                         [Bucket]
             +--------------------------------+
             |dentry block 1 | dentry block 2 |
             +--------------------------------+
             .               .
       .                             .
  .       [Dentry Block Structure: 4KB]       .
  +--------+----------+----------+------------+
  | bitmap | reserved | dentries | file names |
  +--------+----------+----------+------------+
  [Dentry Block: 4KB] .   .
                 .               .
            .                          .
            +------+------+-----+------+
            | hash | ino  | len | type |
            +------+------+-----+------+
            [Dentry Structure: 11 bytes]

F2FS implements multi-level hash tables for directory structure. Each level has
a hash table with a dedicated number of hash buckets as shown below. Note that
"A(2B)" means a bucket includes 2 data blocks.

::

    ----------------------
    A : bucket
    B : block
    N : MAX_DIR_HASH_DEPTH
    ----------------------

    level #0   | A(2B)
               |
    level #1   | A(2B) - A(2B)
               |
    level #2   | A(2B) - A(2B) - A(2B) - A(2B)
         .     |   .       .       .       .
    level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
         .     |   .       .       .       .
    level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)

The numbers of blocks and buckets are determined by::

                            ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
  # of blocks in level #n = |
                            `- 4, Otherwise

                             ,- 2^(n + dir_level),
                             |        if n + dir_level < MAX_DIR_HASH_DEPTH / 2,
  # of buckets in level #n = |
                             `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1),
                                      Otherwise

When F2FS looks up a file name in a directory, it first calculates a hash value
of the file name. Then, F2FS scans the hash table in level #0 to find the
dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans hash tables in
each level incrementally from 1 to N.
In each level F2FS needs to scan only
one bucket determined by the following equation, which gives O(log(# of files))
complexity::

  bucket number to scan in level #n = (hash value) % (# of buckets in level #n)

In the case of file creation, F2FS finds empty consecutive slots that cover the
file name. F2FS searches for the empty slots in the hash tables of all levels
from 1 to N in the same way as the lookup operation.

The following figure shows an example of two cases holding children::

       --------------> Dir <--------------
       |                                 |
    child                             child

    child - child                     [hole] - child

    child - child - child             [hole] - [hole] - child

   Case 1:                           Case 2:
   Number of children = 6,           Number of children = 3,
   File size = 7                     File size = 7

Default Block Allocation
------------------------

At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
and Hot/Warm/Cold data.

- Hot node contains direct node blocks of directories.
- Warm node contains direct node blocks except hot node blocks.
- Cold node contains indirect node blocks.
- Hot data contains dentry blocks.
- Warm data contains data blocks except hot and cold data blocks.
- Cold data contains multimedia data or migrated data blocks.

LFS has two schemes for free space management: threaded log and
copy-and-compaction. The copy-and-compaction scheme, known as cleaning, is well
suited for devices showing very good sequential write performance, since free
segments are always available for writing new data. However, it suffers from
cleaning overhead under high utilization. In contrast, the threaded log scheme
suffers from random writes, but no cleaning process is needed. F2FS adopts a
hybrid scheme where the copy-and-compaction scheme is adopted by default, but
the policy is dynamically changed to the threaded log scheme according to the
file system status.

In order to align F2FS with underlying flash-based storage, F2FS allocates
segments in units of sections. F2FS expects that the section size would be the
same as the unit size of garbage collection in FTL. Furthermore, with respect
to the mapping granularity in FTL, F2FS allocates each section of the active
logs from different zones as much as possible, since FTL can write the data in
the active logs into one allocation unit according to its mapping granularity.

Cleaning process
----------------

F2FS does cleaning both on demand and in the background.
On-demand cleaning is
triggered when there are not enough free segments to serve VFS calls. The
background cleaner is run by a kernel thread, and triggers the cleaning job
when the system is idle.

F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
In the greedy algorithm, F2FS selects a victim segment having the smallest number
of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
according to the segment age and the number of valid blocks in order to address
the log block thrashing problem in the greedy algorithm. F2FS adopts the greedy
algorithm for the on-demand cleaner, while the background cleaner adopts the
cost-benefit algorithm.

In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering all blocks in the Main area.

Write-hint Policy
-----------------

1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given by
users.

===================== ======================== ===================
User                  F2FS                     Block
===================== ======================== ===================
N/A                   META                     WRITE_LIFE_NOT_SET
N/A                   HOT_NODE                 "
N/A                   WARM_NODE                "
N/A                   COLD_NODE                "
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
===================== ======================== ===================

3) whint_mode=fs-based. F2FS passes down hints with its policy.

===================== ======================== ===================
User                  F2FS                     Block
===================== ======================== ===================
N/A                   META                     WRITE_LIFE_MEDIUM
N/A                   HOT_NODE                 WRITE_LIFE_NOT_SET
N/A                   WARM_NODE                "
N/A                   COLD_NODE                WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
===================== ======================== ===================

Fallocate(2) Policy
-------------------

The default policy follows the POSIX rule below.

Allocating disk space
    The default operation (i.e., mode is zero) of fallocate() allocates
    the disk space within the range specified by offset and len.
    The file size (as reported by stat(2)) will be changed if offset+len is
    greater than the file size. Any subregion within the range specified
    by offset and len that did not contain data before the call will be
    initialized to zero. This default behavior closely resembles the
    behavior of the posix_fallocate(3) library function, and is intended
    as a method of optimally implementing that function.

However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) prior to
fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses having
zero or random data, which is useful in the scenario below:

 1. create(fd)
 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
 3. fallocate(fd, 0, 0, size)
 4. address = fibmap(fd, offset)
 5. open(blkdev)
 6. write(blkdev, address)

Compression implementation
--------------------------

- A new term, cluster, is defined as the basic unit of compression; a file can
  be divided into multiple clusters logically. One cluster includes 4 << n
  (n >= 0) logical pages, the compression size is also the cluster size, and
  each cluster can be compressed or not.

- In the cluster metadata layout, one special block address is used to indicate
  whether a cluster is a compressed one or a normal one; for a compressed
  cluster, the following metadata maps the cluster to [1, 4 << n - 1] physical
  blocks, where f2fs stores data including the compress header and compressed
  data.

- In order to eliminate write amplification during overwrite, F2FS only
  supports compression on write-once files; data can be compressed only when
  all logical blocks in the cluster contain valid data and the compress ratio
  of the cluster data is lower than the specified threshold.

- To enable compression on a regular inode, there are four ways:

  * chattr +c file
  * chattr +c dir; touch dir/file
  * mount w/ -o compress_extension=ext; touch file.ext
  * mount w/ -o compress_extension=*; touch any_file

- At this point, the compression feature doesn't expose compressed space to the
  user directly, in order to guarantee potential data updates later to the
  space. Instead, the main goal is to reduce data writes to the flash disk as
  much as possible, resulting in extending disk life time as well as relaxing
  IO congestion. Alternatively, we've added an ioctl interface to reclaim
  compressed space and show it to the user after setting the immutable bit.

Compress metadata layout::

                                [Dnode Structure]
                +-----------------------------------------------+
                | cluster 1 | cluster 2 | ......... | cluster N |
                +-----------------------------------------------+
                .           .                       .           .
          .                      .                .                      .
    .         Compressed Cluster       .        .        Normal Cluster            .
    +----------+---------+---------+---------+  +---------+---------+---------+---------+
    |compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
    +----------+---------+---------+---------+  +---------+---------+---------+---------+
               .                             .
            .                                           .
        .                                                           .
        +-------------+-------------+----------+----------------------------+
        | data length | data chksum | reserved |      compressed data       |
        +-------------+-------------+----------+----------------------------+

Compression mode
----------------

f2fs supports "fs" and "user" compression modes with the "compress_mode" mount
option. With this option, f2fs lets the user choose how compression-enabled
files are compressed (refer to the "Compression implementation" section for how
to enable compression on a regular inode).

1) compress_mode=fs
This is the default option. f2fs does automatic compression in the writeback of the
compression enabled files.

2) compress_mode=user
This disables the automatic compression and gives the user discretion of choosing the
target file and the timing.
The user can do manual compression/decompression on the
compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
ioctls as shown below.

To decompress a file::

  fd = open(filename, O_WRONLY, 0);
  ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);

To compress a file::

  fd = open(filename, O_WRONLY, 0);
  ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);

NVMe Zoned Namespace devices
----------------------------

- ZNS defines a per-zone capacity which can be equal to or less than the
  zone-size. Zone-capacity is the number of usable blocks in the zone.
  F2FS checks whether zone-capacity is less than zone-size; if it is, any
  segment which starts after the zone-capacity is marked as not-free in
  the free segment bitmap at initial mount time. These segments are marked
  as permanently used so they are not allocated for writes and
  consequently do not need to be garbage collected. If the
  zone-capacity is not aligned to the default segment size (2MB), a segment
  can start before the zone-capacity and span across the zone-capacity boundary.
  Such spanning segments are also considered usable segments. All blocks
  past the zone-capacity are considered unusable in these segments.
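
The zone-capacity rule above can be illustrated with a small Python sketch. It
is not kernel code: the helper name ``usable_blocks_in_segment`` and the example
zone geometry are hypothetical, and treating a segment that starts exactly at
the zone-capacity as unusable is an assumption. Only the 2MB segment size and
the 4KB block size are taken from this document.

```python
BLOCK_SIZE = 4 * 1024                        # 4KB block
SEGMENT_SIZE = 2 * 1024 * 1024               # 2MB segment
BLOCKS_PER_SEG = SEGMENT_SIZE // BLOCK_SIZE  # 512 blocks per segment


def usable_blocks_in_segment(seg_start, zone_capacity,
                             blocks_per_seg=BLOCKS_PER_SEG):
    """Return the number of usable blocks in one segment.

    seg_start and zone_capacity are block offsets relative to the start of
    the zone. Hypothetical helper for illustration, not a kernel function.
    """
    if seg_start >= zone_capacity:
        # Segment lies entirely past the zone-capacity (assumed to include
        # the case where it starts exactly at the capacity): not usable.
        return 0
    # A segment spanning the boundary keeps only the blocks before it.
    return min(blocks_per_seg, zone_capacity - seg_start)


# Hypothetical zone: 4 segments long (2048 blocks) with 1300 usable blocks.
zone_size = 4 * BLOCKS_PER_SEG
zone_capacity = 1300
usable = [usable_blocks_in_segment(start, zone_capacity)
          for start in range(0, zone_size, BLOCKS_PER_SEG)]
print(usable)
```

In this example the third segment spans the zone-capacity boundary and keeps
276 usable blocks, while the fourth segment starts past the zone-capacity and
would be marked as not-free.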