.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk I/O. On an SSD there are of course no moving
parts, but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until all
the dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.
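
The benefit of delayed allocation can be illustrated with a toy model.
The sketch below is not how ext4 is implemented; the function names and
the simple "next free block" allocator are assumptions made purely for
illustration. It compares placing blocks as each write arrives with
placing them only once all dirty data for a file is known.

```python
# Toy model only: real ext4 tracks extents, reservations, and free-space
# structures that are not represented here. All names are hypothetical.

def count_extents(blocks):
    """Count runs of consecutive block numbers in an ordered block list."""
    if not blocks:
        return 0
    return 1 + sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur != prev + 1)


def eager(writes):
    """Assign the next free block at the moment each write arrives."""
    next_free = 0
    layout = {}
    for name, nblocks in writes:
        for _ in range(nblocks):
            layout.setdefault(name, []).append(next_free)
            next_free += 1
    return layout


def delayed(writes):
    """Only count dirty blocks per file, then place each file at flush time."""
    pending = {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks
    next_free = 0
    layout = {}
    for name, total in pending.items():
        layout[name] = list(range(next_free, next_free + total))
        next_free += total
    return layout


# Two files written one block at a time, interleaved.
writes = [("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("b", 1)]
eager_layout = eager(writes)
delayed_layout = delayed(writes)
print({n: count_extents(b) for n, b in eager_layout.items()})
print({n: count_extents(b) for n, b in delayed_layout.items()})
```

In this model the eager allocator interleaves the two files block by
block, while the delayed allocator gives each file one contiguous run.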

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.
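
Whether an inode and a data block share a block group is a matter of
simple arithmetic on the superblock geometry. The sketch below assumes
example values (8192 inodes per group, 32768 blocks per group, first
data block 0); on a real filesystem these would be read from the
superblock, and the helper names are hypothetical.

```python
# Assumed example geometry; real values come from the superblock fields
# s_inodes_per_group, s_blocks_per_group and s_first_data_block.
INODES_PER_GROUP = 8192
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0  # 0 for block sizes above 1KiB, 1 for 1KiB blocks


def inode_group(ino):
    """Block group holding inode number `ino` (inode numbers start at 1)."""
    return (ino - 1) // INODES_PER_GROUP


def block_group(block):
    """Block group containing filesystem block number `block`."""
    return (block - FIRST_DATA_BLOCK) // BLOCKS_PER_GROUP


def is_local(ino, block):
    """True if the data block lives in the same group as the inode."""
    return inode_group(ino) == block_group(block)


print(inode_group(8193), block_group(40000), is_local(8193, 40000))
```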

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related, so it is
useful to try to keep them all together.
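
A deliberately simplified sketch of "same group as the directory, when
feasible" follows. The fallback used here (scan forward to the next
group with a free inode) is an assumption for illustration; the kernel's
actual search order is more involved, and the function name is
hypothetical.

```python
def pick_inode_group(parent_group, free_inodes):
    """Choose a block group for a new file's inode.

    parent_group: group number holding the parent directory's inode.
    free_inodes:  list of free inode counts, one entry per block group.
    Returns the parent's group when it has room, otherwise the next group
    (wrapping around) with a free inode, or None if the table is full.
    """
    ngroups = len(free_inodes)
    for offset in range(ngroups):
        group = (parent_group + offset) % ngroups
        if free_inodes[group] > 0:
            return group
    return None


print(pick_inode_group(2, [5, 0, 3, 1]))  # parent's group has room
print(pick_inode_group(1, [5, 0, 3, 1]))  # parent's group is full
```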

The fifth trick is that the disk volume is cut up into 128MiB block
groups (at the default 4KiB block size); these mini-containers are used
as outlined above to try to maintain data locality. However, there is a
deliberate quirk -- when a directory is created in the root directory,
the inode allocator scans the block groups and puts that directory into
the least heavily loaded block group that it can find. This encourages
directories to spread out over a disk; as the top-level directory/file
blobs fill up one block group, the allocators simply move on to the next
block group. Allegedly this scheme evens out the loading on the block
groups, though the author suspects that the directories which are so
unlucky as to land towards the end of a spinning drive get a raw deal
performance-wise.
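
The block group size follows from the block bitmap occupying a single
block: one block of bitmap covers 8 bits per byte, so a group spans
8 * block_size blocks (assuming the default layout without bigalloc).
The second function is a toy reading of "least heavily loaded": it
treats load as free block count only, which is an assumption for
illustration rather than the kernel's actual heuristic. Both function
names are hypothetical.

```python
def block_group_bytes(block_size):
    """Size in bytes of one block group for a given block size."""
    blocks_per_group = 8 * block_size  # one bitmap block, 8 bits per byte
    return blocks_per_group * block_size


def pick_top_level_dir_group(free_blocks, free_inodes):
    """Toy choice of group for a directory created in the root directory.

    Picks the group with the most free blocks among groups that still
    have a free inode; ties go to the lowest-numbered group.
    """
    candidates = [g for g in range(len(free_blocks)) if free_inodes[g] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda g: (free_blocks[g], -g))


print(block_group_bytes(4096) // (1024 * 1024), "MiB per group")
print(pick_top_level_dir_group([100, 900, 900, 50], [10, 10, 10, 10]))
```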

Of course, if all of these mechanisms fail, one can always use e4defrag
to defragment files.