.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock starts at byte offset 1024, whichever block that happens
to be (usually block 0). However, if the block size is 1024 bytes, then
block 0 is marked in use and the superblock goes in block 1. For all
other block groups, there is no padding.

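The block that holds the primary superblock follows directly from the
1024-byte padding. A minimal sketch of that arithmetic is shown here;
the function name ``superblock_location`` is hypothetical and not part of
any ext4 tool.

```python
SUPERBLOCK_OFFSET = 1024  # bytes of padding in front of the primary superblock


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the superblock.

    Hypothetical helper for illustration only.
    """
    if block_size < 1024 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two >= 1024")
    # Integer division gives the block, the remainder gives the offset in it.
    return divmod(SUPERBLOCK_OFFSET, block_size)


for size in (1024, 2048, 4096):
    block, offset = superblock_location(size)
    print(f"block size {size}: superblock in block {block}, offset {offset}")
```

With a 1024-byte block size the result is block 1, offset 0; with any
larger block size the superblock sits at offset 1024 inside block 0.
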
The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.

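As a rough illustration of the size formula above, the number of blocks
occupied by one inode table can be computed by rounding the byte count up
to whole blocks. The helper name and the sample values (8192 inodes per
group, 256-byte inodes, 4096-byte blocks) are assumptions chosen for the
example, not values taken from this document.

```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Blocks needed to hold s_inodes_per_group * s_inode_size bytes.

    Hypothetical helper; the arguments mirror the superblock fields.
    """
    table_bytes = inodes_per_group * inode_size
    # Ceiling division: a partially filled last block still costs a block.
    return -(-table_bytes // block_size)


# Illustrative values only.
print(inode_table_blocks(8192, 256, 4096))
```
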
As for the ordering of items in a block group, it is generally
established that the superblock and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex\_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex\_bg). In a flex\_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex\_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex\_bg. For example,
if the flex\_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex\_bg is enabled. The number of block groups that make up a
flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.

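The grouping described above reduces to a pair of shifts. The sketch
below assumes only the ``s_log_groups_per_flex`` field mentioned in the
text; the function names are hypothetical.

```python
def groups_per_flex(s_log_groups_per_flex):
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(group, s_log_groups_per_flex):
    """Index of the flex_bg that a given block group belongs to."""
    return group >> s_log_groups_per_flex


def first_group_of_flex(group, s_log_groups_per_flex):
    """First block group of that flex_bg, where its packed metadata starts."""
    return flex_group_of(group, s_log_groups_per_flex) << s_log_groups_per_flex


# With s_log_groups_per_flex = 2, the flex_bg size is 4 as in the example above.
print(groups_per_flex(2))
print(flex_group_of(5, 2), first_group_of_flex(5, 2))
```

With a flex\_bg size of 4, block group 5 belongs to flex\_bg 1, whose
first block group is group 4.
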
Meta Block Groups
-----------------

Without the option META\_BG, for safety concerns, all block group
descriptor copies are kept in the first block group. Given the default
128 MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48 bytes or 256 TiB.

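The limit above can be checked with a few lines of integer arithmetic.
This is only a restatement of the numbers in the paragraph; the function
name is hypothetical.

```python
def limits_without_meta_bg(group_bytes=1 << 27, desc_size=64):
    """Return (max block groups, max filesystem bytes) without META_BG.

    All descriptors must fit in the first block group, so the group count
    is bounded by group_bytes / desc_size.
    """
    max_groups = group_bytes // desc_size
    return max_groups, max_groups * group_bytes


groups, fs_bytes = limits_without_meta_bg()
print(groups == 2**21, fs_bytes == 2**48)
print(fs_bytes // 2**40, "TiB")
```
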
The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
META\_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with a 4 KB block size and 64-byte group descriptors, a
single metablock group partition includes 64 block groups, or 8 GiB of
disk space. The metablock group feature moves the location of the group
descriptors from the congested first block group of the whole filesystem
into the first group of each metablock group itself. The backups are in
the second and last group of each metablock group. This increases the
2^21 maximum block groups limit to the hard limit 2^32, allowing support
for a 512 PiB filesystem.

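The capacity figures in this paragraph follow from the block size, the
descriptor size, and the 128 MiB default block group size quoted earlier.
A small sketch, with hypothetical function names:

```python
GROUP_BYTES = 1 << 27  # default 128 MiB block group, as stated earlier


def groups_per_meta_bg(block_size, desc_size):
    """Block groups whose descriptors fit in a single disk block."""
    return block_size // desc_size


def meta_bg_span_bytes(block_size, desc_size, group_bytes=GROUP_BYTES):
    """Disk space covered by one metablock group."""
    return groups_per_meta_bg(block_size, desc_size) * group_bytes


print(groups_per_meta_bg(4096, 64))
print(meta_bg_span_bytes(4096, 64) // 2**30, "GiB")
# Hard limit of 2^32 block groups at 128 MiB each, expressed in PiB.
print((2**32 * GROUP_BYTES) // 2**50, "PiB")
```
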
The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 32
bytes when the 64bit feature is not enabled, a meta-block group contains
32 block groups for filesystems with a 1 KB block size, and 128 block
groups for filesystems with a 4 KB block size. Filesystems can either be
created using this new block group descriptor layout, or existing
filesystems can be resized on-line, and the field s\_first\_meta\_bg in
the superblock will indicate the first block group using this new
layout.

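The placement rule above (first, second, and last block group of each
meta-block group) can be sketched as follows. The function name and its
arguments are hypothetical, and the sketch ignores a final, partially
filled meta-block group at the end of the filesystem.

```python
def descriptor_block_groups(meta_group, block_size, desc_size=32):
    """Block groups that hold the descriptor block of one meta-block group.

    Returns (first, second, last) block group numbers. Illustration only;
    it does not handle a truncated last meta-block group.
    """
    per_meta = block_size // desc_size  # block groups per meta-block group
    first = meta_group * per_meta
    return first, first + 1, first + per_meta - 1


# 32-byte descriptors: 32 groups per meta-block group at 1 KB blocks,
# 128 groups per meta-block group at 4 KB blocks.
print(descriptor_block_groups(0, 1024))
print(descriptor_block_groups(1, 4096))
```
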
Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

New in ext4 are three block group descriptor flags that enable mkfs to
skip initializing parts of the block group metadata. Specifically, the
INODE\_UNINIT and BLOCK\_UNINIT flags mean that the inode and block
bitmaps for that group can be calculated and therefore the on-disk
bitmap blocks are not initialized. This is generally the case for an
empty block group or a block group containing only fixed-location block
group metadata. The INODE\_ZEROED flag means that the inode table has
been initialized; mkfs will unset this flag and rely on the kernel to
initialize the inode tables in the background.

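A descriptor's flags field can be decoded with simple bit tests. The
numeric values below are the ``EXT4_BG_*`` constants as defined in the
kernel's ext4 header; this section does not list them, so treat them as
an assumption to verify against your kernel source. The function name is
hypothetical.

```python
# Assumed values, taken from the kernel's EXT4_BG_* definitions.
EXT4_BG_INODE_UNINIT = 0x0001  # inode bitmap not initialized on disk
EXT4_BG_BLOCK_UNINIT = 0x0002  # block bitmap not initialized on disk
EXT4_BG_INODE_ZEROED = 0x0004  # inode table has been zeroed


def describe_bg_flags(bg_flags):
    """Return the names of the lazy-initialization flags set in bg_flags."""
    known = (
        (EXT4_BG_INODE_UNINIT, "INODE_UNINIT"),
        (EXT4_BG_BLOCK_UNINIT, "BLOCK_UNINIT"),
        (EXT4_BG_INODE_ZEROED, "INODE_ZEROED"),
    )
    return [name for bit, name in known if bg_flags & bit]


# A group whose bitmaps are both uninitialized and whose inode table
# has not been zeroed yet.
print(describe_bg_flags(0x0003))
```
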
By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
but the dumpe2fs output prints this as “uninit\_bg”. They are the same
thing.