.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock starts at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if the block size is 1024 bytes,
then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0.
Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to grow to 1024 times its original
size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a continuous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.

As for the ordering of items in a block group, it is generally
established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex\_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.
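
The superblock placement and the inode table size described above come
down to simple arithmetic. The following Python sketch is only an
illustration: the function names are hypothetical, and the sample values
in the usage note (8192 inodes per group, 256-byte inodes) are assumed
for the example rather than taken from this document.

```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of the filesystem


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the
    primary superblock for the given block size in bytes."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Return the number of blocks needed to hold
    s_inodes_per_group * s_inode_size bytes, rounded up."""
    table_bytes = inodes_per_group * inode_size
    return -(-table_bytes // block_size)  # ceiling division
```

With a 4096-byte block size, ``superblock_location(4096)`` gives block 0
at byte offset 1024; with a 1024-byte block size it gives block 1 at
offset 0. Under the assumed sample values,
``inode_table_blocks(8192, 256, 4096)`` evaluates to 512 blocks.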

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex\_bg). In a flex\_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex\_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex\_bg. For example,
if the flex\_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex\_bg is enabled. The number of block groups that make up a
flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.

Meta Block Groups
-----------------

Without the META\_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 ∗ 2^27 = 2^48 bytes or 256TiB.
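
The limit quoted above can be checked with a few lines of Python. This
is a sketch that follows the simplification used in the text, namely
that the descriptor table may fill the whole first block group; the
function names are hypothetical.

```python
def max_groups_without_meta_bg(group_size_bytes, desc_size):
    """Number of group descriptors that fit in one block group."""
    return group_size_bytes // desc_size


def max_fs_bytes_without_meta_bg(group_size_bytes, desc_size):
    """Largest filesystem size when every descriptor must live in
    the first block group."""
    groups = max_groups_without_meta_bg(group_size_bytes, desc_size)
    return groups * group_size_bytes


GROUP_SIZE = 2 ** 27  # 128MiB block group, as in the text
DESC_SIZE = 64        # 64-byte group descriptor, as in the text
TIB = 2 ** 40
```

Here ``max_groups_without_meta_bg(GROUP_SIZE, DESC_SIZE)`` is 2^21 and
``max_fs_bytes_without_meta_bg(GROUP_SIZE, DESC_SIZE) // TIB`` is 256,
matching the 256TiB figure.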

The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
META\_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size and 64-byte group descriptors, a single
metablock group partition includes 64 block groups, or 8 GiB of disk
space. The metablock group feature moves the location of the group
descriptors from the congested first block group of the whole filesystem
into the first group of each metablock group itself. The backups are in
the second and last group of each metablock group. This increases the
2^21 maximum block groups limit to the hard limit 2^32, allowing support
for a 512PiB filesystem.

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 32
bytes, a meta-block group contains 32 block groups for filesystems with
a 1KB block size, and 128 block groups for filesystems with a 4KB
block size.
Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
the field s\_first\_meta\_bg in the superblock will indicate the first
block group using this new layout.

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing parts of the block group
metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE\_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
but the dumpe2fs output prints this as “uninit\_bg”. They are the same
thing.
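
The meaning of the three flags can be illustrated with a small Python
sketch. The bit values used here (0x1, 0x2, 0x4) are not given in this
section; they are assumed from the ``bg_flags`` description of the block
group descriptor and should be checked against the kernel headers. The
function name and the dictionary keys are hypothetical.

```python
# Assumed bg_flags bit values; verify against the group descriptor docs.
INODE_UNINIT = 0x1  # inode bitmap is calculated, not read from disk
BLOCK_UNINIT = 0x2  # block bitmap is calculated, not read from disk
INODE_ZEROED = 0x4  # inode table has been zeroed


def describe_bg_flags(bg_flags):
    """Summarize which parts of a block group's metadata have been
    initialized on disk, based on its bg_flags value."""
    return {
        "inode_bitmap_initialized": not bg_flags & INODE_UNINIT,
        "block_bitmap_initialized": not bg_flags & BLOCK_UNINIT,
        "inode_table_zeroed": bool(bg_flags & INODE_ZEROED),
    }
```

For an empty group left untouched by mkfs, one would expect both UNINIT
flags to be set and INODE\_ZEROED to be clear, so
``describe_bg_flags(0x3)`` reports all three items as ``False``.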