.. SPDX-License-Identifier: GPL-2.0

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes; however, it
is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so the more analysis there is to
do, the more likely it is that the cause will be lost in the noise. Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify if it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes, it
is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding things like cross linked block lists (e.g. sibling pointers in a
btree that end up with loops in them) is the key to understanding what went
wrong, but it is impossible to tell after the fact what order the blocks were
linked into each other or written to disk.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable.  Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.
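
As a rough illustration of that magic number test, the following userspace
sketch classifies a block by its first four bytes. The helper names and the enum
are hypothetical, and the two magic values are only used as an example of an
old/new pair (they are assumed here to be the by-block free space btree magics);
this is not the kernel's verifier code:

```c
#include <stdint.h>

/* Example old/new magic pair; treat the values as illustrative. */
#define SKETCH_ABTB_MAGIC	0x41425442u	/* "ABTB": original format */
#define SKETCH_ABTB_CRC_MAGIC	0x41423342u	/* "AB3B": self describing format */

enum sketch_format {
	SKETCH_UNKNOWN,		/* not a block type we recognise */
	SKETCH_LEGACY,		/* old magic: no extra identifying fields */
	SKETCH_SELF_DESCRIBING,	/* new magic: CRC, UUID, owner, etc. present */
};

/* On-disk values are big-endian, so assemble the magic byte by byte. */
static uint32_t
sketch_get_be32(
	const uint8_t	*p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static enum sketch_format
sketch_classify(
	const uint8_t	*blk)
{
	switch (sketch_get_be32(blk)) {
	case SKETCH_ABTB_CRC_MAGIC:
		return SKETCH_SELF_DESCRIBING;
	case SKETCH_ABTB_MAGIC:
		return SKETCH_LEGACY;
	default:
		return SKETCH_UNKNOWN;
	}
}
```

Only a block that classifies as self describing can go on to the CRC, location
and owner checks; a legacy block is limited to the existing range checks.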

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.
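
For readers unfamiliar with the algorithm, here is a minimal bitwise CRC32c in
plain C. It is a sketch for illustration only: the function names are
hypothetical, and the kernel uses its own table-driven or hardware accelerated
crc32c library rather than a loop like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define SKETCH_CRC32C_POLY	0x82F63B78u

/* Fold len bytes into a running (uninverted) CRC32c state. */
static uint32_t
sketch_crc32c_update(
	uint32_t	crc,
	const void	*data,
	size_t		len)
{
	const uint8_t	*p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^
			      (SKETCH_CRC32C_POLY & (0u - (crc & 1)));
	}
	return crc;
}

/* Conventional CRC32c: all-ones seed, result inverted at the end. */
static uint32_t
sketch_crc32c(
	const void	*data,
	size_t		len)
{
	return ~sketch_crc32c_update(0xFFFFFFFFu, data, len);
}
```

With these conventions the standard check string "123456789" yields 0xE3069283,
and flipping any single bit of the input changes the result.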

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.
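
The wrong-LUN case can be sketched in a few lines of C. The structure and
function below are hypothetical and exist only to show why both fields are
needed; they do not correspond to any kernel type:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical location stamp carried inside a metadata block. */
struct sketch_location {
	uint8_t		fs_uuid[16];	/* which filesystem the block belongs to */
	uint64_t	blkno;		/* where in that filesystem it should be */
};

/*
 * Compare the stamp against the filesystem we mounted and the block number
 * we actually read from.
 */
static bool
sketch_location_ok(
	const struct sketch_location	*stamp,
	const uint8_t			mounted_uuid[16],
	uint64_t			read_blkno)
{
	/* Right block number on the wrong filesystem is still wrong. */
	if (memcmp(stamp->fs_uuid, mounted_uuid, 16) != 0)
		return false;
	return stamp->blkno == read_blkno;
}
```

A block number check alone would accept a block written to block 128 of a
different filesystem; the UUID comparison is what rejects it.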

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, that it is valid
and/or corrupted, and how long ago it was last modified. Knowing the owner
of the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at.  The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, if a metadata object is still referenced by its owner, we can
determine whether it is supposed to be free space or still allocated by
comparing when the free space btree block that contains the block was last
written with when the metadata object itself was last written.  If the free
space block is more recent than the object and the object's owner, then there is
a very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block has the Log Sequence
Number (LSN) of the most recent transaction that modified it written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem. Further, by use of
the LSN we can tell if the corrupted metadata all belonged to the same log
checkpoint and hence have some idea of how much modification occurred between
the first and last instance of corrupt metadata on disk and, further, how much
modification occurred between the corruption being written and when it was
detected.
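
The ordering comparison can be sketched as follows, assuming an LSN packs the
log cycle number into the upper 32 bits and the block offset within the log into
the lower 32 bits. All names here are hypothetical stand-ins, not the kernel's
own LSN helpers:

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sketch_lsn_t;	/* stand-in for the kernel's LSN type */

static uint32_t sketch_lsn_cycle(sketch_lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

static uint32_t sketch_lsn_block(sketch_lsn_t lsn)
{
	return (uint32_t)lsn;
}

static sketch_lsn_t sketch_lsn_make(uint32_t cycle, uint32_t block)
{
	return (sketch_lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Returns <0 if a is older than b, 0 if equal, >0 if a is newer. */
static int sketch_lsn_cmp(sketch_lsn_t a, sketch_lsn_t b)
{
	if (sketch_lsn_cycle(a) != sketch_lsn_cycle(b))
		return sketch_lsn_cycle(a) < sketch_lsn_cycle(b) ? -1 : 1;
	if (sketch_lsn_block(a) != sketch_lsn_block(b))
		return sketch_lsn_block(a) < sketch_lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * The free space example from the text: if the free space btree block was
 * written after both the object and its owner, the object was probably freed
 * and the owner's reference to it is stale.
 */
static bool sketch_freed_after_last_write(sketch_lsn_t freespace,
					  sketch_lsn_t object,
					  sketch_lsn_t owner)
{
	return sketch_lsn_cmp(freespace, object) > 0 &&
	       sketch_lsn_cmp(freespace, owner) > 0;
}
```

Blocks whose LSNs share a cycle and sit close together were modified near each
other in the log, which is the correlation the paragraph above relies on.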

Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.
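
The division of labour between caller and IO completion can be sketched in
userspace C. Everything here is hypothetical: the buffer, the ops structure, the
"FOO1" magic and the numeric error value are simplified stand-ins chosen for
illustration, not the kernel's buffer cache interfaces:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for the kernel's EFSCORRUPTED error value. */
#define SKETCH_EFSCORRUPTED	117

struct sketch_buf;

/* Hypothetical per-type operations, chosen by whoever issues the read. */
struct sketch_buf_ops {
	const char	*name;
	bool		(*verify)(const struct sketch_buf *bp);
};

struct sketch_buf {
	const struct sketch_buf_ops	*ops;	/* expected metadata type */
	int				error;	/* 0, IO error or corruption */
	uint8_t				data[64];
};

/* Verifier for a made-up "foo" object whose magic number is "FOO1". */
static bool
sketch_foo_verify(
	const struct sketch_buf	*bp)
{
	return memcmp(bp->data, "FOO1", 4) == 0;
}

static const struct sketch_buf_ops sketch_foo_ops = {
	.name	= "foo",
	.verify	= sketch_foo_verify,
};

/* IO completion: verify only if the read itself succeeded. */
static void
sketch_read_done(
	struct sketch_buf	*bp)
{
	if (bp->error)
		return;
	if (!bp->ops->verify(bp))
		bp->error = SKETCH_EFSCORRUPTED;
}

/* The caller names the type it expects and checks a single error value. */
static int
sketch_read_foo(
	struct sketch_buf	*bp)
{
	bp->ops = &sketch_foo_ops;
	bp->error = 0;
	sketch_read_done(bp);	/* stands in for submitting and completing IO */
	return bp->error;
}
```

Because corruption is reported through the same error field as an IO failure, a
caller that already handles IO errors needs no new code path unless it wants to
treat corruption specially.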

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
	    __be32  magic;		/* magic number */
	    __be32  crc;		/* CRC, not logged */
	    uuid_t  uuid;		/* filesystem identifier */
	    __be64  owner;		/* parent object */
	    __be64  blkno;		/* location on disk */
	    __be64  lsn;		/* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	  number for location. The two of these combined provide the same
	  information as @owner and @blkno in the above structure, but using 8
	  bytes less space on disk.

	- directory/attribute node blocks have a 16 bit magic number, and the
	  header that contains the magic number has other information in it as
	  well. Hence the additional metadata headers change the overall format
	  of the metadata.

A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;

	    /* check the CRC first (if enabled), then the block contents */
	    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
		!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
					    XFS_FOO_CRC_OFF)) ||
		!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
	    }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* uuid, blkno and owner only exist when CRCs are enabled */
	    if (xfs_sb_version_hascrc(&mp->m_sb)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    }

	    /* object specific verification checks here */

	    return true;
    }

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* object specific verification checks here */

	    return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order. A typical write verifier::

    static void
    xfs_foo_write_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_buf_log_item	*bip = bp->b_fspriv;

	    /* verify the in-memory contents before anything else */
	    if (!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
		    return;
	    }

	    if (!xfs_sb_version_hascrc(&mp->m_sb))
		    return;

	    /* stamp the LSN first so that the CRC covers it */
	    if (bip) {
		    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

		    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	    }
	    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.
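
The stamp-then-checksum ordering, and the matching check on read, can be
exercised in userspace with the sketch below. It is not the kernel's checksum
helper code: the field offsets, the big-endian storage of the CRC and all
function names are assumptions made for this example, and the block is assumed
to be at least 16 bytes long:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CRC_OFF	4	/* hypothetical: CRC follows a 4 byte magic */
#define SKETCH_LSN_OFF	8	/* hypothetical: LSN follows the CRC */

/* Bitwise CRC32c step (reflected Castagnoli polynomial). */
static uint32_t
sketch_crc32c(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
	}
	return crc;
}

/* CRC32c of the whole block, treating the four CRC bytes as zero. */
static uint32_t
sketch_block_cksum(const uint8_t *blk, size_t len)
{
	static const uint8_t	zero[4];
	uint32_t		crc = 0xFFFFFFFFu;

	crc = sketch_crc32c(crc, blk, SKETCH_CRC_OFF);
	crc = sketch_crc32c(crc, zero, sizeof(zero));
	crc = sketch_crc32c(crc, blk + SKETCH_CRC_OFF + 4,
			    len - SKETCH_CRC_OFF - 4);
	return ~crc;
}

/* Store the low `bytes` bytes of v at p in big-endian order. */
static void
sketch_put_be(uint8_t *p, uint64_t v, int bytes)
{
	for (int i = bytes - 1; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Write side: stamp the LSN first so the CRC covers it, then store the CRC. */
static void
sketch_write_prepare(uint8_t *blk, size_t len, uint64_t lsn)
{
	sketch_put_be(blk + SKETCH_LSN_OFF, lsn, 8);
	sketch_put_be(blk + SKETCH_CRC_OFF, sketch_block_cksum(blk, len), 4);
}

/* Read side: recompute and compare against the stored value. */
static bool
sketch_read_check(const uint8_t *blk, size_t len)
{
	uint32_t	stored = 0;

	for (int i = 0; i < 4; i++)
		stored = (stored << 8) | blk[SKETCH_CRC_OFF + i];
	return stored == sketch_block_cksum(blk, len);
}
```

Excluding the CRC field from its own calculation is what lets the same routine
serve both directions, and it is also why the CRC and LSN fields are marked
"not logged" in the header: both are filled in at write time.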

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that they contain inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.
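
A per-buffer check of this kind might look like the following userspace sketch.
The function name and the caller-supplied inode size are hypothetical; the only
thing it checks is that a 16 bit big-endian inode magic ("IN") sits at the start
of every inode-sized slot:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DINODE_MAGIC	0x494eu	/* "IN" */

/*
 * Basic identification only: every slot in the buffer must start with the
 * inode magic. No CRC or field validation happens at this level.
 */
static bool
sketch_inode_buf_verify(
	const uint8_t	*buf,
	size_t		buf_len,
	size_t		inode_size)
{
	if (inode_size < 2 || buf_len == 0 || buf_len % inode_size != 0)
		return false;

	for (size_t off = 0; off < buf_len; off += inode_size) {
		uint16_t magic = (uint16_t)((buf[off] << 8) | buf[off + 1]);

		if (magic != SKETCH_DINODE_MAGIC)
			return false;
	}
	return true;
}
```

The loop deliberately stops at the magic number; the per-inode CRC and field
checks described above run later, when an individual inode is pulled out of or
copied back into the buffer.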

The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_inode_from_disk() when the inode
is first read out of the buffer and the struct xfs_inode is instantiated. The
inode is already extensively verified during writeback in xfs_iflush_int(), so
the only addition here is to add the LSN and CRC to the inode as it is copied
back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.