.. SPDX-License-Identifier: GPL-2.0

.. _fsverity:

=======================================================
fs-verity: read-only file-based authenticity protection
=======================================================

Introduction
============

fs-verity (``fs/verity/``) is a support layer that filesystems can
hook into to support transparent integrity and authenticity protection
of read-only files.  Currently, it is supported by the ext4 and f2fs
filesystems.  Like fscrypt, not too much filesystem-specific code is
needed to support fs-verity.

fs-verity is similar to `dm-verity
<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
but works on files rather than block devices.  On regular files on
filesystems supporting fs-verity, userspace can execute an ioctl that
causes the filesystem to build a Merkle tree for the file and persist
it to a filesystem-specific location associated with the file.

After this, the file is made readonly, and all reads from the file are
automatically verified against the file's Merkle tree.  Reads of any
corrupted data, including mmap reads, will fail.

Userspace can use another ioctl to retrieve the root hash (actually
the "fs-verity file digest", which is a hash that includes the Merkle
tree root hash) that fs-verity is enforcing for the file.  This ioctl
executes in constant time, regardless of the file size.

fs-verity is essentially a way to hash a file in constant time,
subject to the caveat that reads which would violate the hash will
fail at runtime.

Use cases
=========

By itself, the base fs-verity feature only provides integrity
protection, i.e. detection of accidental (non-malicious) corruption.

However, because fs-verity makes retrieving the file hash extremely
efficient, it's primarily meant to be used as a tool to support
authentication (detection of malicious modifications) or auditing
(logging file hashes before use).

Trusted userspace code (e.g. operating system code running on a
read-only partition that is itself authenticated by dm-verity) can
authenticate the contents of an fs-verity file by using the
`FS_IOC_MEASURE_VERITY`_ ioctl to retrieve its hash, then verifying a
digital signature of it.

A standard file hash could be used instead of fs-verity.  However,
this is inefficient if the file is large and only a small portion may
be accessed.  This is often the case for Android application package
(APK) files, for example.  These typically contain many translations,
classes, and other resources that are infrequently or even never
accessed on a particular device.  It would be slow and wasteful to
read and hash the entire file before starting the application.

Unlike an ahead-of-time hash, fs-verity also re-verifies data each
time it's paged in.  This ensures that malicious disk firmware can't
undetectably change the contents of the file at runtime.

fs-verity does not replace or obsolete dm-verity.  dm-verity should
still be used on read-only filesystems.  fs-verity is for files that
must live on a read-write filesystem because they are independently
updated and potentially user-installed, so dm-verity cannot be used.

The base fs-verity feature is a hashing mechanism only; actually
authenticating the files is up to userspace.  However, to meet some
users' needs, fs-verity optionally supports a simple signature
verification mechanism where users can configure the kernel to require
that all fs-verity files be signed by a key loaded into a keyring; see
`Built-in signature verification`_.  Support for fs-verity file hashes
in IMA (Integrity Measurement Architecture) policies is also planned.

User API
========

FS_IOC_ENABLE_VERITY
--------------------

The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a file.  It takes
in a pointer to a struct fsverity_enable_arg, defined as
follows::

    struct fsverity_enable_arg {
            __u32 version;
            __u32 hash_algorithm;
            __u32 block_size;
            __u32 salt_size;
            __u64 salt_ptr;
            __u32 sig_size;
            __u32 __reserved1;
            __u64 sig_ptr;
            __u64 __reserved2[11];
    };

This structure contains the parameters of the Merkle tree to build for
the file, and optionally contains a signature.  It must be initialized
as follows:

- ``version`` must be 1.
- ``hash_algorithm`` must be the identifier for the hash algorithm to
  use for the Merkle tree, such as FS_VERITY_HASH_ALG_SHA256.  See
  ``include/uapi/linux/fsverity.h`` for the list of possible values.
- ``block_size`` must be the Merkle tree block size.  Currently, this
  must be equal to the system page size, which is usually 4096 bytes.
  Other sizes may be supported in the future.  This value is not
  necessarily the same as the filesystem block size.
- ``salt_size`` is the size of the salt in bytes, or 0 if no salt is
  provided.  The salt is a value that is prepended to every hashed
  block; it can be used to personalize the hashing for a particular
  file or device.  Currently the maximum salt size is 32 bytes.
- ``salt_ptr`` is the pointer to the salt, or NULL if no salt is
  provided.
- ``sig_size`` is the size of the signature in bytes, or 0 if no
  signature is provided.  Currently the signature is (somewhat
  arbitrarily) limited to 16128 bytes.  See `Built-in signature
  verification`_ for more information.
- ``sig_ptr`` is the pointer to the signature, or NULL if no
  signature is provided.
- All reserved fields must be zeroed.

FS_IOC_ENABLE_VERITY causes the filesystem to build a Merkle tree for
the file and persist it to a filesystem-specific location associated
with the file, then mark the file as a verity file.  This ioctl may
take a long time to execute on large files, and it is interruptible by
fatal signals.

FS_IOC_ENABLE_VERITY checks for write access to the inode.  However,
it must be executed on an O_RDONLY file descriptor and no processes
can have the file open for writing.  Attempts to open the file for
writing while this ioctl is executing will fail with ETXTBSY.  (This
is necessary to guarantee that no writable file descriptors will exist
after verity is enabled, and to guarantee that the file's contents are
stable while the Merkle tree is being built over it.)

On success, FS_IOC_ENABLE_VERITY returns 0, and the file becomes a
verity file.  On failure (including the case of interruption by a
fatal signal), no changes are made to the file.

FS_IOC_ENABLE_VERITY can fail with the following errors:

- ``EACCES``: the process does not have write access to the file
- ``EBADMSG``: the signature is malformed
- ``EBUSY``: this ioctl is already running on the file
- ``EEXIST``: the file already has verity enabled
- ``EFAULT``: the caller provided inaccessible memory
- ``EINTR``: the operation was interrupted by a fatal signal
- ``EINVAL``: unsupported version, hash algorithm, or block size; or
  reserved bits are set; or the file descriptor refers to neither a
  regular file nor a directory.
- ``EISDIR``: the file descriptor refers to a directory
- ``EKEYREJECTED``: the signature doesn't match the file
- ``EMSGSIZE``: the salt or signature is too long
- ``ENOKEY``: the fs-verity keyring doesn't contain the certificate
  needed to verify the signature
- ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
  available in the kernel's crypto API as currently configured (e.g.
  for SHA-512, missing CONFIG_CRYPTO_SHA512).
- ``ENOTTY``: this type of filesystem does not implement fs-verity
- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
  support; or the filesystem superblock has not had the 'verity'
  feature enabled on it; or the filesystem does not support fs-verity
  on this file.  (See `Filesystem support`_.)
- ``EPERM``: the file is append-only; or, a signature is required and
  one was not provided.
- ``EROFS``: the filesystem is read-only
- ``ETXTBSY``: someone has the file open for writing.  This can be the
  caller's file descriptor, another open file descriptor, or the file
  reference held by a writable memory map.

FS_IOC_MEASURE_VERITY
---------------------

The FS_IOC_MEASURE_VERITY ioctl retrieves the digest of a verity file.
The fs-verity file digest is a cryptographic digest that identifies
the file contents that are being enforced on reads; it is computed via
a Merkle tree and is different from a traditional full-file digest.

This ioctl takes in a pointer to a variable-length structure::

    struct fsverity_digest {
            __u16 digest_algorithm;
            __u16 digest_size; /* input/output */
            __u8 digest[];
    };

``digest_size`` is an input/output field.  On input, it must be
initialized to the number of bytes allocated for the variable-length
``digest`` field.

On success, 0 is returned and the kernel fills in the structure as
follows:

- ``digest_algorithm`` will be the hash algorithm used for the file
  digest.  It will match ``fsverity_enable_arg::hash_algorithm``.
- ``digest_size`` will be the size of the digest in bytes, e.g. 32
  for SHA-256.  (This can be redundant with ``digest_algorithm``.)
- ``digest`` will be the actual bytes of the digest.

FS_IOC_MEASURE_VERITY is guaranteed to execute in constant time,
regardless of the size of the file.

FS_IOC_MEASURE_VERITY can fail with the following errors:

- ``EFAULT``: the caller provided inaccessible memory
- ``ENODATA``: the file is not a verity file
- ``ENOTTY``: this type of filesystem does not implement fs-verity
- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
  support, or the filesystem superblock has not had the 'verity'
  feature enabled on it.  (See `Filesystem support`_.)
- ``EOVERFLOW``: the digest is longer than the specified
  ``digest_size`` bytes.  Try providing a larger buffer.
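
As an illustration of the input/output use of ``digest_size``, the
following sketch allocates the variable-length structure and copies the
result out.  The helper name ``measure_verity()`` and the constant
``MAX_DIGEST_SIZE`` are assumptions made for this example; 64 bytes is
chosen because it is large enough for a SHA-512 digest.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fsverity.h>

#define MAX_DIGEST_SIZE 64  /* assumption: enough for SHA-512 */

/*
 * Hypothetical helper: retrieve the fs-verity file digest of `fd`.
 * On success, returns the digest size in bytes, stores the algorithm
 * identifier in *alg and the digest in out[].  On failure, returns -1
 * with errno set (e.g. ENODATA if `fd` is not a verity file).
 */
int measure_verity(int fd, uint8_t out[MAX_DIGEST_SIZE], uint16_t *alg)
{
    struct fsverity_digest *d;
    int ret = -1;
    int saved_errno;

    d = calloc(1, sizeof(*d) + MAX_DIGEST_SIZE);
    if (!d)
        return -1;

    /* Input: number of bytes available in the digest[] array. */
    d->digest_size = MAX_DIGEST_SIZE;

    if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) == 0) {
        /* Output: the kernel has filled in all three fields. */
        *alg = d->digest_algorithm;
        memcpy(out, d->digest, d->digest_size);
        ret = d->digest_size;
    }

    /* Preserve the ioctl's errno across free(). */
    saved_errno = errno;
    free(d);
    errno = saved_errno;
    return ret;
}
```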

FS_IOC_READ_VERITY_METADATA
---------------------------

The FS_IOC_READ_VERITY_METADATA ioctl reads verity metadata from a
verity file.  This ioctl is available since Linux v5.12.

This ioctl allows writing a server program that takes a verity file
and serves it to a client program, such that the client can do its own
fs-verity compatible verification of the file.  This only makes sense
if the client doesn't trust the server and if the server needs to
provide the storage for the client.

This is a fairly specialized use case, and most fs-verity users won't
need this ioctl.

This ioctl takes in a pointer to the following structure::

   #define FS_VERITY_METADATA_TYPE_MERKLE_TREE     1
   #define FS_VERITY_METADATA_TYPE_DESCRIPTOR      2
   #define FS_VERITY_METADATA_TYPE_SIGNATURE       3

   struct fsverity_read_metadata_arg {
           __u64 metadata_type;
           __u64 offset;
           __u64 length;
           __u64 buf_ptr;
           __u64 __reserved;
   };

``metadata_type`` specifies the type of metadata to read:

- ``FS_VERITY_METADATA_TYPE_MERKLE_TREE`` reads the blocks of the
  Merkle tree.  The blocks are returned in order from the root level
  to the leaf level.  Within each level, the blocks are returned in
  the same order that their hashes are themselves hashed.
  See `Merkle tree`_ for more information.

- ``FS_VERITY_METADATA_TYPE_DESCRIPTOR`` reads the fs-verity
  descriptor.  See `fs-verity descriptor`_.

- ``FS_VERITY_METADATA_TYPE_SIGNATURE`` reads the signature which was
  passed to FS_IOC_ENABLE_VERITY, if any.  See `Built-in signature
  verification`_.

The semantics are similar to those of ``pread()``.  ``offset``
specifies the offset in bytes into the metadata item to read from, and
``length`` specifies the maximum number of bytes to read from the
metadata item.  ``buf_ptr`` is the pointer to the buffer to read into,
cast to a 64-bit integer.  ``__reserved`` must be 0.  On success, the
number of bytes read is returned.  0 is returned at the end of the
metadata item.  The returned length may be less than ``length``, for
example if the ioctl is interrupted.

The metadata returned by FS_IOC_READ_VERITY_METADATA isn't guaranteed
to be authenticated against the file digest that would be returned by
`FS_IOC_MEASURE_VERITY`_, as the metadata is expected to be used to
implement fs-verity compatible verification anyway (though absent a
malicious disk, the metadata will indeed match).  E.g. to implement
this ioctl, the filesystem is allowed to just read the Merkle tree
blocks from disk without actually verifying the path to the root node.

FS_IOC_READ_VERITY_METADATA can fail with the following errors:

- ``EFAULT``: the caller provided inaccessible memory
- ``EINTR``: the ioctl was interrupted before any data was read
- ``EINVAL``: reserved fields were set, or ``offset + length``
  overflowed
- ``ENODATA``: the file is not a verity file, or
  FS_VERITY_METADATA_TYPE_SIGNATURE was requested but the file doesn't
  have a built-in signature
- ``ENOTTY``: this type of filesystem does not implement fs-verity, or
  this ioctl is not yet implemented on it
- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
  support, or the filesystem superblock has not had the 'verity'
  feature enabled on it.  (See `Filesystem support`_.)
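
Because short reads are allowed, callers generally need a loop, much
as with ``pread()``.  The sketch below shows one way to write it.  The
helper name ``read_verity_metadata()`` is hypothetical, and the
``#ifndef`` fallback is an assumption added so that the example also
builds against UAPI headers older than v5.12; it repeats the
definitions shown above.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/fsverity.h>

/* Fallback for UAPI headers that predate Linux v5.12. */
#ifndef FS_IOC_READ_VERITY_METADATA
#define FS_VERITY_METADATA_TYPE_MERKLE_TREE 1
#define FS_VERITY_METADATA_TYPE_DESCRIPTOR  2
#define FS_VERITY_METADATA_TYPE_SIGNATURE   3
struct fsverity_read_metadata_arg {
    __u64 metadata_type;
    __u64 offset;
    __u64 length;
    __u64 buf_ptr;
    __u64 __reserved;
};
#define FS_IOC_READ_VERITY_METADATA \
    _IOWR('f', 135, struct fsverity_read_metadata_arg)
#endif

/*
 * Hypothetical helper: read up to `len` bytes from the start of one
 * metadata item into `buf`.  Returns the number of bytes read (less
 * than `len` if the item is shorter), or -1 with errno set.
 */
ssize_t read_verity_metadata(int fd, uint64_t type, void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        struct fsverity_read_metadata_arg arg;
        int n;

        memset(&arg, 0, sizeof(arg));   /* __reserved must be 0 */
        arg.metadata_type = type;
        arg.offset = done;
        arg.length = len - done;
        arg.buf_ptr = (uint64_t)(uintptr_t)((char *)buf + done);

        n = ioctl(fd, FS_IOC_READ_VERITY_METADATA, &arg);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* nothing was read; retry */
            return -1;
        }
        if (n == 0)
            break;                      /* end of the metadata item */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

As noted above, the returned bytes are not authenticated by the
kernel, so a client is expected to verify them itself.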

FS_IOC_GETFLAGS
---------------

The existing ioctl FS_IOC_GETFLAGS (which isn't specific to fs-verity)
can also be used to check whether a file has fs-verity enabled or not.
To do so, check for FS_VERITY_FL (0x00100000) in the returned flags.

The verity flag is not settable via FS_IOC_SETFLAGS.  You must use
FS_IOC_ENABLE_VERITY instead, since parameters must be provided.
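
A minimal sketch of this check follows.  The helper name
``fd_has_verity()`` is hypothetical, and the fallback definition of
``FS_VERITY_FL`` is an assumption for older headers that simply reuses
the value quoted above.

```c
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Fallback for older UAPI headers; value as documented above. */
#ifndef FS_VERITY_FL
#define FS_VERITY_FL 0x00100000
#endif

/*
 * Hypothetical helper: returns 1 if `fd` refers to a verity file,
 * 0 if it does not, or -1 with errno set if the flags can't be read.
 */
int fd_has_verity(int fd)
{
    int flags = 0;

    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) != 0)
        return -1;
    return (flags & FS_VERITY_FL) ? 1 : 0;
}
```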

statx
-----

Since Linux v5.5, the statx() system call sets STATX_ATTR_VERITY if
the file has fs-verity enabled.  This can perform better than
FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
opening the file, and opening verity files can be expensive.

Accessing verity files
======================

Applications can transparently access a verity file just like a
non-verity one, with the following exceptions:

- Verity files are readonly.  They cannot be opened for writing or
  truncate()d, even if the file mode bits allow it.  Attempts to do
  one of these things will fail with EPERM.  However, changes to
  metadata such as owner, mode, timestamps, and xattrs are still
  allowed, since these are not measured by fs-verity.  Verity files
  can also still be renamed, deleted, and linked to.

- Direct I/O is not supported on verity files.  Attempts to use direct
  I/O on such files will fall back to buffered I/O.

- DAX (Direct Access) is not supported on verity files, because this
  would circumvent the data verification.

- Reads of data that doesn't match the verity Merkle tree will fail
  with EIO (for read()) or SIGBUS (for mmap() reads).

- If the sysctl "fs.verity.require_signatures" is set to 1 and the
  file is not signed by a key in the fs-verity keyring, then opening
  the file will fail.  See `Built-in signature verification`_.

Direct access to the Merkle tree is not supported.  Therefore, if a
verity file is copied, or is backed up and restored, then it will lose
its "verity"-ness.  fs-verity is primarily meant for files like
executables that are managed by a package manager.

File digest computation
=======================

This section describes how fs-verity hashes the file contents using a
Merkle tree to produce the digest which cryptographically identifies
the file contents.  This algorithm is the same for all filesystems
that support fs-verity.

Userspace only needs to be aware of this algorithm if it needs to
compute fs-verity file digests itself, e.g. in order to sign files.

.. _fsverity_merkle_tree:

Merkle tree
-----------

The file contents are divided into blocks, where the block size is
configurable but is usually 4096 bytes.  The end of the last block is
zero-padded if needed.  Each block is then hashed, producing the first
level of hashes.  Then, the hashes in this first level are grouped
into 'blocksize'-byte blocks (zero-padding the ends as needed) and
these blocks are hashed, producing the second level of hashes.  This
proceeds up the tree until only a single block remains.  The hash of
this block is the "Merkle tree root hash".

If the file fits in one block and is nonempty, then the "Merkle tree
root hash" is simply the hash of the single data block.  If the file
is empty, then the "Merkle tree root hash" is all zeroes.

The "blocks" here are not necessarily the same as "filesystem blocks".

If a salt was specified, then it's zero-padded to the closest multiple
of the input size of the hash algorithm's compression function, e.g.
64 bytes for SHA-256 or 128 bytes for SHA-512.  The padded salt is
prepended to every data or Merkle tree block that is hashed.

The purpose of the block padding is to cause every hash to be taken
over the same amount of data, which simplifies the implementation and
keeps open more possibilities for hardware acceleration.  The purpose
of the salt padding is to make the salting "free" when the salted hash
state is precomputed, then imported for each hash.
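
To make the two kinds of padding concrete, the sketch below lays out
the bytes that are fed to the hash function for one block: the
zero-padded salt, followed by the zero-padded block.  Both function
names are hypothetical, the salt is rounded *up* to the next multiple
of the compression function's input size, and the hashing itself is
left out because it needs a crypto library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Round the salt size up to a multiple of the hash algorithm's
 * compression-function input size (64 for SHA-256, 128 for SHA-512).
 * A salt size of 0 stays 0.
 */
size_t padded_salt_size(size_t salt_size, size_t hash_block_size)
{
    return (salt_size + hash_block_size - 1) / hash_block_size *
           hash_block_size;
}

/*
 * Hypothetical helper: write "padded salt || zero-padded block" to
 * `out` and return the total length.  `data_len` must not exceed
 * `block_size`, and `out` must have room for the padded salt plus
 * `block_size` bytes.
 */
size_t build_hash_input(uint8_t *out,
                        const uint8_t *salt, size_t salt_size,
                        size_t hash_block_size,
                        const uint8_t *data, size_t data_len,
                        size_t block_size)
{
    size_t salt_len = padded_salt_size(salt_size, hash_block_size);

    /* Start from all zeroes so that both paddings are implicit. */
    memset(out, 0, salt_len + block_size);
    if (salt_size)
        memcpy(out, salt, salt_size);
    memcpy(out + salt_len, data, data_len);
    return salt_len + block_size;
}
```

Every hash input built this way has the same length, which is the
property the block padding is meant to provide.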

Example: in the recommended configuration of SHA-256 and 4K blocks,
128 hash values fit in each block.  Thus, each level of the Merkle
tree is approximately 128 times smaller than the previous, and for
large files the Merkle tree's size converges to approximately 1/127 of
the original file size.  However, for small files, the padding is
significant, making the space overhead proportionally more.
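
The arithmetic in this example can be checked with a short sketch that
counts the Merkle tree blocks level by level.  The function name
``merkle_tree_blocks()`` is hypothetical; the loop follows the
description above, where levels are added until a single block remains
and a file that fits in one block needs no tree blocks at all.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of Merkle tree blocks needed for a file
 * of `data_size` bytes, given the tree block size and the digest size
 * of the hash algorithm (e.g. 4096 and 32 for SHA-256 with 4K blocks).
 */
uint64_t merkle_tree_blocks(uint64_t data_size, uint32_t block_size,
                            uint32_t digest_size)
{
    uint64_t hashes_per_block = block_size / digest_size;
    /* Number of data blocks, rounding the last partial block up. */
    uint64_t blocks = data_size / block_size +
                      (data_size % block_size != 0);
    uint64_t total = 0;

    /* Each pass hashes one level into the blocks of the next level. */
    while (blocks > 1) {
        blocks = (blocks + hashes_per_block - 1) / hashes_per_block;
        total += blocks;
    }
    return total;
}
```

For a 1 GiB file this gives 2048 + 16 + 1 = 2065 blocks, i.e. about
8.07 MiB, which is close to 1/127 of the file size.  A file of 4097
bytes already needs one full 4096-byte tree block, which illustrates
the proportionally larger overhead for small files.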

.. _fsverity_descriptor:

fs-verity descriptor
--------------------

By itself, the Merkle tree root hash is ambiguous.  For example, it
can't distinguish a large file from a small second file whose data
is exactly the top-level hash block of the first file.  Ambiguities
also arise from the convention of padding to the next block boundary.

To solve this problem, the fs-verity file digest is actually computed
as a hash of the following structure, which contains the Merkle tree
root hash as well as other fields such as the file size::

    struct fsverity_descriptor {
            __u8 version;           /* must be 1 */
            __u8 hash_algorithm;    /* Merkle tree hash algorithm */
            __u8 log_blocksize;     /* log2 of size of data and tree blocks */
            __u8 salt_size;         /* size of salt in bytes; 0 if none */
            __le32 __reserved_0x04; /* must be 0 */
            __le64 data_size;       /* size of file the Merkle tree is built over */
            __u8 root_hash[64];     /* Merkle tree root hash */
            __u8 salt[32];          /* salt prepended to each hashed block */
            __u8 __reserved[144];   /* must be 0's */
    };
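
For userspace that needs to compute the digest itself, the structure
can be mirrored and filled in as sketched below.  The type name
``fsverity_descriptor_example`` and the helpers ``to_le64()`` and
``fill_descriptor()`` are hypothetical.  Hashing the resulting 256
bytes with the file's hash algorithm is left out because it requires a
crypto library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the structure above (hypothetical name). */
struct fsverity_descriptor_example {
    uint8_t  version;           /* must be 1 */
    uint8_t  hash_algorithm;
    uint8_t  log_blocksize;
    uint8_t  salt_size;
    uint32_t reserved_0x04;     /* must be 0 */
    uint64_t data_size;         /* stored little-endian */
    uint8_t  root_hash[64];
    uint8_t  salt[32];
    uint8_t  reserved[144];     /* must be 0's */
};

_Static_assert(sizeof(struct fsverity_descriptor_example) == 256,
               "descriptor is expected to be 256 bytes");

/* Return a value whose in-memory bytes are `v` in little-endian order. */
static uint64_t to_le64(uint64_t v)
{
    uint8_t b[8];
    uint64_t out;
    int i;

    for (i = 0; i < 8; i++)
        b[i] = (uint8_t)(v >> (8 * i));
    memcpy(&out, b, sizeof(out));
    return out;
}

/* Fill in *d; returns 0, or -1 if the root hash or salt is too long. */
int fill_descriptor(struct fsverity_descriptor_example *d,
                    uint8_t hash_algorithm, uint8_t log_blocksize,
                    uint64_t data_size,
                    const uint8_t *root_hash, size_t root_hash_size,
                    const uint8_t *salt, size_t salt_size)
{
    if (root_hash_size > sizeof(d->root_hash) ||
        salt_size > sizeof(d->salt))
        return -1;

    memset(d, 0, sizeof(*d));   /* zeroes reserved fields and padding */
    d->version = 1;
    d->hash_algorithm = hash_algorithm;
    d->log_blocksize = log_blocksize;
    d->salt_size = (uint8_t)salt_size;
    d->data_size = to_le64(data_size);
    memcpy(d->root_hash, root_hash, root_hash_size);
    if (salt_size)
        memcpy(d->salt, salt, salt_size);
    return 0;
}
```

Because the file size and block size are part of the hashed structure,
two files with the same root hash but different sizes end up with
different file digests.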

Built-in signature verification
===============================

With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
a portion of an authentication policy (see `Use cases`_) in the
kernel.  Specifically, it adds support for:

1. At fs-verity module initialization time, a keyring ".fs-verity" is
   created.  The root user can add trusted X.509 certificates to this
   keyring using the add_key() system call, then (when done)
   optionally use keyctl_restrict_keyring() to prevent additional
   certificates from being added.

2. `FS_IOC_ENABLE_VERITY`_ accepts a pointer to a PKCS#7 formatted
   detached signature in DER format of the file's fs-verity digest.
   On success, this signature is persisted alongside the Merkle tree.
   Then, any time the file is opened, the kernel will verify the
   file's actual digest against this signature, using the certificates
   in the ".fs-verity" keyring.

3. A new sysctl "fs.verity.require_signatures" is made available.
   When set to 1, the kernel requires that all verity files have a
   correctly signed digest as described in (2).

fs-verity file digests must be signed in the following format, which
is similar to the structure used by `FS_IOC_MEASURE_VERITY`_::

    struct fsverity_formatted_digest {
            char magic[8];                  /* must be "FSVerity" */
            __le16 digest_algorithm;
            __le16 digest_size;
            __u8 digest[];
    };
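
A signing tool has to serialize this structure before passing it to
its PKCS#7 implementation.  The sketch below shows only the
serialization step; the function name ``format_digest()`` is
hypothetical, and producing the signature itself is out of scope here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: serialize a fsverity_formatted_digest into
 * `out`.  Returns the number of bytes written (12 + digest_size), or
 * 0 if `out_size` is too small.
 */
size_t format_digest(uint8_t *out, size_t out_size,
                     uint16_t digest_algorithm,
                     const uint8_t *digest, uint16_t digest_size)
{
    size_t total = 8 + 2 + 2 + (size_t)digest_size;

    if (out_size < total)
        return 0;

    memcpy(out, "FSVerity", 8);             /* magic, no NUL terminator */
    out[8]  = (uint8_t)(digest_algorithm & 0xff);   /* __le16 */
    out[9]  = (uint8_t)(digest_algorithm >> 8);
    out[10] = (uint8_t)(digest_size & 0xff);        /* __le16 */
    out[11] = (uint8_t)(digest_size >> 8);
    memcpy(out + 12, digest, digest_size);
    return total;
}
```

For a SHA-256 digest this produces 44 bytes: the 12-byte header
followed by the 32-byte digest.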
453*4882a593Smuzhiyun
454*4882a593Smuzhiyunfs-verity's built-in signature verification support is meant as a
455*4882a593Smuzhiyunrelatively simple mechanism that can be used to provide some level of
456*4882a593Smuzhiyunauthenticity protection for verity files, as an alternative to doing
457*4882a593Smuzhiyunthe signature verification in userspace or using IMA-appraisal.
458*4882a593SmuzhiyunHowever, with this mechanism, userspace programs still need to check
459*4882a593Smuzhiyunthat the verity bit is set, and there is no protection against verity
460*4882a593Smuzhiyunfiles being swapped around.
461*4882a593Smuzhiyun
462*4882a593SmuzhiyunFilesystem support
463*4882a593Smuzhiyun==================
464*4882a593Smuzhiyun
465*4882a593Smuzhiyunfs-verity is currently supported by the ext4 and f2fs filesystems.
466*4882a593SmuzhiyunThe CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity
467*4882a593Smuzhiyunon either filesystem.
468*4882a593Smuzhiyun
469*4882a593Smuzhiyun``include/linux/fsverity.h`` declares the interface between the
470*4882a593Smuzhiyun``fs/verity/`` support layer and filesystems.  Briefly, filesystems
471*4882a593Smuzhiyunmust provide an ``fsverity_operations`` structure that provides
472*4882a593Smuzhiyunmethods to read and write the verity metadata to a filesystem-specific
473*4882a593Smuzhiyunlocation, including the Merkle tree blocks and
474*4882a593Smuzhiyun``fsverity_descriptor``.  Filesystems must also call functions in
475*4882a593Smuzhiyun``fs/verity/`` at certain times, such as when a file is opened or when
476*4882a593Smuzhiyunpages have been read into the pagecache.  (See `Verifying data`_.)
477*4882a593Smuzhiyun
ext4
----

ext4 has supported fs-verity since Linux v5.4 and e2fsprogs v1.45.2.

To create verity files on an ext4 filesystem, the filesystem must have
been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on
it.  "verity" is an RO_COMPAT filesystem feature, so once set, old
kernels will only be able to mount the filesystem readonly, and old
versions of e2fsck will be unable to check the filesystem.  Moreover,
currently ext4 only supports mounting a filesystem with the "verity"
feature when its block size is equal to PAGE_SIZE (often 4096 bytes).

ext4 sets the EXT4_VERITY_FL on-disk inode flag on verity files.  It
can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be cleared.

ext4 also supports encryption, which can be used simultaneously with
fs-verity.  In this case, the plaintext data is verified rather than
the ciphertext.  This is necessary in order to make the fs-verity file
digest meaningful, since every file is encrypted differently.

ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
i_size.  This approach works because (a) verity files are readonly,
and (b) pages fully beyond i_size aren't visible to userspace but can
be read/written internally by ext4 with only some relatively small
changes to ext4.  This approach avoids having to depend on the
EA_INODE feature and on rearchitecting ext4's xattr support to
support paging multi-gigabyte xattrs into memory, and to support
encrypting xattrs.  Note that the verity metadata *must* be encrypted
when the file is, since it contains hashes of the plaintext data.

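As a rough illustration of the placement rule above, the following
userspace sketch computes where the verity metadata would start for a
given file size.  The helper name ``verity_metadata_pos()`` and the
``VERITY_METADATA_ALIGN`` macro are hypothetical names chosen for this
example, and the sketch assumes round-up semantics, i.e. a size that
is already a multiple of 64K maps to itself:

```c
#include <assert.h>
#include <stdint.h>

/* Alignment of the verity metadata region, per the text above (64K). */
#define VERITY_METADATA_ALIGN 65536ULL

/*
 * Hypothetical helper: return the file offset at which the verity
 * metadata would start for a file whose i_size is `i_size` bytes.
 * Assumption: sizes that are already 64K-aligned are returned as-is.
 */
static uint64_t verity_metadata_pos(uint64_t i_size)
{
	/* Guard against overflow in the round-up below. */
	assert(i_size <= UINT64_MAX - (VERITY_METADATA_ALIGN - 1));

	return (i_size + VERITY_METADATA_ALIGN - 1) &
	       ~(VERITY_METADATA_ALIGN - 1);
}
```

Under this assumption, a 1-byte file and a 65536-byte file would both
place their metadata at offset 65536, while a 65537-byte file would
use offset 131072.
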
Currently, ext4 verity only supports the case where the Merkle tree
block size, filesystem block size, and page size are all the same.  It
also only supports extent-based files.

f2fs
----

f2fs has supported fs-verity since Linux v5.4 and f2fs-tools v1.11.0.

To create verity files on an f2fs filesystem, the filesystem must have
been formatted with ``-O verity``.

f2fs sets the FADVISE_VERITY_BIT on-disk inode flag on verity files.
It can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be
cleared.

Like ext4, f2fs stores the verity metadata (Merkle tree and
fsverity_descriptor) past the end of the file, starting at the first
64K boundary beyond i_size.  See explanation for ext4 above.
Moreover, f2fs supports at most 4096 bytes of xattr entries per inode,
which wouldn't be enough for even a single Merkle tree block.

Currently, f2fs verity only supports a Merkle tree block size of 4096.
Also, f2fs doesn't support enabling verity on files that currently
have atomic or volatile writes pending.

Implementation details
======================

Verifying data
--------------

fs-verity ensures that all reads of a verity file's data are verified,
regardless of which syscall is used to do the read (e.g. mmap(),
read(), pread()) and regardless of whether it's the first read or a
later read (unless the later read can return cached data that was
already verified).  Below, we describe how filesystems implement this.

Pagecache
~~~~~~~~~

For filesystems using Linux's pagecache, the ``->readpage()`` and
``->readpages()`` methods must be modified to verify pages before they
are marked Uptodate.  Merely hooking ``->read_iter()`` would be
insufficient, since ``->read_iter()`` is not used for memory maps.

Therefore, fs/verity/ provides a function fsverity_verify_page() which
verifies a page that has been read into the pagecache of a verity
inode, but is still locked and not Uptodate, so it's not yet readable
by userspace.  As needed to do the verification,
fsverity_verify_page() will call back into the filesystem to read
Merkle tree pages via fsverity_operations::read_merkle_tree_page().

fsverity_verify_page() returns false if verification failed; in this
case, the filesystem must not set the page Uptodate.  Following this,
as per the usual Linux pagecache behavior, attempts by userspace to
read() from the part of the file containing the page will fail with
EIO, and accesses to the page within a memory map will raise SIGBUS.

fsverity_verify_page() currently only supports the case where the
Merkle tree block size is equal to PAGE_SIZE (often 4096 bytes).

In principle, fsverity_verify_page() verifies the entire path in the
Merkle tree from the data page to the root hash.  However, for
efficiency the filesystem may cache the hash pages.  Therefore,
fsverity_verify_page() only ascends the tree reading hash pages until
an already-verified hash page is seen, as indicated by the PageChecked
bit being set.  It then verifies the path to that page.

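The "ascend until an already-verified hash page" idea can be sketched
in userspace with a toy two-level tree.  Everything in this example is
an assumption made for illustration: the ``toy_*`` names, the fanout
of 4 (instead of 128), one-word "data blocks", and FNV-1a standing in
for a cryptographic hash.  It is not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FANOUT   4                  /* hashes per hash block (toy value) */
#define NUM_DATA (FANOUT * FANOUT)  /* data blocks in the toy file */

struct toy_tree {
	uint64_t data[NUM_DATA];        /* each "data block" is one word */
	uint64_t leaf[FANOUT][FANOUT];  /* level 0: hashes of data blocks */
	uint64_t top[FANOUT];           /* level 1: hashes of level-0 blocks */
	uint64_t root_hash;             /* trusted hash of the top block */
	bool leaf_checked[FANOUT];      /* stand-in for PageChecked */
	bool top_checked;
	unsigned int hash_blocks_verified;
};

/* FNV-1a, used only as a stand-in for a cryptographic hash. */
static uint64_t toy_hash(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 14695981039346656037ULL;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Build both hash levels and the root hash from t->data. */
static void toy_build(struct toy_tree *t)
{
	for (size_t i = 0; i < NUM_DATA; i++)
		t->leaf[i / FANOUT][i % FANOUT] =
			toy_hash(&t->data[i], sizeof(t->data[i]));
	for (size_t i = 0; i < FANOUT; i++) {
		t->top[i] = toy_hash(t->leaf[i], sizeof(t->leaf[i]));
		t->leaf_checked[i] = false;
	}
	t->root_hash = toy_hash(t->top, sizeof(t->top));
	t->top_checked = false;
	t->hash_blocks_verified = 0;
}

/* Verify data block idx, re-checking only hash blocks not yet checked. */
static bool toy_verify_block(struct toy_tree *t, size_t idx)
{
	size_t l = idx / FANOUT;

	assert(idx < NUM_DATA);
	if (!t->leaf_checked[l]) {
		if (!t->top_checked) {
			if (toy_hash(t->top, sizeof(t->top)) != t->root_hash)
				return false;
			t->top_checked = true;
			t->hash_blocks_verified++;
		}
		if (toy_hash(t->leaf[l], sizeof(t->leaf[l])) != t->top[l])
			return false;
		t->leaf_checked[l] = true;
		t->hash_blocks_verified++;
	}
	return toy_hash(&t->data[idx], sizeof(t->data[idx])) ==
	       t->leaf[l][idx % FANOUT];
}
```

In this toy, verifying all 16 data blocks in order checks each of the
5 hash blocks only once; every other lookup stops at a block whose
"checked" flag is already set.
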
This optimization, which is also used by dm-verity, results in
excellent sequential read performance.  This is because usually (e.g.
127 in 128 times for 4K blocks and SHA-256) the hash page from the
bottom level of the tree will already be cached and checked from
reading a previous data page.  However, random reads perform worse.

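The "127 in 128" figure follows from the tree geometry: a 4096-byte
hash block holds 4096 / 32 = 128 SHA-256 digests, so 128 consecutive
data blocks share one bottom-level hash block, and only the first of
them has to fetch and check it.  The sketch below works out that
arithmetic and the size of each tree level; the function names and
``TOY_MAX_LEVELS`` are hypothetical and not part of any kernel API:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_LEVELS 8

/* Number of digests that fit in one Merkle tree block. */
static unsigned int hashes_per_block(unsigned int block_size,
				     unsigned int digest_size)
{
	assert(digest_size > 0 && block_size >= 2 * digest_size);
	return block_size / digest_size;
}

/*
 * Fill blocks_per_level[] for a file of data_blocks blocks, where
 * level 0 holds the hashes of the data blocks, and return the number
 * of levels.  Each level needs ceil(n / arity) blocks to hold the
 * n hashes of the level below it.
 */
static unsigned int merkle_levels(uint64_t data_blocks, unsigned int arity,
				  uint64_t blocks_per_level[TOY_MAX_LEVELS])
{
	unsigned int levels = 0;
	uint64_t n = data_blocks;

	assert(arity >= 2);
	while (n > 1) {
		assert(levels < TOY_MAX_LEVELS);
		n = (n + arity - 1) / arity;
		blocks_per_level[levels++] = n;
	}
	return levels;
}
```

For a 1 GiB file with 4K blocks (262144 data blocks) and an arity of
128, this yields three levels of 2048, 16, and 1 blocks, so the bottom
level alone accounts for 2048 of the 2065 hash blocks.
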
Block device based filesystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Block device based filesystems (e.g. ext4 and f2fs) in Linux also use
the pagecache, so the above subsection applies too.  However, they
also usually read many pages from a file at once, grouped into a
structure called a "bio".  To make it easier for these types of
filesystems to support fs-verity, fs/verity/ also provides a function
fsverity_verify_bio() which verifies all pages in a bio.

ext4 and f2fs also support encryption.  If a verity file is also
encrypted, the pages must be decrypted before being verified.  To
support this, these filesystems allocate a "post-read context" for
each bio and store it in ``->bi_private``::

    struct bio_post_read_ctx {
           struct bio *bio;
           struct work_struct work;
           unsigned int cur_step;
           unsigned int enabled_steps;
    };

``enabled_steps`` is a bitmask that specifies whether decryption,
verity, or both are enabled.  After the bio completes, for each needed
postprocessing step the filesystem enqueues the bio_post_read_ctx on a
workqueue, and then the workqueue work does the decryption or
verification.  Finally, pages where no decryption or verity error
occurred are marked Uptodate, and the pages are unlocked.

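The way ``cur_step`` and ``enabled_steps`` drive the post-read work
can be sketched as a small state machine.  This is a userspace
approximation only: the ``TOY_STEP_*`` values and ``toy_next_step()``
are hypothetical names, and the workqueue handling, the bio itself,
and the actual decryption and verification are left out:

```c
#include <assert.h>

enum toy_step {
	TOY_STEP_INITIAL = 0,
	TOY_STEP_DECRYPT,
	TOY_STEP_VERITY,
	TOY_STEP_DONE,
};

struct toy_post_read_ctx {
	unsigned int cur_step;
	unsigned int enabled_steps;	/* bitmask of (1 << TOY_STEP_*) */
};

/*
 * Advance ctx to the next enabled step and return it.  Returns
 * TOY_STEP_DONE once no enabled steps remain, at which point the
 * caller would mark error-free pages Uptodate and unlock them.
 */
static enum toy_step toy_next_step(struct toy_post_read_ctx *ctx)
{
	assert(ctx->cur_step < TOY_STEP_DONE);

	/* Skip over steps whose bit is not set in enabled_steps. */
	while (++ctx->cur_step < TOY_STEP_DONE) {
		if (ctx->enabled_steps & (1U << ctx->cur_step))
			break;
	}
	return (enum toy_step)ctx->cur_step;
}
```

Because the steps are visited in increasing order, decryption always
runs before verification when both bits are set, which matches the
requirement that the plaintext is what gets verified.
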
Files on ext4 and f2fs may contain holes.  Normally, ``->readpages()``
simply zeroes holes and sets the corresponding pages Uptodate; no bios
are issued.  To prevent this case from bypassing fs-verity, these
filesystems use fsverity_verify_page() to verify hole pages.

ext4 and f2fs disable direct I/O on verity files, since otherwise
direct I/O would bypass fs-verity.  (They also do the same for
encrypted files.)

Userspace utility
=================

This document focuses on the kernel, but a userspace utility for
fs-verity can be found at:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git

See the README.md file in the fsverity-utils source tree for details,
including examples of setting up fs-verity protected files.

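For readers who want to see what a userspace caller looks like without
the utility, here is a minimal sketch that enables verity on a file
via `FS_IOC_ENABLE_VERITY`_.  It assumes the ``<linux/fsverity.h>``
UAPI header is installed, uses SHA-256 with 4096-byte Merkle tree
blocks and no salt or signature, and the wrapper name
``enable_verity_sha256()`` is made up for this example:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fsverity.h>

/*
 * Hypothetical wrapper: enable verity on `path`.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static int enable_verity_sha256(const char *path)
{
	struct fsverity_enable_arg arg;
	int fd, ret, saved_errno;

	assert(path != NULL);

	/* The ioctl is issued on a read-only file descriptor. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Zero everything so unused and reserved fields are 0. */
	memset(&arg, 0, sizeof(arg));
	arg.version = 1;
	arg.hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
	arg.block_size = 4096;

	ret = ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);

	/* Preserve the ioctl's errno across close(). */
	saved_errno = errno;
	close(fd);
	if (ret < 0)
		errno = saved_errno;
	return ret;
}
```

On a filesystem or kernel without fs-verity support the call is
expected to fail, so callers should check the return value and
``errno`` rather than assume success.
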
Tests
=====

To test fs-verity, use xfstests.  For example, using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::

    kvm-xfstests -c ext4,f2fs -g verity

FAQ
===

This section answers frequently asked questions about fs-verity that
weren't already directly answered in other parts of this document.

:Q: Why isn't fs-verity part of IMA?
:A: fs-verity and IMA (Integrity Measurement Architecture) have
    different focuses.  fs-verity is a filesystem-level mechanism for
    hashing individual files using a Merkle tree.  In contrast, IMA
    specifies a system-wide policy that specifies which files are
    hashed and what to do with those hashes, such as log them,
    authenticate them, or add them to a measurement list.

    IMA is planned to support the fs-verity hashing mechanism as an
    alternative to doing full file hashes, for people who want the
    performance and security benefits of the Merkle tree based hash.
    But it doesn't make sense to force all uses of fs-verity to be
    through IMA.  As a standalone filesystem feature, fs-verity
    already meets many users' needs, and it's testable like other
    filesystem features e.g. with xfstests.

:Q: Isn't fs-verity useless because the attacker can just modify the
    hashes in the Merkle tree, which is stored on-disk?
:A: To verify the authenticity of an fs-verity file you must verify
    the authenticity of the "fs-verity file digest", which
    incorporates the root hash of the Merkle tree.  See `Use cases`_.

:Q: Isn't fs-verity useless because the attacker can just replace a
    verity file with a non-verity one?
:A: See `Use cases`_.  In the initial use case, it's really trusted
    userspace code that authenticates the files; fs-verity is just a
    tool to do this job efficiently and securely.  The trusted
    userspace code will consider non-verity files to be inauthentic.

:Q: Why does the Merkle tree need to be stored on-disk?  Couldn't you
    store just the root hash?
:A: If the Merkle tree weren't stored on-disk, then you'd have to
    compute the entire tree when the file is first accessed, even if
    just one byte is being read.  This is a fundamental consequence of
    how Merkle tree hashing works.  To verify a leaf node, you need to
    verify the whole path to the root hash, including the root node
    (the thing which the root hash is a hash of).  But if the root
    node isn't stored on-disk, you have to compute it by hashing its
    children, and so on until you've actually hashed the entire file.

    That defeats most of the point of doing a Merkle tree-based hash,
    since if you have to hash the whole file ahead of time anyway,
    then you could simply do sha256(file) instead.  That would be much
    simpler, and a bit faster too.

    It's true that an in-memory Merkle tree could still provide the
    advantage of verification on every read rather than just on the
    first read.  However, it would be inefficient because every time a
    hash page gets evicted (you can't pin the entire Merkle tree into
    memory, since it may be very large), in order to restore it you
    again need to hash everything below it in the tree.  This again
    defeats most of the point of doing a Merkle tree-based hash, since
    a single block read could trigger re-hashing gigabytes of data.

:Q: But couldn't you store just the leaf nodes and compute the rest?
:A: See previous answer; this really just moves up one level, since
    one could alternatively interpret the data blocks as being the
    leaf nodes of the Merkle tree.  It's true that the tree can be
    computed much faster if the leaf level is stored rather than just
    the data, but that's only because each level is less than 1% the
    size of the level below (assuming the recommended settings of
    SHA-256 and 4K blocks).  For the exact same reason, by storing
    "just the leaf nodes" you'd already be storing over 99% of the
    tree, so you might as well simply store the whole tree.

:Q: Can the Merkle tree be built ahead of time, e.g. distributed as
    part of a package that is installed to many computers?
:A: This isn't currently supported.  It was part of the original
    design, but was removed to simplify the kernel UAPI and because it
    wasn't a critical use case.  Files are usually installed once and
    used many times, and cryptographic hashing is somewhat fast on
    most modern processors.

:Q: Why doesn't fs-verity support writes?
:A: Write support would be very difficult and would require a
    completely different design, so it's well outside the scope of
    fs-verity.  Write support would require:

    - A way to maintain consistency between the data and hashes,
      including all levels of hashes, since corruption after a crash
      (especially of potentially the entire file!) is unacceptable.
      The main options for solving this are data journalling,
      copy-on-write, and log-structured volume.  But it's very hard to
      retrofit existing filesystems with new consistency mechanisms.
      Data journalling is available on ext4, but is very slow.

    - Rebuilding the Merkle tree after every write, which would be
      extremely inefficient.  Alternatively, a different authenticated
      dictionary structure such as an "authenticated skiplist" could
      be used.  However, this would be far more complex.

    Compare it to dm-verity vs. dm-integrity.  dm-verity is very
    simple: the kernel just verifies read-only data against a
    read-only Merkle tree.  In contrast, dm-integrity supports writes
    but is slow, is much more complex, and doesn't actually support
    full-device authentication since it authenticates each sector
    independently, i.e. there is no "root hash".  It doesn't really
    make sense for the same device-mapper target to support these two
    very different cases; the same applies to fs-verity.

:Q: Since verity files are immutable, why isn't the immutable bit set?
:A: The existing "immutable" bit (FS_IMMUTABLE_FL) already has a
    specific set of semantics which not only make the file contents
    read-only, but also prevent the file from being deleted, renamed,
    linked to, or having its owner or mode changed.  These extra
    properties are unwanted for fs-verity, so reusing the immutable
    bit isn't appropriate.

:Q: Why does the API use ioctls instead of setxattr() and getxattr()?
:A: Abusing the xattr interface for basically arbitrary syscalls is
    heavily frowned upon by most of the Linux filesystem developers.
    An xattr should really just be an xattr on-disk, not an API to
    e.g. magically trigger construction of a Merkle tree.

:Q: Does fs-verity support remote filesystems?
:A: Only ext4 and f2fs support is implemented currently, but in
    principle any filesystem that can store per-file verity metadata
    can support fs-verity, regardless of whether it's local or remote.
    Some filesystems may have fewer options of where to store the
    verity metadata; one possibility is to store it past the end of
    the file and "hide" it from userspace by manipulating i_size.  The
    data verification functions provided by ``fs/verity/`` also assume
    that the filesystem uses the Linux pagecache, but both local and
    remote filesystems normally do so.

:Q: Why is anything filesystem-specific at all?  Shouldn't fs-verity
    be implemented entirely at the VFS level?
:A: There are many reasons why this is not possible or would be very
    difficult, including the following:

    - To prevent bypassing verification, pages must not be marked
      Uptodate until they've been verified.  Currently, each
      filesystem is responsible for marking pages Uptodate via
      ``->readpages()``.  Therefore, currently it's not possible for
      the VFS to do the verification on its own.  Changing this would
      require significant changes to the VFS and all filesystems.

    - It would require defining a filesystem-independent way to store
      the verity metadata.  Extended attributes don't work for this
      because (a) the Merkle tree may be gigabytes, but many
      filesystems assume that all xattrs fit into a single 4K
      filesystem block, and (b) ext4 and f2fs encryption doesn't
      encrypt xattrs, yet the Merkle tree *must* be encrypted when the
      file contents are, because it stores hashes of the plaintext
      file contents.

      So the verity metadata would have to be stored in an actual
      file.  Using a separate file would be very ugly, since the
      metadata is fundamentally part of the file to be protected, and
      it could cause problems where users could delete the real file
      but not the metadata file or vice versa.  On the other hand,
      having it be in the same file would break applications unless
      filesystems' notion of i_size were divorced from the VFS's,
      which would be complex and require changes to all filesystems.

    - It's desirable that FS_IOC_ENABLE_VERITY uses the filesystem's
      transaction mechanism so that either the file ends up with
      verity enabled, or no changes were made.  Allowing intermediate
      states to occur after a crash may cause problems.