==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption.  However, the detection of the
corruption is done at read time which could potentially be months
after the data was written.  At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to.  Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O.  The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order.  Some
protection schemes also ensure that the I/O is written to the right
place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing.  But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path.  The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected.  This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device.  However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD).  We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector.  The data + integrity metadata is stored
in 520 byte sectors on disk.  Data + IMD are interleaved when
transferred between the controller and target.  The T13 proposal is
similar.
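
The 520-byte layout described above can be sketched in user space as
follows.  This is a minimal illustration, not kernel code: the field
layout (2-byte guard tag, 2-byte application tag, 4-byte reference
tag) follows the T10 protection information format, while the names
dif_tuple, dif_interleave() and dif_split() are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u   /* data bytes per sector */
#define PI_SIZE       8u   /* protection information bytes per sector */

/* Illustrative 8-byte tuple; on the wire the fields are big-endian. */
struct dif_tuple {
    uint16_t guard_tag;  /* checksum of the sector's data */
    uint16_t app_tag;    /* opaque tag owned by the block device owner */
    uint32_t ref_tag;    /* incrementing counter used to detect misordering */
};

/* Build 520-byte sectors: 512 data bytes followed by 8 bytes of PI. */
static void dif_interleave(const uint8_t *data, const uint8_t *pi,
                           uint8_t *out, size_t nsectors)
{
    for (size_t i = 0; i < nsectors; i++) {
        uint8_t *dst = out + i * (SECTOR_SIZE + PI_SIZE);

        memcpy(dst, data + i * SECTOR_SIZE, SECTOR_SIZE);
        memcpy(dst + SECTOR_SIZE, pi + i * PI_SIZE, PI_SIZE);
    }
}

/* Inverse operation: split 520-byte sectors into separate buffers. */
static void dif_split(const uint8_t *in, uint8_t *data, uint8_t *pi,
                      size_t nsectors)
{
    for (size_t i = 0; i < nsectors; i++) {
        const uint8_t *src = in + i * (SECTOR_SIZE + PI_SIZE);

        memcpy(data + i * SECTOR_SIZE, src, SECTOR_SIZE);
        memcpy(pi + i * PI_SIZE, src + SECTOR_SIZE, PI_SIZE);
    }
}
```

Keeping the data and the protection information in two separate
buffers on the host side means the data buffer stays a multiple of
512 bytes; only the interleaved form has the awkward 520-byte stride.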

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read.  This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software.  Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads.  Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system.  Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa.  This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
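
To make the cost difference concrete, here is a minimal user-space
sketch of both checksums.  It assumes the commonly published
parameters for the T10 DIF CRC (polynomial 0x8BB7, initial value 0, no
bit reflection) and the Internet checksum as described in RFC 1071.
The function names are hypothetical and the bitwise CRC loop is chosen
for clarity, not speed.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16 with polynomial 0x8BB7, initial value 0, no bit reflection. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Internet checksum: one's-complement sum of big-endian 16-bit words. */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (i < len)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)(buf[i] << 8);
    while (sum >> 16)                  /* fold carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The CRC needs eight shift/XOR steps per byte here (or a table lookup
in an optimized version), whereas the IP checksum is a plain sum over
16-bit words, which is why it is so much cheaper in software.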

The IP checksum is weaker than the CRC in terms of detecting bit
errors.  However, the strength is really in the separation of the data
buffers and the integrity metadata.  These two distinct buffers must
match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.  As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage of the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device.  However, at the same time this is also the biggest
disadvantage.  It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing.  The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk.  Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware of
whether it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application.  As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O.  Eventually the
intent is to move the integrity metadata calculation to userspace for
user data.  Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value.  The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases.  The filesystem can use
this extra space to tag sectors as it sees fit.  Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving.  This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
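
The interleaving arithmetic can be illustrated with a small user-space
sketch.  It assumes 512-byte sectors, 8-byte tuples and a 2-byte
application tag located after the 2-byte guard tag; the helper names
tag_scatter() and tag_gather() are hypothetical and do not correspond
to a kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE 8u  /* bytes of integrity metadata per sector */
#define TAG_OFFSET 2u  /* application tag follows the 2-byte guard tag */
#define TAG_SIZE   2u  /* tag bytes available per sector */

/* Spread a tag buffer over the application tags of consecutive tuples. */
static void tag_scatter(uint8_t *tuples, size_t nsectors, const uint8_t *tag)
{
    for (size_t i = 0; i < nsectors; i++)
        for (size_t b = 0; b < TAG_SIZE; b++)
            tuples[i * TUPLE_SIZE + TAG_OFFSET + b] = tag[i * TAG_SIZE + b];
}

/* Collect the application tags of consecutive tuples into one buffer. */
static void tag_gather(const uint8_t *tuples, size_t nsectors, uint8_t *tag)
{
    for (size_t i = 0; i < nsectors; i++)
        for (size_t b = 0; b < TAG_SIZE; b++)
            tag[i * TAG_SIZE + b] = tuples[i * TUPLE_SIZE + TAG_OFFSET + b];
}
```

A 4KB block spans 4096 / 512 = 8 sectors, so calling these helpers
with nsectors = 8 moves 8 * 2 = 16 bytes, i.e. the 8*16 bits mentioned
above.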

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space.  A passthrough
interface for this is being worked on.


4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled.  bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.).

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc().  This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity).  This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices.  blk_integrity_compare() can help with that.  DM
and MD linear, RAID0 and RAID1 are currently supported.  RAID4/5/6
will require extra work due to the application tag.


5.0 Block Layer Integrity API
=============================

5.1 Normal Filesystem
---------------------

    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata.  The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE.  A READ request will cause the I/O integrity
    to be verified upon completion.

    IMD generation and verification can be toggled using the::

      /sys/block/<bdev>/integrity/write_generate

    and::

      /sys/block/<bdev>/integrity/read_verify

    flags.
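
As a rough user-space illustration of inspecting these flags, the
sketch below builds the sysfs path for a given device and reads the
attribute as an integer.  The helper names are hypothetical, the
device name passed in (for example "sda") is only a placeholder, and
the code assumes the attribute holds a single decimal number.

```c
#include <stdio.h>

/* Build "/sys/block/<bdev>/integrity/<flag>".
 * Returns 0 on success, -1 if the path does not fit in the buffer. */
static int integrity_flag_path(char *buf, size_t size,
                               const char *bdev, const char *flag)
{
    int n = snprintf(buf, size, "/sys/block/%s/integrity/%s", bdev, flag);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Read the flag value; returns -1 if the attribute is missing,
 * unreadable, or does not contain a number. */
static int integrity_flag_read(const char *bdev, const char *flag)
{
    char path[256];
    int value = -1;
    FILE *f;

    if (integrity_flag_path(path, sizeof(path), bdev, flag) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A caller might use integrity_flag_read("sda", "read_verify") and treat
a return value of -1 as "no integrity profile exposed for this device".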


5.2 Integrity-Aware Filesystem
------------------------------

    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached.  It can also use the application tag space if this is
    supported by the block device.


    `bool bio_integrity_prep(bio);`

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added.  It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
      The bio is completed with an error if preparation fails for some
      reason.


5.3 Passing Existing Integrity Metadata
---------------------------------------

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:


    `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicates how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).

      The integrity payload will be freed at bio_free() time.
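
To get a feel for how much protection data an I/O carries, the
following user-space sketch computes the integrity byte count and a
page count for a given transfer.  It assumes one tuple per sector, a
4096-byte page, and that the protection data is packed into full
pages; the helper names are hypothetical and this is only an estimate
of a suitable nr_pages value, not the kernel's own calculation.

```c
#include <stddef.h>

#define PAGE_BYTES 4096u  /* assumed page size */

/* One tuple of tuple_size bytes per sector of data. */
static size_t integrity_bytes(size_t data_len, size_t sector_size,
                              size_t tuple_size)
{
    return (data_len / sector_size) * tuple_size;
}

/* Number of full or partial pages needed to hold the tuples. */
static size_t integrity_pages(size_t data_len, size_t sector_size,
                              size_t tuple_size)
{
    size_t bytes = integrity_bytes(data_len, sector_size, tuple_size);

    return (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
}
```

For example, a 1 MiB write to a device with 512-byte sectors and
8-byte tuples covers 2048 sectors and therefore carries 16384 bytes of
protection data, which fits in four 4096-byte pages.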


    `int bio_integrity_add_page(bio, page, len, offset);`

      Attaches a page containing integrity metadata to an existing
      bio.  The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called.  For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack.  This implies that the pages added using this call
      will be modified during I/O!  The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.
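
The reference tag remapping mentioned above is the reason the attached
pages get modified during I/O.  The user-space sketch below
illustrates the idea under stated assumptions: 8-byte tuples with a
big-endian 32-bit reference tag at offset 4, tags seeded with an
incrementing counter, and a remap step that rewrites each tag from the
submitter's numbering to the device's numbering.  All names are
hypothetical; this is not the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE 8u  /* bytes of integrity metadata per sector */
#define REF_OFFSET 4u  /* ref tag follows 2-byte guard and 2-byte app tag */

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Seed the reference tags with a counter starting at 'seed'. */
static void ref_tag_seed(uint8_t *tuples, size_t nsectors, uint32_t seed)
{
    for (size_t i = 0; i < nsectors; i++)
        put_be32(tuples + i * TUPLE_SIZE + REF_OFFSET, seed + (uint32_t)i);
}

/* Rewrite tags from the submitter's numbering (virt) to the device's
 * numbering (phys); tuples that do not match are left untouched. */
static void ref_tag_remap(uint8_t *tuples, size_t nsectors,
                          uint32_t virt, uint32_t phys)
{
    for (size_t i = 0; i < nsectors; i++, virt++, phys++) {
        uint8_t *ref = tuples + i * TUPLE_SIZE + REF_OFFSET;

        if (get_be32(ref) == virt)
            put_be32(ref, phys);
    }
}
```

Seeding with the value the submitter placed in bip->bip_sector and
remapping to the final device sector mirrors the requirement that the
first reference tag match the start of the I/O as seen at each layer.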


5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:

    `int blk_integrity_register(gendisk, blk_integrity);`

      The blk_integrity struct is a template and should contain the
      following::

        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs.  This is
      part of the userland API so choose it carefully and never change
      it.  The format is standards body-type-variant-checksum.
      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for WRITE).

      'verify_fn' verifies that the data buffer matches the integrity
      metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector.  I.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector.  For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.
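
What a generate/verify pair has to do can be sketched in user space as
follows.  This is only an illustration under stated assumptions:
512-byte sectors, 8-byte tuples laid out as guard tag, application tag
and reference tag (all big-endian), and a CRC guard with polynomial
0x8BB7.  The names sketch_generate() and sketch_verify() are
hypothetical and their signatures differ from the kernel callbacks,
which operate on kernel-internal structures.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u
#define TUPLE_SIZE    8u

/* CRC-16 with polynomial 0x8BB7, initial value 0, no bit reflection. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Fill one tuple per sector: guard = CRC of the data, application
 * tag = 0, reference tag = seed + sector index. */
static void sketch_generate(const uint8_t *data, uint8_t *prot,
                            size_t nsectors, uint32_t seed)
{
    for (size_t i = 0; i < nsectors; i++) {
        uint16_t guard = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
        uint32_t ref = seed + (uint32_t)i;
        uint8_t *t = prot + i * TUPLE_SIZE;

        t[0] = (uint8_t)(guard >> 8);
        t[1] = (uint8_t)guard;
        t[2] = 0;
        t[3] = 0;
        t[4] = (uint8_t)(ref >> 24);
        t[5] = (uint8_t)(ref >> 16);
        t[6] = (uint8_t)(ref >> 8);
        t[7] = (uint8_t)ref;
    }
}

/* Returns 0 if every sector matches its tuple, -1 on the first
 * guard or reference tag mismatch. */
static int sketch_verify(const uint8_t *data, const uint8_t *prot,
                         size_t nsectors, uint32_t seed)
{
    for (size_t i = 0; i < nsectors; i++) {
        const uint8_t *t = prot + i * TUPLE_SIZE;
        uint16_t guard = (uint16_t)((t[0] << 8) | t[1]);
        uint32_t ref = ((uint32_t)t[4] << 24) | ((uint32_t)t[5] << 16) |
                       ((uint32_t)t[6] << 8) | (uint32_t)t[7];

        if (guard != crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE))
            return -1;
        if (ref != seed + (uint32_t)i)
            return -1;
    }
    return 0;
}
```

With this layout tuple_size would be 8, and tag_size would be 2 when
the application tag bytes are available to the owner of the device.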

----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>