==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption.  However, the detection of the
corruption is done at read time which could potentially be months
after the data was written.  At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to.  Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O.  The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order and,
for some protection schemes, that the I/O is written to the right
place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing.  But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path.
The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected.  This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device.  However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD).  We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector.  The data + integrity metadata is stored
in 520 byte sectors on disk.  Data + IMD are interleaved when
transferred between the controller and target.  The T13 proposal is
similar.

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read.
This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software.  Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads.  Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system.  Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa.  This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).

The IP checksum is weaker than the CRC in terms of detecting bit
errors.  However, the strength is really in the separation of the data
buffers and the integrity metadata.  These two distinct buffers must
match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.  As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device.  However, at the same time this is also the biggest
disadvantage.  It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing.  The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk.  Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application.  As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.
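
To make the "separate buffer attached to the I/O" model concrete, the
following userspace sketch mimics what section 2 says a controller does
with the two buffers: interleave 512-byte data sectors and 8-byte
integrity tuples into 520-byte sectors on write, and split them again
on read.  The function names and the fixed 512-byte sector size are
assumptions made for illustration; this is not kernel or HBA firmware
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u                        /* data bytes per sector */
#define TUPLE_SIZE  8u                          /* IMD bytes per sector */
#define WIRE_SIZE   (SECTOR_SIZE + TUPLE_SIZE)  /* 520 bytes as stored */

/* Hypothetical helper: merge separate data and IMD buffers (write path). */
static void sketch_interleave(uint8_t *wire, const uint8_t *data,
                              const uint8_t *imd, size_t nr_sectors)
{
    assert(wire && data && imd);
    for (size_t i = 0; i < nr_sectors; i++) {
        uint8_t *dst = wire + i * WIRE_SIZE;

        /* Each stored sector is the data followed by its tuple. */
        memcpy(dst, data + i * SECTOR_SIZE, SECTOR_SIZE);
        memcpy(dst + SECTOR_SIZE, imd + i * TUPLE_SIZE, TUPLE_SIZE);
    }
}

/* Hypothetical helper: split 520-byte sectors back apart (read path). */
static void sketch_split(uint8_t *data, uint8_t *imd,
                         const uint8_t *wire, size_t nr_sectors)
{
    assert(wire && data && imd);
    for (size_t i = 0; i < nr_sectors; i++) {
        const uint8_t *src = wire + i * WIRE_SIZE;

        memcpy(data + i * SECTOR_SIZE, src, SECTOR_SIZE);
        memcpy(imd + i * TUPLE_SIZE, src + SECTOR_SIZE, TUPLE_SIZE);
    }
}
```

Because the data buffer keeps its plain 512-byte stride on the host
side, nothing above the controller needs to know about the 520-byte
format; the host only sees two ordinary buffers.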

The current implementation allows the block layer to automatically
generate the protection information for any I/O.  Eventually the
intent is to move the integrity metadata calculation to userspace for
user data.  Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value.  The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases.  The filesystem can use
this extra space to tag sectors as it sees fit.  Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving.  This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space.  A passthrough
interface for this is being worked on.


4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled.  bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.)

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio).  This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity).  This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices.  blk_integrity_compare() can help with that.  DM
and MD linear, RAID0 and RAID1 are currently supported.  RAID4/5/6
will require extra work due to the application tag.
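
The idea of picking one profile for all subdevices can be sketched in
userspace as follows.  The struct and both functions are hypothetical
stand-ins: the struct carries only the name, tuple size and tag size
fields mentioned in this document, and the comparison is an assumption
about what "appropriate for all subdevices" means, not the logic of
blk_integrity_compare() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a block integrity profile. */
struct sketch_profile {
    const char *name;         /* e.g. "T10-DIF-TYPE1-IP" */
    unsigned int tuple_size;  /* IMD bytes per sector */
    unsigned int tag_size;    /* tag bytes per hardware sector */
};

/* Return 0 if two members can share a profile, -1 otherwise. */
static int sketch_profile_compare(const struct sketch_profile *a,
                                  const struct sketch_profile *b)
{
    if (a == NULL && b == NULL)
        return 0;             /* neither member exchanges IMD */
    if (a == NULL || b == NULL)
        return -1;            /* only one member exchanges IMD */
    if (strcmp(a->name, b->name) != 0)
        return -1;
    if (a->tuple_size != b->tuple_size || a->tag_size != b->tag_size)
        return -1;
    return 0;
}

/*
 * Return the profile shared by all members of a layered device, or
 * NULL if the members disagree (or there are no members).
 */
static const struct sketch_profile *
sketch_pick_profile(const struct sketch_profile *const *members, size_t nr)
{
    if (nr == 0)
        return NULL;
    for (size_t i = 1; i < nr; i++) {
        if (sketch_profile_compare(members[0], members[i]) != 0)
            return NULL;
    }
    return members[0];
}
```

In this sketch a layered device whose members disagree simply ends up
without a profile, which corresponds to the profile being optional.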


5.0 Block Layer Integrity API
=============================

5.1 Normal Filesystem
---------------------

    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata.  The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE.  A READ request will cause the I/O integrity
    to be verified upon completion.

    IMD generation and verification can be toggled using the::

      /sys/block/<bdev>/integrity/write_generate

    and::

      /sys/block/<bdev>/integrity/read_verify

    flags.


5.2 Integrity-Aware Filesystem
------------------------------

    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached.  It can also use the application tag space if this is
    supported by the block device.


    `bool bio_integrity_prep(bio);`

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added.
      It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
      Complete bio with error if prepare failed for some reason.


5.3 Passing Existing Integrity Metadata
---------------------------------------

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:


    `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicates how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).

      The integrity payload will be freed at bio_free() time.


    `int bio_integrity_add_page(bio, page, len, offset);`

      Attaches a page containing integrity metadata to an existing
      bio.  The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called.  For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack.  This implies that the pages added using this call
      will be modified during I/O!  The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.
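
The reference tag seeding and remapping described above can be
illustrated with a small userspace sketch.  It assumes the T10 DIF
tuple layout (2-byte guard, 2-byte application tag, 4-byte big-endian
reference tag), which this document does not spell out, and the helper
names are hypothetical.  It is not the remapping code used by the
kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE     8u  /* bytes of IMD per sector */
#define REF_TAG_OFFSET 4u  /* assumed: guard (2) + app tag (2) come first */

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Fill in incrementing reference tags; the first one equals 'seed'. */
static void sketch_seed_ref_tags(uint8_t *imd, size_t nr_tuples,
                                 uint64_t seed)
{
    assert(imd != NULL);
    for (size_t i = 0; i < nr_tuples; i++)
        put_be32(imd + i * TUPLE_SIZE + REF_TAG_OFFSET,
                 (uint32_t)(seed + i));
}

/*
 * Rewrite reference tags from the submitter's view (virt_start) to the
 * device's view (phys_start).  Tags that do not carry the expected
 * value are left alone.  Returns the number of tuples rewritten.
 * Note that the caller's buffer is modified in place.
 */
static size_t sketch_remap_ref_tags(uint8_t *imd, size_t nr_tuples,
                                    uint64_t virt_start,
                                    uint64_t phys_start)
{
    size_t changed = 0;

    assert(imd != NULL);
    for (size_t i = 0; i < nr_tuples; i++) {
        uint8_t *ref = imd + i * TUPLE_SIZE + REF_TAG_OFFSET;

        if (get_be32(ref) == (uint32_t)(virt_start + i)) {
            put_be32(ref, (uint32_t)(phys_start + i));
            changed++;
        }
    }
    return changed;
}
```

The in-place rewrite is the reason the pages handed to
bio_integrity_add_page() cannot be treated as read-only by the
submitter.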

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.


5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:

    `int blk_integrity_register(gendisk, blk_integrity);`

      The blk_integrity struct is a template and should contain the
      following::

        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs.  This is
      part of the userland API so choose it carefully and never change
      it.  The format is standards body-type-variant.
      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for WRITE).
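
As a rough userspace illustration of what a generate callback and its
verifying counterpart compute for a CRC-based DIF profile, the sketch
below fills one 8-byte tuple per 512-byte sector.  It assumes the T10
DIF guard CRC (polynomial 0x8BB7, initial value 0) and the tuple layout
guard / application tag / reference tag, none of which is defined in
this document.  The function names and signatures are hypothetical and
do not match the kernel callback prototypes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u
#define TUPLE_SIZE  8u

/* Bitwise CRC-16, polynomial 0x8BB7, init 0, no reflection. */
static uint16_t sketch_crc_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Write guard, zero application tag and reference tag for each sector. */
static void sketch_generate(const uint8_t *data, uint8_t *imd,
                            size_t nr_sectors, uint64_t seed)
{
    assert(data != NULL && imd != NULL);
    for (size_t i = 0; i < nr_sectors; i++) {
        uint16_t guard = sketch_crc_t10dif(data + i * SECTOR_SIZE,
                                           SECTOR_SIZE);
        uint32_t ref = (uint32_t)(seed + i);
        uint8_t *t = imd + i * TUPLE_SIZE;

        t[0] = (uint8_t)(guard >> 8);   /* guard tag, big-endian */
        t[1] = (uint8_t)guard;
        t[2] = 0;                       /* application tag unused here */
        t[3] = 0;
        t[4] = (uint8_t)(ref >> 24);    /* reference tag, big-endian */
        t[5] = (uint8_t)(ref >> 16);
        t[6] = (uint8_t)(ref >> 8);
        t[7] = (uint8_t)ref;
    }
}

/* Return 0 if every sector matches its tuple, -1 on the first mismatch. */
static int sketch_verify(const uint8_t *data, const uint8_t *imd,
                         size_t nr_sectors, uint64_t seed)
{
    assert(data != NULL && imd != NULL);
    for (size_t i = 0; i < nr_sectors; i++) {
        const uint8_t *t = imd + i * TUPLE_SIZE;
        uint16_t guard = (uint16_t)((t[0] << 8) | t[1]);
        uint32_t ref = ((uint32_t)t[4] << 24) | ((uint32_t)t[5] << 16) |
                       ((uint32_t)t[6] << 8) | (uint32_t)t[7];

        if (guard != sketch_crc_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE))
            return -1;                  /* data does not match guard */
        if (ref != (uint32_t)(seed + i))
            return -1;                  /* sector is out of place */
    }
    return 0;
}
```

A table-driven or hardware-assisted CRC would be used where speed
matters; the bitwise loop is chosen here only because it is short
enough to check by eye.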

      'verify_fn' verifies that the data buffer matches the integrity
      metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector.  I.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector.  For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.

----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>