=====================================================
Notes on the Generic Block Layer Rewrite in Linux 2.5
=====================================================

.. note::

   It seems that there is a lot of outdated material here. This seems
   to be written somewhat as a task list. Yet, eventually, something
   here might still be useful.

Notes Written on Jan 15, 2002:

   - Jens Axboe <jens.axboe@oracle.com>
   - Suparna Bhattacharya <suparna@in.ibm.com>

Last Updated May 2, 2002

September 2003: Updated I/O Scheduler portions
   - Nick Piggin <npiggin@kernel.dk>

Introduction
============

These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite. The idea is to bring out some of the key
changes and a glimpse of the rationale behind those changes.

Please mail corrections & suggestions to suparna@in.ibm.com.

Credits
=======

2.5 bio rewrite:
   - Jens Axboe <jens.axboe@oracle.com>

Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.

The following people helped with review comments and inputs for this
document:

   - Christoph Hellwig <hch@infradead.org>
   - Arjan van de Ven <arjanv@redhat.com>
   - Randy Dunlap <rdunlap@xenotime.net>
   - Andre Hedrick <andre@linux-ide.org>

The following people helped with fixes/contributions to the bio patches
while they were still work in progress:

   - David S. Miller <davem@redhat.com>


.. Description of Contents:

   1. Scope for tuning of logic to various needs
      1.1 Tuning based on device or low level driver capabilities
          - Per-queue parameters
          - Highmem I/O support
          - I/O scheduler modularization
      1.2 Tuning based on high level requirements/capabilities
          1.2.1 Request Priority/Latency
      1.3 Direct access/bypass to lower layers for diagnostics and special
          device operations
          1.3.1 Pre-built commands
   2. New flexible and generic but minimalist i/o structure or descriptor
      (instead of using buffer heads at the i/o layer)
      2.1 Requirements/Goals addressed
      2.2 The bio struct in detail (multi-page io unit)
      2.3 Changes in the request structure
   3. Using bios
      3.1 Setup/teardown (allocation, splitting)
      3.2 Generic bio helper routines
          3.2.1 Traversing segments and completion units in a request
          3.2.2 Setting up DMA scatterlists
          3.2.3 I/O completion
          3.2.4 Implications for drivers that do not interpret bios (don't handle
                multiple segments)
      3.3 I/O submission
   4. The I/O scheduler
   5. Scalability related changes
      5.1 Granular locking: Removal of io_request_lock
      5.2 Prepare for transition to 64 bit sector_t
   6. Other Changes/Implications
      6.1 Partition re-mapping handled by the generic block layer
   7. A few tips on migration of older drivers
   8. A list of prior/related/impacted patches/ideas
   9. Other References/Discussion Threads


Bio Notes
=========

Let us discuss the changes in the context of how some overall goals for the
block layer are addressed.

1. Scope for tuning the generic logic to satisfy various requirements
=====================================================================

The block layer design supports adaptable abstractions to handle common
processing with the ability to tune the logic to an appropriate extent
depending on the nature of the device and the requirements of the caller.
One of the objectives of the rewrite was to increase the degree of tunability
and to enable higher level code to utilize underlying device/driver
capabilities to the maximum extent for better i/o performance. This is
important especially in the light of ever improving hardware capabilities
and application/middleware software designed to take advantage of these
capabilities.

1.1 Tuning based on low level device / driver capabilities
----------------------------------------------------------

Sophisticated devices with large built-in caches, intelligent i/o scheduling
optimizations, high memory DMA support, etc. may find some of the
generic processing an overhead, while for less capable devices the
generic functionality is essential for performance or correctness reasons.
Knowledge of some of the capabilities or parameters of the device should be
used at the generic block layer to make the right decisions on
behalf of the driver.

How is this achieved?

Tuning at a per-queue level:

i. Per-queue limits/values exported to the generic layer by the driver

Various parameters that the generic i/o scheduler logic uses are set at
a per-queue level (e.g. maximum request size, maximum number of segments in
a scatter-gather list, logical block size).

Some parameters that were earlier available as global arrays indexed by
major/minor are now directly associated with the queue. Some of these may
move into the block device structure in the future. Some characteristics
have been incorporated into a queue flags field rather than separate fields
in themselves. There are blk_queue_xxx functions to set the parameters,
rather than updating the fields directly.

Some new queue property settings:

   blk_queue_bounce_limit(q, u64 dma_address)
      Enable I/O to highmem pages, dma_address being the
      limit. No highmem default.

   blk_queue_max_sectors(q, max_sectors)
      Sets two variables that limit the size of the request.

      - The request queue's max_sectors, which is a soft size in
        units of 512 byte sectors, and could be dynamically varied
        by the core kernel.

      - The request queue's max_hw_sectors, which is a hard limit
        and reflects the maximum size request a driver can handle
        in units of 512 byte sectors.

      The default for both max_sectors and max_hw_sectors is
      255. The upper limit of max_sectors is 1024.

   blk_queue_max_phys_segments(q, max_segments)
      Maximum physical segments you can handle in a request. 128
      default (driver limit). (See 3.2.2)

   blk_queue_max_hw_segments(q, max_segments)
      Maximum dma segments the hardware can handle in a request. 128
      default (host adapter limit, after dma remapping).
      (See 3.2.2)

   blk_queue_max_segment_size(q, max_seg_size)
      Maximum size of a clustered segment, 64kB default.

   blk_queue_logical_block_size(q, logical_block_size)
      Lowest possible sector size that the hardware can operate
      on, 512 bytes default.

New queue flags:

   - QUEUE_FLAG_CLUSTER (see 3.2.2)
   - QUEUE_FLAG_QUEUED (see 3.2.4)


ii. High-mem i/o capabilities are now considered the default

The generic bounce buffer logic, present in 2.4, where the block layer would
by default copyin/out i/o requests on high-memory buffers to low-memory buffers
assuming that the driver wouldn't be able to handle it directly, has been
changed in 2.5. The bounce logic is now applied only for memory ranges
for which the device cannot handle i/o. A driver can specify this by
setting the queue bounce limit for the request queue for the device
(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
where a device is capable of handling high memory i/o.
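
The bounce rule described above can be pictured with a small userspace model.
This is only a sketch under stated assumptions, not kernel code: the
``toy_queue`` structure, the ``toy_`` helpers and the 4kB page size are
hypothetical names and values chosen for illustration, and the real block
layer keeps considerably more state than a single limit::

    #include <stdbool.h>
    #include <stdint.h>

    #define TOY_PAGE_SHIFT 12   /* assumption: 4kB pages */

    /* Hypothetical stand-in for a request queue; only the limit is modelled. */
    struct toy_queue {
        uint64_t bounce_pfn;    /* highest page frame the device can address */
    };

    /* Models the effect of blk_queue_bounce_limit(q, dma_address). */
    static void toy_set_bounce_limit(struct toy_queue *q, uint64_t dma_address)
    {
        q->bounce_pfn = dma_address >> TOY_PAGE_SHIFT;
    }

    /* A page is bounced only if it lies above what the device can reach. */
    static bool toy_needs_bounce(const struct toy_queue *q, uint64_t page_phys_addr)
    {
        return (page_phys_addr >> TOY_PAGE_SHIFT) > q->bounce_pfn;
    }

With a limit of 0xffffffff (a device restricted to 32 bit addressing), a page
at physical address 0xfffff000 is handled directly in this model, while a page
at 0x100000000 would be bounced.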

In order to enable high-memory i/o where the device is capable of supporting
it, the pci dma mapping routines and associated data structures have now been
modified to accomplish a direct page -> bus translation, without requiring
a virtual address mapping (unlike the earlier scheme of virtual address
-> bus translation). So this works uniformly for high-memory pages (which
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.

Note: Please refer to :doc:`/core-api/dma-api-howto` for a discussion
on PCI high mem DMA aspects and mapping of scatter gather lists, and support
for 64 bit PCI.

Special handling is required only for cases where i/o needs to happen on
pages at physical memory addresses beyond what the device can support. In these
cases, a bounce bio representing a buffer from the supported memory range
is used for performing the i/o with copyin/copyout as needed depending on
the type of the operation. For example, in case of a read operation, the
data read has to be copied to the original buffer on i/o completion, so a
callback routine is set up to do this, while for write, the data is copied
from the original buffer to the bounce buffer prior to issuing the
operation. Since an original buffer may be in a high memory area that's not
mapped in kernel virtual addr, a kmap operation may be required for
performing the copy, and special care may be needed in the completion path
as it may not be in irq context.
Special care is also required (by way of
GFP flags) when allocating bounce buffers, to avoid certain highmem
deadlock possibilities.

It is also possible that a bounce buffer may be allocated from high-memory
area that's not mapped in kernel virtual addr, but within the range that the
device can use directly; so the bounce page may need to be kmapped during
copy operations. [Note: This does not hold in the current implementation,
though]

There are some situations when pages from high memory may need to
be kmapped, even if bounce buffers are not necessary. For example a device
may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions as appropriate. A driver could also use
the blk_queue_bounce() routine on its own to bounce highmem i/o to low
memory for specific requests if so desired.

iii. The i/o scheduler algorithm itself can be replaced/set as appropriate

As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
queue or pick from (copy) existing generic schedulers and replace/override
certain portions of it. The 2.5 rewrite provides improved modularization
of the i/o scheduler.
There are more pluggable callbacks, e.g. for init,
add request, extract request, which makes it possible to abstract specific
i/o scheduling algorithm aspects and details outside of the generic loop.
It also makes it possible to completely hide the implementation details of
the i/o scheduler from block drivers.

I/O scheduler wrappers are to be used instead of accessing the queue directly.
See section 4, "The I/O scheduler", for details.

1.2 Tuning Based on High level code capabilities
------------------------------------------------

i. Application capabilities for raw i/o

This comes from some of the high-performance database/middleware
requirements where an application prefers to make its own i/o scheduling
decisions based on an understanding of the access patterns and i/o
characteristics.

ii. High performance filesystems or other higher level kernel code's
capabilities

Kernel components like filesystems could also take their own i/o scheduling
decisions for optimizing performance. Journalling filesystems may need
some control over i/o ordering.

What kind of support exists at the generic block layer for this?

The flags and rw fields in the bio structure can be used for some tuning
from above, e.g. indicating that an i/o is just a readahead request, or priority
settings (currently unused).
As far as user applications are concerned, they
would need an additional mechanism either via open flags or ioctls, or some
other upper level mechanism to communicate such settings to block.

1.2.1 Request Priority/Latency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Todo/Under discussion::

  Arjan's proposed request priority scheme allows higher levels some broad
  control (high/med/low) over the priority of an i/o request vs other pending
  requests in the queue. For example it allows reads for bringing in an
  executable page on demand to be given a higher priority over pending write
  requests which haven't aged too much on the queue. Potentially this priority
  could even be exposed to applications in some manner, providing higher level
  tunability. Time based aging avoids starvation of lower priority
  requests. Some bits in the bi_opf flags field in the bio structure are
  intended to be used for this priority information.


1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
-----------------------------------------------------------------------

(e.g. Diagnostics, Systems Management)

There are situations where high-level code needs to have direct access to
the low level device capabilities or requires the ability to issue commands
to the device bypassing some of the intermediate i/o layers.
These could, for example, be special control commands issued through ioctl
interfaces, or could be raw read/write commands that stress the drive's
capabilities for certain kinds of fitness tests. Having direct interfaces at
multiple levels without having to pass through upper layers makes
it possible to perform bottom up validation of the i/o path, layer by
layer, starting from the media.

The normal i/o submission interfaces, e.g. submit_bio, could be bypassed
for specially crafted requests which such ioctl or diagnostics
interfaces would typically use, and the elevator add_request routine
can instead be used to directly insert such requests in the queue or preferably
the blk_do_rq routine can be used to place the request on the queue and
wait for completion. Alternatively, sometimes the caller might just
invoke a lower level driver specific interface with the request as a
parameter.

If the request is a means for passing on special information associated with
the command, then such information is associated with the request->special
field (rather than misusing the request->buffer field which is meant for the
request data buffer's virtual mapping).

For passing request data, the caller must build up a bio descriptor
representing the concerned memory buffer if the underlying driver interprets
bio segments or uses the block layer end*request* functions for i/o
completion.
Alternatively one could directly use the request->buffer field to
specify the virtual address of the buffer, if the driver expects buffer
addresses passed in this way and ignores bio entries for the request type
involved. In the latter case, the driver would modify and manage the
request->buffer, request->sector and request->nr_sectors or
request->current_nr_sectors fields itself rather than using the block layer
end_request or end_that_request_first completion interfaces.
(See 2.3 or Documentation/block/request.rst for a brief explanation of
the request structure fields.)

::

  [TBD: end_that_request_last should be usable even in this case;
  Perhaps an end_that_direct_request_first routine could be implemented to make
  handling direct requests easier for such drivers; Also for drivers that
  expect bios, a helper function could be provided for setting up a bio
  corresponding to a data buffer]

  <JENS: I dont understand the above, why is end_that_request_first() not
  usable? Or _last for that matter. I must be missing something>

  <SUP: What I meant here was that if the request doesn't have a bio, then
  end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
  and hence can't be used for advancing request state settings on the
  completion of partial transfers. The driver has to modify these fields
  directly by hand.
  This is because end_that_request_first only iterates over the bio list,
  and always returns 0 if there are none associated with the request.
  _last works OK in this case, and is not a problem, as I mentioned earlier
  >

1.3.1 Pre-built Commands
^^^^^^^^^^^^^^^^^^^^^^^^

A request can be created with a pre-built custom command to be sent directly
to the device. The cmd block in the request structure has room for filling
in the command bytes. (i.e. rq->cmd is now 16 bytes in size, and meant for
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)

The request structure flags can be set up to indicate the type of request
in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
packet command issued via blk_do_rq, REQ_SPECIAL: special request).

It can help to pre-build device commands for requests in advance.
Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This routine is
called by elv_next_request(), i.e. typically just before servicing a request.
(The prepare function would not be called for requests that have RQF_DONTPREP
enabled)

Aside:
  Pre-building could possibly even be done early, i.e. before placing the
  request on the queue, rather than constructing the command on the fly in the
  driver while servicing the request queue when it may affect latencies in
  interrupt context or responsiveness in general. One way to add early
  pre-building would be to do it whenever we fail to merge on a request.
  Now REQ_NOMERGE is set in the request flags to skip this one in the future,
  which means that it will not change before we feed it to the device. So
  the pre-builder hook can be invoked there.


2. Flexible and generic but minimalist i/o structure/descriptor
===============================================================

2.1 Reason for a new structure and requirements addressed
---------------------------------------------------------

Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
buffer heads for a contiguous i/o request. This led to certain inefficiencies
when it came to large i/o requests and readv/writev style operations, as it
forced such requests to be broken up into small chunks before being passed
on to the generic block layer, only to be merged by the i/o scheduler
when the underlying device was capable of handling the i/o in one shot.
Also, using the buffer head as an i/o structure for i/os that didn't originate
from the buffer cache unnecessarily added to the weight of the descriptors
which were generated for each such chunk.

The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.

1.  Should be appropriate as a descriptor for both raw and buffered i/o -
    avoid cache related fields which are irrelevant in the direct/page i/o path,
    or filesystem block size alignment restrictions which may not be relevant
    for raw i/o.
2.  Ability to represent high-memory buffers (which do not have a virtual
    address mapping in kernel address space).
3.  Ability to represent large i/os without unnecessarily breaking them up
    (i.e. greater than PAGE_SIZE chunks in one shot).
4.  At the same time, ability to retain independent identity of i/os from
    different sources or i/o units requiring individual completion (e.g. for
    latency reasons).
5.  Ability to represent an i/o involving multiple physical memory segments
    (including non-page aligned page fragments, as specified via readv/writev)
    without unnecessarily breaking it up, if the underlying device is capable of
    handling it.
6.  Preferably should be based on a memory descriptor structure that can be
    passed around different types of subsystems or layers, maybe even
    networking, without duplication or extra copies of data/descriptor fields
    themselves in the process.
7.  Ability to handle the possibility of splits/merges as the structure passes
    through layered drivers (lvm, md, evms), with minimal overhead.

The solution was to define a new structure (bio) for the block layer,
instead of using the buffer head structure (bh) directly, the idea being
avoidance of some associated baggage and limitations. The bio structure
is uniformly used for all i/o at the block layer; it forms a part of the
bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
mapped to bio structures.

2.2 The bio struct
------------------

The bio structure uses a vector representation pointing to an array of tuples
of <page, offset, len> to describe the i/o buffer, and has various other
fields describing i/o parameters and state that needs to be maintained for
performing the i/o.

Notice that this representation means that a bio has no virtual address
mapping at all (unlike buffer heads).

::

  struct bio_vec {
         struct page     *bv_page;
         unsigned short  bv_len;
         unsigned short  bv_offset;
  };

  /*
   * main unit of I/O for the block layer and lower layers (ie drivers)
   */
  struct bio {
         struct bio          *bi_next;    /* request queue link */
         struct block_device *bi_bdev;    /* target device */
         unsigned long       bi_flags;    /* status, command, etc */
         unsigned long       bi_opf;      /* low bits: r/w, high: priority */

         unsigned int        bi_vcnt;     /* how many bio_vec's */
         struct bvec_iter    bi_iter;     /* current index into bio_vec array */

         unsigned int        bi_size;     /* total size in bytes */
         unsigned short      bi_hw_segments; /* segments after DMA remapping */
         unsigned int        bi_max;      /* max bio_vecs we can hold
                                             used as index into pool */
         struct bio_vec      *bi_io_vec;  /* the actual vec list */
         bio_end_io_t        *bi_end_io;  /* bi_end_io (bio) */
         atomic_t            bi_cnt;      /* pin count: free when it hits zero */
         void                *bi_private;
  };

With this multipage bio design:

- Large i/os can be sent down in one go using a bio_vec list consisting
  of an array of <page, offset, len> fragments (similar to the way fragments
  are represented in the zero-copy network code)
- Splitting of an i/o request across multiple devices (as in the case of
  lvm or raid) is achieved by cloning the bio (where the clone points to
  the same bi_io_vec array, but with the index and size accordingly modified)
- A linked list of bios is used as before for unrelated merges [#]_ - this
  avoids reallocs and makes independent completions easier to handle.
- Code that traverses the req list can find all the segments of a bio
  by using rq_for_each_segment. This handles the fact that a request
  has multiple bios, each of which can have multiple segments.
- Drivers which can't process a large bio in one shot can use the bi_iter
  field to keep track of the next bio_vec entry to process.
  (e.g. a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
  [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
  bi_offset and len fields]

.. [#]

   unrelated merges -- a request ends up containing two or more bios that
   didn't originate from the same place.

bi_end_io() i/o callback gets called on i/o completion of the entire bio.

At a lower level, drivers build a scatter gather list from the merged bios.
The scatter gather list is in the form of an array of <page, offset, len>
entries with their corresponding dma address mappings filled in at the
appropriate time. As an optimization, contiguous physical pages can be
covered by a single entry where <page> refers to the first page and <len>
covers the range of pages (up to 16 contiguous pages could be covered this
way).
There is a helper routine (blk_rq_map_sg) which drivers can use to build
the sg list.

Note: Right now the only user of bios with more than one page is ll_rw_kio,
which in turn means that only raw I/O uses it (direct i/o may not work
right now). The intent however is to make clustering of pages etc. possible.
The pagebuf abstraction layer from SGI also uses multi-page
bios, but that is currently not included in the stock development kernels.
The same is true of Andrew Morton's work-in-progress multipage bio writeout
and readahead patches.

2.3 Changes in the Request Structure
------------------------------------

The request structure is the structure that gets passed down to low level
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the driver's request_fn. The driver makes
use of the block layer helper routine elv_next_request to pull the next request
off the queue. Control or diagnostic functions might bypass block and directly
invoke underlying driver entry points passing in a specially constructed
request structure.

Only some relevant fields (mainly those which changed or may be referred
to in some of the discussion here) are listed below, not necessarily in
the order in which they occur in the structure (see include/linux/blkdev.h).
Refer to Documentation/block/request.rst for details about all the request
structure fields and a quick reference about the layers which are
supposed to use or modify those fields::

  struct request {
        struct list_head queuelist;  /* Not meant to be directly accessed by
                                        the driver.
                                        Used by q->elv_next_request_fn
                                        rq->queue is gone
                                        */
        .
        .
        unsigned char cmd[16]; /* prebuilt command data block */
        unsigned long flags;   /* also includes earlier rq->cmd settings */
        .
        .
        sector_t sector; /* this field is now of type sector_t instead of int
                            preparation for 64 bit sectors */
        .
        .

        /* Number of scatter-gather DMA addr+len pairs after
         * physical address coalescing is performed.
         */
        unsigned short nr_phys_segments;

        /* Number of scatter-gather addr+len pairs after
         * physical and DMA remapping hardware coalescing is performed.
         * This is the number of scatter-gather entries the driver
         * will actually have to deal with after DMA mapping is done.
         */
        unsigned short nr_hw_segments;

        /* Various sector counts */
        unsigned long nr_sectors;        /* no. of sectors left: driver modifiable */
        unsigned long hard_nr_sectors;   /* block internal copy of above */
        unsigned int current_nr_sectors; /* no. of sectors left in the
                                            current segment: driver modifiable */
        unsigned long hard_cur_sectors;  /* block internal copy of the above */
        .
        .
        int tag;        /* command tag associated with request */
        void *special;  /* same as before */
        char *buffer;   /* valid only for low memory buffers up to
                           current_nr_sectors */
        .
        .
        struct bio *bio, *biotail;  /* bio list instead of bh */
        struct request_list *rl;
  }

See the req_ops and req_flag_bits definitions for an explanation of the various
flags available. Some bits are used by the block layer or i/o scheduler.

The behaviour of the various sector counts is almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
to the number of sectors in the current segment being processed which could
be one of the many segments in the current bio (i.e. the i/o completion unit).
The nr_sectors value refers to the total number of sectors in the whole
request that remain to be transferred (no change).
The purpose of the hard_xxx values is for block to remember these counts
every time it hands over the request to the driver. These values are updated
by block on end_that_request_first, i.e. every time the driver completes a
part of the transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
hard_xxx values and the number of bytes transferred) and updates them on
every transfer that invokes end_that_request_first. It does the same for the
buffer, bio, bio->bi_iter fields too.

The buffer field is just a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
memory i/o, this field is not valid and must not be used by drivers.

Code that sets up its own request structures and passes them down to
a driver needs to be careful about interoperation with the block layer helper
functions which the driver uses. (Section 1.3)

3. Using bios
=============

3.1 Setup/Teardown
------------------

There are routines for managing the allocation, reference counting and
freeing of bios (bio_alloc, bio_get, bio_put).

This makes use of Ingo Molnar's mempool implementation, which enables
subsystems like bio to maintain their own reserve memory pools for guaranteed
deadlock-free allocations during extreme VM load. For example, the VM
subsystem makes use of the block layer to write out dirty pages in order to be
able to free up memory space, a case which needs careful handling. The
allocation logic draws from the preallocated emergency reserve in situations
where it cannot allocate through normal means. If the pool is empty and it
can wait, then it would trigger action that would help free up memory or
replenish the pool (without deadlocking) and wait for availability in the pool.
If it is in IRQ context, and hence not in a position to do this, allocation
could fail if the pool is empty. In general mempool always first tries to
perform allocation without having to wait, even if it means digging into the
pool as long as it is not less than 50% full.

On a free, memory is released to the pool or directly freed depending on
the current availability in the pool. The mempool interface lets the
subsystem specify the routines to be used for normal alloc and free. In the
case of bio, these routines make use of the standard slab allocator.

The caller of bio_alloc is expected to take certain steps to avoid
deadlocks, e.g. avoid trying to allocate more memory from the pool while
already holding memory obtained from the pool.
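
The reserve-pool rules described above can be modelled in userspace. What
follows is only a minimal sketch under stated assumptions, not the kernel
mempool code: toy_pool, toy_pool_alloc and toy_pool_free are hypothetical
names, malloc() stands in for the slab allocator, and the fail_normal flag is
an invented knob that simulates memory pressure. The sketch does not model
sleeping or the 50% rule; it only shows that an allocation falls back to the
preallocated reserve when the normal path fails, and that a free refills the
reserve before memory is really released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_MIN 4  /* size of the emergency reserve (assumption) */

/* Hypothetical userspace stand-in for a mempool; not a kernel structure. */
struct toy_pool {
	void  *reserve[TOY_POOL_MIN]; /* preallocated elements */
	int    curr;                  /* elements currently in the reserve */
	size_t elem_size;
	int    fail_normal;           /* test knob: pretend malloc() fails */
};

/* Fill the reserve up front; returns 0 on success, -1 on failure. */
static int toy_pool_init(struct toy_pool *p, size_t elem_size)
{
	p->curr = 0;
	p->elem_size = elem_size;
	p->fail_normal = 0;
	while (p->curr < TOY_POOL_MIN) {
		void *e = malloc(elem_size);

		if (!e)
			return -1;
		p->reserve[p->curr++] = e;
	}
	return 0;
}

/* Try the normal allocator first, then dig into the reserve. */
static void *toy_pool_alloc(struct toy_pool *p)
{
	void *e = p->fail_normal ? NULL : malloc(p->elem_size);

	if (e)
		return e;
	if (p->curr > 0)
		return p->reserve[--p->curr];
	return NULL;  /* a caller that cannot wait sees a failure here */
}

/* Refill the reserve first; only free for real when it is full. */
static void toy_pool_free(struct toy_pool *p, void *e)
{
	assert(e != NULL);
	if (p->curr < TOY_POOL_MIN)
		p->reserve[p->curr++] = e;
	else
		free(e);
}

/* Release whatever is still sitting in the reserve. */
static void toy_pool_destroy(struct toy_pool *p)
{
	while (p->curr > 0)
		free(p->reserve[--p->curr]);
}
```

In this model, once the normal path fails and all TOY_POOL_MIN elements are
held, toy_pool_alloc() returns NULL until somebody calls toy_pool_free(). A
caller that holds one element and asks the same pool for another can therefore
end up waiting on memory that only it can give back.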

::

  [TBD: This is a potential issue, though a rare possibility
   in the bounce bio allocation that happens in the current code, since
   it ends up allocating a second bio from the same pool while
   holding the original bio ]

Memory allocated from the pool should be released back within a limited
amount of time (in the case of bio, that would be after the i/o is completed).
This ensures that if part of the pool has been used up, some work (in this
case i/o) must already be in progress and memory would be available when it
is over. If allocating from multiple pools in the same code path, the order
or hierarchy of allocation needs to be consistent, just the way one deals
with multiple locks.

The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
for a non-clone bio. There are 6 pools set up for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.

The bio_get() routine may be used to hold an extra reference on a bio prior
to i/o submission, if the bio fields are likely to be accessed after the
i/o is issued (since the bio may otherwise get freed in case i/o completion
happens in the meantime).

The bio_clone_fast() routine may be used to duplicate a bio, where the clone
shares the bio_vec_list with the original bio (i.e. both point to the
same bio_vec_list).
This would typically be used for splitting i/o requests
in lvm or md.

3.2 Generic bio helper Routines
-------------------------------

3.2.1 Traversing segments and completion units in a request
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The macro rq_for_each_segment() should be used for traversing the bios
in the request list (drivers should avoid directly trying to do it
themselves). Using these helpers should also make it easier to cope
with block changes in the future.

::

        struct req_iterator iter;
        rq_for_each_segment(bio_vec, rq, iter)
                /* bio_vec is now current segment */

I/O completion callbacks are per-bio rather than per-segment, so drivers
that traverse bio chains on completion need to keep that in mind. Drivers
which don't make a distinction between segments and completion units would
need to be reorganized to support multi-segment bios.

3.2.2 Setting up DMA scatterlists
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The blk_rq_map_sg() helper routine would be used for setting up scatter
gather lists from a request, so a driver need not do it on its own.

        nr_segments = blk_rq_map_sg(q, rq, scatterlist);

The helper routine provides a level of abstraction which makes it easier
to modify the internals of request to scatterlist conversion down the line
without breaking drivers. The blk_rq_map_sg routine takes care of several
things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
is set) and correct segment accounting to avoid exceeding the limits which
the i/o hardware can handle, based on various queue properties.

- Prevents a clustered segment from crossing a 4GB mem boundary
- Avoids building segments that would exceed the number of physical
  memory segments that the driver can handle (phys_segments) and the
  number that the underlying hardware can handle at once, accounting for
  DMA remapping (hw_segments) (i.e. IOMMU aware limits).

Routines which the low level driver can use to set up the segment limits:

blk_queue_max_hw_segments() : Sets an upper limit on the maximum number of
hw data segments in a request (i.e. the maximum number of address/length
pairs the host adapter can actually hand to the device at once)

blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
of physical data segments in a request (i.e.
the largest sized scatter list
a driver could handle)

3.2.3 I/O completion
^^^^^^^^^^^^^^^^^^^^

The existing generic block layer helper routines end_request,
end_that_request_first and end_that_request_last can be used for i/o
completion (and setting things up so the rest of the i/o or the next
request can be kicked off) as before. With the introduction of multi-page
bio support, end_that_request_first requires an additional argument indicating
the number of sectors completed.

3.2.4 Implications for drivers that do not interpret bios
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(don't handle multiple segments)

Drivers that do not interpret bios, e.g. those which do not handle multiple
segments and do not support i/o into high memory addresses (require bounce
buffers) and expect only virtually mapped buffers, can access the rq->buffer
field. As before the driver should use current_nr_sectors to determine the
size of remaining data in the current segment (that is the maximum it can
transfer in one go unless it interprets segments), and rely on the block layer
end_request, or end_that_request_first/last to take care of all accounting
and transparent mapping of the next bio segment when a segment boundary
is crossed on completion of a transfer.
(The end*request* functions should
be used if and only if the request has come down from the block/bio path, not
for direct access requests which only specify rq->buffer without a valid
rq->bio)

3.3 I/O Submission
------------------

The routine submit_bio() is used to submit a single io. Higher level i/o
routines make use of this:

(a) Buffered i/o:

The routine submit_bh() invokes submit_bio() on a bio corresponding to the
bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.

(b) Kiobuf i/o (for raw/direct i/o):

The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
maps the array to one or more multi-page bios, issuing submit_bio() to
perform the i/o on each of these.

The embedded bh array in the kiobuf structure has been removed and no
preallocation of bios is done for kiobufs. [The intent is to remove the
blocks array as well, but it's currently in there to kludge around direct i/o.]
Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.

Todo/Observation:

 A single kiobuf structure is assumed to correspond to a contiguous range
 of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
 So right now it wouldn't work for direct i/o on non-contiguous blocks.
 This is to be resolved.
 The eventual direction is to replace kiobuf
 by kvec's.

 Badari Pulavarty has a patch to implement direct i/o correctly using
 bio and kvec.


(c) Page i/o:

Todo/Under discussion:

 Andrew Morton's multi-page bio patches attempt to issue multi-page
 writeouts (and reads) from the page cache, by directly building up
 large bios for submission completely bypassing the usage of buffer
 heads. This work is still in progress.

 Christoph Hellwig had some code that uses bios for page-io (rather than
 bh). This isn't included in bio as yet. Christoph was also working on a
 design for representing virtual/real extents as an entity and modifying
 some of the address space ops interfaces to utilize this abstraction rather
 than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
 abstraction, but intended to be as lightweight as possible).

(d) Direct access i/o:

Direct access requests that do not contain bios would be submitted differently
as discussed earlier in section 1.3.

Aside:

  Kvec i/o:

  Ben LaHaise's aio code uses a slightly different structure instead
  of kiobufs, called a kvec_cb.
  This contains an array of <page, offset, len>
  tuples (very much like the networking code), together with a callback function
  and data pointer. This is embedded into a brw_cb structure when passed
  to brw_kvec_async().

  Now it should be possible to directly map these kvecs to a bio. Just as while
  cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
  array pointer to point to the veclet array in kvecs.

  TBD: In order for this to work, some changes are needed in the way multi-page
  bios are handled today. The values of the tuples in such a vector passed in
  from higher level code should not be modified by the block layer in the course
  of its request processing, since that would make it hard for the higher layer
  to continue to use the vector descriptor (kvec) after i/o completes. Instead,
  all such transient state should be maintained in the request structure,
  and passed on in some way to the endio completion routine.


4. The I/O scheduler
====================

The I/O scheduler, a.k.a. the elevator, is implemented in two layers: a generic
dispatch queue and specific I/O schedulers. Unless stated otherwise, elevator
is used to refer to both parts and I/O scheduler to specific I/O schedulers.

The block layer implements the generic dispatch queue in `block/*.c`.
The generic dispatch queue is responsible for requeueing, handling non-fs
requests and all other subtleties.

Specific I/O schedulers are responsible for ordering normal filesystem
requests. They can also choose to delay certain requests to improve
throughput or for some other purpose. As the plural form indicates, there are
multiple I/O schedulers. They can be built as modules but at least one should
be built inside the kernel. Each queue can choose a different one and can also
change to another one dynamically.

A block layer call to the i/o scheduler follows the convention elv_xxx(). This
calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
and xxx might not match exactly, but use your imagination. If an elevator
doesn't implement a function, the switch does nothing or does some minimal
housekeeping work.

4.1. I/O scheduler API
----------------------

The functions an elevator may implement are: (* are mandatory)

=============================== ================================================
elevator_merge_fn               called to query requests for merge with a bio

elevator_merge_req_fn           called when two requests get merged. the one
                                which gets merged into the other one will
                                never be seen by I/O scheduler again. IOW, after
                                being merged, the request is gone.

elevator_merged_fn              called when a request in the scheduler has been
                                involved in a merge.
                                It is used in the deadline
                                scheduler for example, to reposition the request
                                if its sorting order has changed.

elevator_allow_merge_fn         called whenever the block layer determines
                                that a bio can be merged into an existing
                                request safely. The io scheduler may still
                                want to stop a merge at this point if it
                                results in some sort of conflict internally,
                                this hook allows it to do that. Note however
                                that two *requests* can still be merged at a
                                later time. Currently the io scheduler has no
                                way to prevent that. It can only learn about
                                the fact from elevator_merge_req_fn callback.

elevator_dispatch_fn*           fills the dispatch queue with ready requests.
                                I/O schedulers are free to postpone requests by
                                not filling the dispatch queue unless @force
                                is non-zero.  Once dispatched, I/O schedulers
                                are not allowed to manipulate the requests -
                                they belong to generic dispatch queue.

elevator_add_req_fn*            called to add a new request into the scheduler

elevator_former_req_fn
elevator_latter_req_fn          These return the request before or after the
                                one specified in disk sort order. Used by the
                                block layer to find merge possibilities.

elevator_completed_req_fn       called when a request is completed.

elevator_set_req_fn
elevator_put_req_fn             Must be used to allocate and free any elevator
                                specific storage for a request.

elevator_activate_req_fn        Called when device driver first sees a request.
                                I/O schedulers can use this callback to
                                determine when actual execution of a request
                                starts.
elevator_deactivate_req_fn      Called when device driver decides to delay
                                a request by requeueing it.

elevator_init_fn*
elevator_exit_fn                Allocate and free any elevator specific storage
                                for a queue.
=============================== ================================================

4.2 Request flows seen by I/O schedulers
----------------------------------------

All requests seen by I/O schedulers strictly follow one of the following three
flows.

 set_req_fn ->

 i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
      (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
 ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
 iii.
      [none]

 -> put_req_fn

4.3 I/O scheduler implementation
--------------------------------

The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
optimal disk scan and request servicing performance (based on generic
principles and device capabilities), optimized for:

i.   improved throughput
ii.  improved latency
iii. better utilization of h/w & CPU time

Characteristics:

i. Binary tree
AS and deadline i/o schedulers use red black binary trees for disk position
sorting and searching, and a fifo linked list for time-based searching. This
gives good scalability and good availability of information. Requests are
almost always dispatched in disk sort order, so a cache is kept of the next
request in sort order to prevent binary tree lookups.

This arrangement is not a generic block layer characteristic however, so
elevators may implement queues as they please.

ii. Merge hash
AS and deadline use a hash table indexed by the last sector of a request. This
enables merging code to quickly look up "back merge" candidates, even when
multiple I/O streams are being performed at once on one disk.

"Front merges", a new request being merged at the front of an existing request,
are far less common than "back merges" due to the nature of most I/O patterns.
Front merges are handled by the binary trees in AS and deadline schedulers.

iii. Plugging the queue to batch requests in anticipation of opportunities for
     merge/sort optimizations

Plugging is an approach that the current i/o scheduling algorithm resorts to so
that it collects up enough requests in the queue to be able to take
advantage of the sorting/merging logic in the elevator. If the
queue is empty when a request comes in, then it plugs the request queue
(sort of like plugging the bath tub of a vessel to get fluid to build up)
till it fills up with a few more requests, before starting to service
the requests. This provides an opportunity to merge/sort the requests before
passing them down to the device. There are various conditions when the queue is
unplugged (to open up the flow again), either through a scheduled task or
on demand. For example wait_on_buffer sets the unplugging going
through sync_buffer() running blk_run_address_space(mapping). Or the caller
can do it explicitly through blk_unplug(bdev). So in the read case,
the queue gets explicitly unplugged as part of waiting for completion on that
buffer.

Aside:
  This is kind of controversial territory, as it's not clear if plugging is
  always the right thing to do. Devices typically have their own queues,
  and allowing a big queue to build up in software, while letting the device be
  idle for a while may not always make sense.
  The trick is to handle the fine
  balance between when to plug and when to open up. Also now that we have
  multi-page bios being queued in one shot, we may not need to wait to merge
  a big request from the broken up pieces coming by.

4.4 I/O contexts
----------------

I/O contexts provide a dynamically allocated per process data area. They may
be used in I/O schedulers, and in the block layer (could be used for I/O
statistics or priorities, for example). See `*io_context` in
block/ll_rw_blk.c, and as-iosched.c for an example of usage in an i/o
scheduler.


5. Scalability related changes
==============================

5.1 Granular Locking: io_request_lock replaced by a per-queue lock
------------------------------------------------------------------

The global io_request_lock has been removed as of 2.5, to avoid
the scalability bottleneck it was causing, and has been replaced by more
granular locking. The request queue structure has a pointer to the
lock to be used for that queue. As a result, locking can now be
per-queue, with a provision for sharing a lock across queues if
necessary (e.g. the scsi layer sets the queue lock pointers to the
corresponding adapter lock, which results in a per host locking
granularity). The locking semantics are the same, i.e.
locking is still imposed by the block layer, grabbing the lock before
request_fn execution, which means that lots of older drivers
should still be SMP safe. Drivers are free to drop the queue
lock themselves, if required. Drivers that explicitly used the
io_request_lock for serialization need to be modified accordingly.
Usually it's as easy as adding a global lock::

        static DEFINE_SPINLOCK(my_driver_lock);

and passing the address of that lock to blk_init_queue().

5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
----------------------------------------------------------------

The sector number used in the bio structure has been changed to sector_t,
which could be defined as 64 bit in preparation for 64 bit sector support.

6. Other Changes/Implications
=============================

6.1 Partition re-mapping handled by the generic block layer
-----------------------------------------------------------

In 2.5 some of the gendisk/partition related code has been reorganized.
Now the generic block layer performs partition-remapping early and thus
provides drivers with a sector number relative to the whole device, rather than
having to take the partition number into account in order to arrive at the true
sector number.
The routine blk_partition_remap() is invoked by
submit_bio_noacct even before invoking the queue specific ->submit_bio,
so the i/o scheduler also gets to operate on whole disk sector numbers. This
should typically not require changes to block drivers; a driver just never gets
to invoke its own partition sector offset calculations since all bios
sent are offset from the beginning of the device.


7. A Few Tips on Migration of older drivers
===========================================

Old-style drivers that just use CURRENT and ignore clustered requests
may not need much change. The generic layer will automatically handle
clustered requests, multi-page bios, etc for the driver.

For a low performance driver or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.

The following are some points to keep in mind when converting old drivers
to bio.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

Now end_that_request_first takes an additional number_of_sectors argument.
It used to handle always just the first buffer_head in a request, now
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.

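The looping behaviour described for end_that_request_first can be pictured
with a small user-space model. This is only a sketch, not kernel code: the
names toy_segment, toy_request and toy_end_request_first are hypothetical,
and the return convention (non-zero while segments are still pending, zero
once the request is finished) is an assumption made for the illustration.
The model walks the segments in order, retires every segment that the given
number of sectors fully covers, and records partial progress on the next one:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one bio segment: only its length in sectors. */
struct toy_segment {
        unsigned int nr_sectors;
};

/* Hypothetical stand-in for a request: a segment array plus a cursor. */
struct toy_request {
        const struct toy_segment *segs;
        unsigned int nr_segs;
        unsigned int next_seg;  /* first segment not yet fully completed */
        unsigned int seg_done;  /* sectors already completed in that segment */
};

/*
 * Account for nr_sectors of completed i/o. Segments that are fully covered
 * are retired; a partially covered segment stays pending with its progress
 * recorded. Returns 1 while segments remain pending, 0 when all are done
 * (an assumed convention for this sketch).
 */
static int toy_end_request_first(struct toy_request *rq,
                                 unsigned int nr_sectors)
{
        assert(rq != NULL);

        while (nr_sectors > 0 && rq->next_seg < rq->nr_segs) {
                unsigned int left =
                        rq->segs[rq->next_seg].nr_sectors - rq->seg_done;

                if (nr_sectors < left) {
                        /* Not enough to finish this segment yet. */
                        rq->seg_done += nr_sectors;
                        break;
                }

                /* Segment fully covered: retire it and move on. */
                nr_sectors -= left;
                rq->seg_done = 0;
                rq->next_seg++;
        }

        return rq->next_seg < rq->nr_segs;
}
```

With segments of 8, 8 and 16 sectors, completing 8 sectors retires only the
first segment, a following completion of 12 retires the second and leaves the
third partially done, and a final completion of 12 finishes the request.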

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size
etc per queue now. Drivers that used to define their own merge functions
to handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the
correct absolute location; this is done by the block layer, so
where a driver received a request like this before::

        rq->rq_dev = mk_kdev(3, 5);     /* /dev/hda5 */
        rq->sector = 0;                 /* first sector on hda5 */

it will now see::

        rq->rq_dev = mk_kdev(3, 0);     /* /dev/hda */
        rq->sector = 123128;            /* offset from start of disk */

As mentioned, there is no virtual mapping of a bio. For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (dma_map_page for a single segment or
use dma_map_sg for scatter gather) to be able to ship it to the device.
For PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer, a virtual mapping is needed. If the driver supports highmem I/O
(Sec 1.1, (ii)), it needs to use kmap_atomic or similar to temporarily map
a bio into the virtual address space.


8. Prior/Related/Impacted patches
=================================

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
-----------------------------------------------------

- orig kiobuf & raw i/o patches (now in 2.4 tree)
- direct kiobuf based i/o to devices (no intermediate bh's)
- page i/o using kiobuf
- kiobuf splitting for lvm (mkp)
- elevator support for kiobuf request merging (axboe)

8.2. Zero-copy networking (Dave Miller)
---------------------------------------

8.3. SGI XFS - pagebuf patches - use of kiobufs
-----------------------------------------------
8.4. Multi-page pioent patch for bio (Christoph Hellwig)
--------------------------------------------------------
8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
--------------------------------------------------------------------
8.6. Async i/o implementation patch (Ben LaHaise)
-------------------------------------------------
8.7. EVMS layering design (IBM EVMS team)
-----------------------------------------
8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
-------------------------------------------------------------------------------------

    => larger contiguous physical memory buffers

8.9. VM reservations patch (Ben LaHaise)
----------------------------------------
8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
----------------------------------------------------------
8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
---------------------------------------------------------------------------
8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
-------------------------------------------------------------------------------
8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
------------------------------------------------------------------
8.14 IDE Taskfile i/o patch (Andre Hedrick)
--------------------------------------------
8.15 Multi-page writeout and readahead patches (Andrew Morton)
---------------------------------------------------------------
8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)
-----------------------------------------------------------------------

9. Other References
===================

9.1 The Splice I/O Model
------------------------

Larry McVoy (and subsequent discussions on lkml, and Linus' comments - Jan 2001)

9.2 Discussions about kiobuf and bh design
------------------------------------------

On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
initial thoughts that led to bio were brought up in this discussion thread)

9.3 Discussions on mempool on lkml - Dec 2001.
----------------------------------------------