=====================================================
Notes on the Generic Block Layer Rewrite in Linux 2.5
=====================================================

.. note::

	It seems that there is a lot of outdated material here. This seems
	to be written somewhat as a task list. Yet, eventually, something
	here might still be useful.

Notes Written on Jan 15, 2002:

	- Jens Axboe <jens.axboe@oracle.com>
	- Suparna Bhattacharya <suparna@in.ibm.com>

Last Updated May 2, 2002

September 2003: Updated I/O Scheduler portions
	- Nick Piggin <npiggin@kernel.dk>

Introduction
============

These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite. The idea is to bring out some of the key
changes and a glimpse of the rationale behind those changes.

Please mail corrections & suggestions to suparna@in.ibm.com.

Credits
=======

2.5 bio rewrite:
	- Jens Axboe <jens.axboe@oracle.com>

Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.

The following people helped with review comments and inputs for this
document:

	- Christoph Hellwig <hch@infradead.org>
	- Arjan van de Ven <arjanv@redhat.com>
	- Randy Dunlap <rdunlap@xenotime.net>
	- Andre Hedrick <andre@linux-ide.org>

The following people helped with fixes/contributions to the bio patches
while it was still work-in-progress:

	- David S. Miller <davem@redhat.com>


.. Description of Contents:

   1. Scope for tuning of logic to various needs
     1.1 Tuning based on device or low level driver capabilities
	- Per-queue parameters
	- Highmem I/O support
	- I/O scheduler modularization
     1.2 Tuning based on high level requirements/capabilities
	1.2.1 Request Priority/Latency
     1.3 Direct access/bypass to lower layers for diagnostics and special
	 device operations
	1.3.1 Pre-built commands
   2. New flexible and generic but minimalist i/o structure or descriptor
      (instead of using buffer heads at the i/o layer)
     2.1 Requirements/Goals addressed
     2.2 The bio struct in detail (multi-page io unit)
     2.3 Changes in the request structure
   3. Using bios
     3.1 Setup/teardown (allocation, splitting)
     3.2 Generic bio helper routines
       3.2.1 Traversing segments and completion units in a request
       3.2.2 Setting up DMA scatterlists
       3.2.3 I/O completion
       3.2.4 Implications for drivers that do not interpret bios (don't handle
	  multiple segments)
     3.3 I/O submission
   4. The I/O scheduler
   5. Scalability related changes
     5.1 Granular locking: Removal of io_request_lock
     5.2 Prepare for transition to 64 bit sector_t
   6. Other Changes/Implications
     6.1 Partition re-mapping handled by the generic block layer
   7. A few tips on migration of older drivers
   8. A list of prior/related/impacted patches/ideas
   9. Other References/Discussion Threads


Bio Notes
=========

Let us discuss the changes in the context of how some overall goals for the
block layer are addressed.

1. Scope for tuning the generic logic to satisfy various requirements
=====================================================================

The block layer design supports adaptable abstractions to handle common
processing with the ability to tune the logic to an appropriate extent
depending on the nature of the device and the requirements of the caller.
One of the objectives of the rewrite was to increase the degree of tunability
and to enable higher level code to utilize underlying device/driver
capabilities to the maximum extent for better i/o performance. This is
important especially in the light of ever improving hardware capabilities
and application/middleware software designed to take advantage of these
capabilities.

1.1 Tuning based on low level device / driver capabilities
----------------------------------------------------------

Sophisticated devices with large built-in caches, intelligent i/o scheduling
optimizations, high memory DMA support, etc. may find some of the
generic processing an overhead, while for less capable devices the
generic functionality is essential for performance or correctness reasons.
Knowledge of some of the capabilities or parameters of the device should be
used at the generic block layer to make the right decisions on
behalf of the driver.

How is this achieved?

Tuning at a per-queue level:

i. Per-queue limits/values exported to the generic layer by the driver

Various parameters that the generic i/o scheduler logic uses are set at
a per-queue level (e.g. maximum request size, maximum number of segments in
a scatter-gather list, logical block size).

Some parameters that were earlier available as global arrays indexed by
major/minor are now directly associated with the queue. Some of these may
move into the block device structure in the future. Some characteristics
have been incorporated into a queue flags field rather than separate fields
in themselves.  The parameters are set through blk_queue_xxx functions
rather than by updating the fields directly.

Some new queue property settings:

	blk_queue_bounce_limit(q, u64 dma_address)
		Enable I/O to highmem pages, dma_address being the
		limit. The default is no highmem I/O.

	blk_queue_max_sectors(q, max_sectors)
		Sets two variables that limit the size of the request.

		- The request queue's max_sectors, which is a soft size in
		  units of 512 byte sectors, and could be dynamically varied
		  by the core kernel.

		- The request queue's max_hw_sectors, which is a hard limit
		  and reflects the maximum size request a driver can handle
		  in units of 512 byte sectors.

		The default for both max_sectors and max_hw_sectors is
		255. The upper limit of max_sectors is 1024.

	blk_queue_max_phys_segments(q, max_segments)
		Maximum physical segments you can handle in a request. The
		default is 128 (driver limit). (See 3.2.2)

	blk_queue_max_hw_segments(q, max_segments)
		Maximum dma segments the hardware can handle in a request. The
		default is 128 (host adapter limit, after dma remapping).
		(See 3.2.2)

	blk_queue_max_segment_size(q, max_seg_size)
		Maximum size of a clustered segment. The default is 64kB.

	blk_queue_logical_block_size(q, logical_block_size)
		Lowest possible sector size that the hardware can operate
		on. The default is 512 bytes.
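
As a rough illustration of the per-queue idea, the following userspace sketch
models a queue's limits as a plain struct with setter helpers. All names here
(struct sketch_queue, sketch_queue_init, sketch_set_max_sectors,
sketch_request_fits) are hypothetical and only mirror the blk_queue_xxx
setters listed above. The defaults are the ones quoted in this section, and
clamping the soft limit to 1024 is simply how this sketch interprets the
stated upper limit of max_sectors.

```c
#include <assert.h>

/* Hypothetical userspace model of per-queue limits; not the kernel's struct. */
struct sketch_queue {
	unsigned int max_sectors;        /* soft limit, 512-byte sectors */
	unsigned int max_hw_sectors;     /* hard limit, 512-byte sectors */
	unsigned int max_phys_segments;
	unsigned int max_hw_segments;
	unsigned int max_segment_size;   /* bytes */
	unsigned int logical_block_size; /* bytes */
};

#define SKETCH_SOFT_CAP 1024u /* upper limit of max_sectors quoted above */

/* Fill in the defaults quoted in the list above. */
void sketch_queue_init(struct sketch_queue *q)
{
	q->max_sectors = 255;
	q->max_hw_sectors = 255;
	q->max_phys_segments = 128;
	q->max_hw_segments = 128;
	q->max_segment_size = 64 * 1024;
	q->logical_block_size = 512;
}

/* Models blk_queue_max_sectors(): one call sets the hard and the soft limit. */
void sketch_set_max_sectors(struct sketch_queue *q, unsigned int max_sectors)
{
	assert(max_sectors > 0);
	q->max_hw_sectors = max_sectors;
	q->max_sectors = max_sectors < SKETCH_SOFT_CAP ? max_sectors
						       : SKETCH_SOFT_CAP;
}

/* Would a request of nr_sectors fit within this queue's soft limit? */
int sketch_request_fits(const struct sketch_queue *q, unsigned int nr_sectors)
{
	return nr_sectors <= q->max_sectors;
}
```

Because the limits live in the queue, two devices served by the same driver
can carry different values without any global table.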

New queue flags:

	- QUEUE_FLAG_CLUSTER (see 3.2.2)
	- QUEUE_FLAG_QUEUED (see 3.2.4)


ii. High-mem i/o capabilities are now considered the default

In 2.4, the generic bounce buffer logic copied i/o requests on high-memory
buffers to and from low-memory buffers by default, assuming that the driver
would not be able to handle them directly. This has changed in 2.5: the
bounce logic is now applied only for memory ranges for which the device
cannot handle i/o. A driver can specify this by
setting the queue bounce limit for the request queue for the device
(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
where a device is capable of handling high memory i/o.

In order to enable high-memory i/o where the device is capable of supporting
it, the pci dma mapping routines and associated data structures have now been
modified to accomplish a direct page -> bus translation, without requiring
a virtual address mapping (unlike the earlier scheme of virtual address
-> bus translation). So this works uniformly for high-memory pages (which
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.

Note: Please refer to :doc:`/core-api/dma-api-howto` for a discussion
on PCI high mem DMA aspects and mapping of scatter gather lists, and support
for 64 bit PCI.

Special handling is required only for cases where i/o needs to happen on
pages at physical memory addresses beyond what the device can support. In these
cases, a bounce bio representing a buffer from the supported memory range
is used for performing the i/o with copyin/copyout as needed depending on
the type of the operation.  For example, in case of a read operation, the
data read has to be copied to the original buffer on i/o completion, so a
callback routine is set up to do this, while for write, the data is copied
from the original buffer to the bounce buffer prior to issuing the
operation. Since an original buffer may be in a high memory area that is not
mapped into the kernel virtual address space, a kmap operation may be
required for performing the copy, and special care may be needed in the
completion path as it may not be in irq context. Special care is also
required (by way of GFP flags) when allocating bounce buffers, to avoid
certain highmem deadlock possibilities.
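
The bounce decision and the copy direction described above can be sketched in
userspace C as follows. This is a simplified model under stated assumptions:
struct sketch_page, the sketch_* functions and the plain address comparison
are all hypothetical, and kmap, GFP flags and irq-context concerns are left
out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

enum sketch_dir { SKETCH_READ, SKETCH_WRITE };

/* Hypothetical stand-in for a page: a physical address plus its contents. */
struct sketch_page {
	uint64_t phys_addr;
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* A page needs bouncing only if it lies beyond what the device can address. */
int sketch_needs_bounce(const struct sketch_page *pg, uint64_t bounce_limit)
{
	return pg->phys_addr > bounce_limit;
}

/* Before submission: for a write, the bounce page must already hold the data. */
void sketch_bounce_prepare(enum sketch_dir dir, const struct sketch_page *orig,
			   struct sketch_page *bounce)
{
	assert(orig != NULL && bounce != NULL);
	if (dir == SKETCH_WRITE)
		memcpy(bounce->data, orig->data, SKETCH_PAGE_SIZE);
}

/* Completion callback: for a read, copy the data back to the original page. */
void sketch_bounce_complete(enum sketch_dir dir, struct sketch_page *orig,
			    const struct sketch_page *bounce)
{
	assert(orig != NULL && bounce != NULL);
	if (dir == SKETCH_READ)
		memcpy(orig->data, bounce->data, SKETCH_PAGE_SIZE);
}
```

A device that can address all of memory would set its limit high enough that
sketch_needs_bounce() never fires, which is the case the 2.5 design optimizes.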

It is also possible that a bounce buffer may be allocated from a high-memory
area that is not mapped into the kernel virtual address space, but is within
the range that the device can use directly; so the bounce page may need to be
kmapped during copy operations. [Note: This does not hold in the current
implementation, though]

There are some situations when pages from high memory may need to
be kmapped, even if bounce buffers are not necessary. For example a device
may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions. A driver could also use
the blk_queue_bounce() routine on its own to bounce highmem i/o to low
memory for specific requests if so desired.

iii. The i/o scheduler algorithm itself can be replaced/set as appropriate

As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
queue or pick from (copy) existing generic schedulers and replace/override
certain portions of it. The 2.5 rewrite provides improved modularization
of the i/o scheduler. There are more pluggable callbacks, e.g. for init,
add request, extract request, which makes it possible to abstract specific
i/o scheduling algorithm aspects and details outside of the generic loop.
It also makes it possible to completely hide the implementation details of
the i/o scheduler from block drivers.
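
The pluggable-callback idea can be illustrated with a small table of function
pointers. The sketch below is a userspace model only: the struct and function
names are hypothetical, the three hooks correspond to the init, add request
and extract request callbacks mentioned above, and the sector-sorted policy is
just one example of a scheduler that could sit behind the table.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_request {
	unsigned long sector;
	struct sketch_request *next;
};

struct sketch_queue;

/* Hypothetical callback table; the real elevator interface has more hooks. */
struct sketch_elevator_ops {
	void (*init)(struct sketch_queue *q);
	void (*add_request)(struct sketch_queue *q, struct sketch_request *rq);
	struct sketch_request *(*next_request)(struct sketch_queue *q);
};

struct sketch_queue {
	const struct sketch_elevator_ops *ops;
	struct sketch_request *head; /* private to the scheduler */
};

/* One possible policy: keep the list sorted by ascending start sector. */
void sorted_init(struct sketch_queue *q)
{
	q->head = NULL;
}

void sorted_add(struct sketch_queue *q, struct sketch_request *rq)
{
	struct sketch_request **pos = &q->head;

	while (*pos && (*pos)->sector <= rq->sector)
		pos = &(*pos)->next;
	rq->next = *pos;
	*pos = rq;
}

struct sketch_request *sorted_next(struct sketch_queue *q)
{
	struct sketch_request *rq = q->head;

	if (rq)
		q->head = rq->next;
	return rq;
}

const struct sketch_elevator_ops sorted_ops = {
	.init = sorted_init,
	.add_request = sorted_add,
	.next_request = sorted_next,
};

/* Wrappers: drivers call these and never touch q->head directly. */
void sketch_elv_attach(struct sketch_queue *q,
		       const struct sketch_elevator_ops *ops)
{
	assert(ops && ops->init && ops->add_request && ops->next_request);
	q->ops = ops;
	q->ops->init(q);
}

void sketch_elv_add(struct sketch_queue *q, struct sketch_request *rq)
{
	q->ops->add_request(q, rq);
}

struct sketch_request *sketch_elv_next(struct sketch_queue *q)
{
	return q->ops->next_request(q);
}
```

Swapping in a different policy means passing a different ops table to
sketch_elv_attach(); code that only uses the wrappers is unaffected.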

I/O scheduler wrappers are to be used instead of accessing the queue directly.
See section 4 (The I/O scheduler) for details.

1.2 Tuning Based on High level code capabilities
------------------------------------------------

i. Application capabilities for raw i/o

This comes from some of the high-performance database/middleware
requirements where an application prefers to make its own i/o scheduling
decisions based on an understanding of the access patterns and i/o
characteristics.

ii. High performance filesystems or other higher level kernel code's
capabilities

Kernel components like filesystems could also take their own i/o scheduling
decisions for optimizing performance. Journalling filesystems may need
some control over i/o ordering.

What kind of support exists at the generic block layer for this?

The flags and rw fields in the bio structure can be used for some tuning
from above, e.g. indicating that an i/o is just a readahead request, or priority
settings (currently unused). As far as user applications are concerned they
would need an additional mechanism either via open flags or ioctls, or some
other upper level mechanism to communicate such settings to block.

1.2.1 Request Priority/Latency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Todo/Under discussion::

  Arjan's proposed request priority scheme allows higher levels some broad
  control (high/med/low) over the priority of an i/o request vs other pending
  requests in the queue. For example it allows reads for bringing in an
  executable page on demand to be given a higher priority over pending write
  requests which haven't aged too much on the queue. Potentially this priority
  could even be exposed to applications in some manner, providing higher level
  tunability. Time based aging avoids starvation of lower priority
  requests. Some bits in the bi_opf flags field in the bio structure are
  intended to be used for this priority information.
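
Since the scheme above was only a proposal, the following is purely an
illustrative sketch of how three priority levels and time-based aging could
interact. The types, the sketch_pick function and the max_age threshold are
all hypothetical; nothing here reflects an actual kernel interface.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_prio { SKETCH_PRIO_LOW, SKETCH_PRIO_MED, SKETCH_PRIO_HIGH };

struct sketch_request {
	enum sketch_prio prio;
	unsigned long queued_at; /* arbitrary time units, assumed <= now */
};

/*
 * Pick the index of the request to service next.  A request that has waited
 * longer than max_age wins regardless of priority (oldest such request
 * first); otherwise the highest priority wins, with ties going to the older
 * request.
 */
size_t sketch_pick(const struct sketch_request *rqs, size_t n,
		   unsigned long now, unsigned long max_age)
{
	size_t best = 0;
	size_t i;

	assert(n > 0);
	for (i = 1; i < n; i++) {
		int best_aged = now - rqs[best].queued_at > max_age;
		int cand_aged = now - rqs[i].queued_at > max_age;

		if (cand_aged != best_aged) {
			if (cand_aged)
				best = i;
		} else if (cand_aged) {
			if (rqs[i].queued_at < rqs[best].queued_at)
				best = i;
		} else if (rqs[i].prio > rqs[best].prio ||
			   (rqs[i].prio == rqs[best].prio &&
			    rqs[i].queued_at < rqs[best].queued_at)) {
			best = i;
		}
	}
	return best;
}
```

With a generous max_age a high priority read overtakes an older low priority
write; once the write has waited past max_age it is served first, which is the
starvation guard the proposal describes.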


1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
-----------------------------------------------------------------------

(e.g. Diagnostics, Systems Management)

There are situations where high-level code needs to have direct access to
the low level device capabilities or requires the ability to issue commands
to the device bypassing some of the intermediate i/o layers.
These could, for example, be special control commands issued through ioctl
interfaces, or could be raw read/write commands that stress the drive's
capabilities for certain kinds of fitness tests. Having direct interfaces at
multiple levels without having to pass through upper layers makes
it possible to perform bottom up validation of the i/o path, layer by
layer, starting from the media.

The normal i/o submission interfaces, e.g. submit_bio, could be bypassed
for the specially crafted requests that such ioctl or diagnostics
interfaces would typically use. The elevator add_request routine
can instead be used to insert such requests directly in the queue or,
preferably, the blk_do_rq routine can be used to place the request on the
queue and wait for completion. Alternatively, the caller might sometimes just
invoke a lower level driver specific interface with the request as a
parameter.

If the request is a means for passing on special information associated with
the command, then such information is associated with the request->special
field (rather than misusing the request->buffer field, which is meant for the
request data buffer's virtual mapping).

For passing request data, the caller must build up a bio descriptor
representing the concerned memory buffer if the underlying driver interprets
bio segments or uses the block layer end*request* functions for i/o
completion. Alternatively one could directly use the request->buffer field to
specify the virtual address of the buffer, if the driver expects buffer
addresses passed in this way and ignores bio entries for the request type
involved. In the latter case, the driver would modify and manage the
request->buffer, request->sector and request->nr_sectors or
request->current_nr_sectors fields itself rather than using the block layer
end_request or end_that_request_first completion interfaces.
(See 2.3 or Documentation/block/request.rst for a brief explanation of
the request structure fields)
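
For the bio-less case, a driver that manages the request fields by hand might
advance them on each partial completion roughly as follows. This is a
userspace sketch: struct sketch_request holds only the four fields named
above, and sketch_advance is a hypothetical helper, not a block layer routine.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SECTOR_SIZE 512u

/* Hypothetical subset of the request fields named above. */
struct sketch_request {
	char *buffer;                    /* virtual address of the data */
	unsigned long sector;            /* next sector to transfer */
	unsigned long nr_sectors;        /* sectors left in the whole request */
	unsigned int current_nr_sectors; /* sectors left in the current chunk */
};

/*
 * Advance the request state after `done` sectors have completed.
 * Returns 1 while work remains and 0 once the request is finished.
 */
int sketch_advance(struct sketch_request *rq, unsigned int done)
{
	assert(done <= rq->current_nr_sectors && done <= rq->nr_sectors);
	rq->buffer += (size_t)done * SKETCH_SECTOR_SIZE;
	rq->sector += done;
	rq->nr_sectors -= done;
	rq->current_nr_sectors -= done;
	return rq->nr_sectors != 0;
}
```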

::

  [TBD: end_that_request_last should be usable even in this case;
  Perhaps an end_that_direct_request_first routine could be implemented to make
  handling direct requests easier for such drivers; Also for drivers that
  expect bios, a helper function could be provided for setting up a bio
  corresponding to a data buffer]

  <JENS: I don't understand the above, why is end_that_request_first() not
  usable? Or _last for that matter. I must be missing something>

  <SUP: What I meant here was that if the request doesn't have a bio, then
   end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
   and hence can't be used for advancing request state settings on the
   completion of partial transfers. The driver has to modify these fields
   directly by hand.
   This is because end_that_request_first only iterates over the bio list,
   and always returns 0 if there are none associated with the request.
   _last works OK in this case, and is not a problem, as I mentioned earlier
  >

1.3.1 Pre-built Commands
^^^^^^^^^^^^^^^^^^^^^^^^

A request can be created with a pre-built custom command to be sent directly
to the device. The cmd block in the request structure has room for filling
in the command bytes. (i.e. rq->cmd is now 16 bytes in size, and meant for
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)

The request structure flags can be set up to indicate the type of request
in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
packet command issued via blk_do_rq, REQ_SPECIAL: special request).

It can help to pre-build device commands for requests in advance.
Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This routine is
called by elv_next_request(), i.e. typically just before servicing a request.
(The prepare function would not be called for requests that have RQF_DONTPREP
enabled)
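
The interplay between the prepare hook and the "don't prepare" flag can be
sketched in userspace C as below. The names are hypothetical stand-ins
(SKETCH_DONTPREP for the flag, sketch_next_request for the role that
elv_next_request() plays), the command bytes are placeholders, and having the
hook set the flag itself is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DONTPREP 0x1u /* models the "don't prepare" request flag */
#define SKETCH_CMD_LEN  16   /* rq->cmd is 16 bytes per the text above */

struct sketch_request {
	unsigned int flags;
	unsigned long sector;
	unsigned char cmd[SKETCH_CMD_LEN];
};

struct sketch_queue {
	struct sketch_request *pending; /* at most one request, for brevity */
	void (*prep_rq_fn)(struct sketch_request *rq);
};

/* Example prepare hook: build a placeholder command and mark it prepared. */
void sketch_prep(struct sketch_request *rq)
{
	assert(rq != NULL);
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd[0] = 0xA5;                               /* placeholder opcode */
	rq->cmd[1] = (unsigned char)(rq->sector & 0xff); /* placeholder operand */
	rq->flags |= SKETCH_DONTPREP;                    /* do not prepare twice */
}

/* Prepare the request if needed, then hand it to the driver. */
struct sketch_request *sketch_next_request(struct sketch_queue *q)
{
	struct sketch_request *rq = q->pending;

	if (rq && q->prep_rq_fn && !(rq->flags & SKETCH_DONTPREP))
		q->prep_rq_fn(rq);
	q->pending = NULL;
	return rq;
}
```

A request that arrives with the flag already set, for example one whose
command was built by the submitter, passes through with its command bytes
untouched.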

Aside:
  Pre-building could possibly even be done early, i.e. before placing the
  request on the queue, rather than construct the command on the fly in the
  driver while servicing the request queue when it may affect latencies in
  interrupt context or responsiveness in general. One way to add early
  pre-building would be to do it whenever we fail to merge on a request.
  Now REQ_NOMERGE is set in the request flags to skip this one in the future,
  which means that it will not change before we feed it to the device. So
  the pre-builder hook can be invoked there.


2. Flexible and generic but minimalist i/o structure/descriptor
===============================================================

2.1 Reason for a new structure and requirements addressed
---------------------------------------------------------

Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
buffer heads for a contiguous i/o request. This led to certain inefficiencies
when it came to large i/o requests and readv/writev style operations, as it
forced such requests to be broken up into small chunks before being passed
on to the generic block layer, only to be merged by the i/o scheduler
when the underlying device was capable of handling the i/o in one shot.
Also, using the buffer head as an i/o structure for i/os that didn't originate
from the buffer cache unnecessarily added to the weight of the descriptors
which were generated for each such chunk.

The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.

1.  Should be appropriate as a descriptor for both raw and buffered i/o -
    avoid cache related fields which are irrelevant in the direct/page i/o path,
    or filesystem block size alignment restrictions which may not be relevant
    for raw i/o.
2.  Ability to represent high-memory buffers (which do not have a virtual
    address mapping in kernel address space).
3.  Ability to represent large i/os without unnecessarily breaking them up
    (i.e. greater than PAGE_SIZE chunks in one shot).
4.  At the same time, ability to retain independent identity of i/os from
    different sources or i/o units requiring individual completion (e.g. for
    latency reasons).
5.  Ability to represent an i/o involving multiple physical memory segments
    (including non-page aligned page fragments, as specified via readv/writev)
    without unnecessarily breaking it up, if the underlying device is capable of
    handling it.
6.  Preferably should be based on a memory descriptor structure that can be
    passed around different types of subsystems or layers, maybe even
    networking, without duplication or extra copies of data/descriptor fields
    themselves in the process.
7.  Ability to handle the possibility of splits/merges as the structure passes
    through layered drivers (lvm, md, evms), with minimal overhead.

The solution was to define a new structure (bio) for the block layer,
instead of using the buffer head structure (bh) directly, the idea being
avoidance of some associated baggage and limitations. The bio structure
is uniformly used for all i/o at the block layer; it forms a part of the
bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
mapped to bio structures.

2.2 The bio struct
------------------

The bio structure uses a vector representation pointing to an array of tuples
of <page, offset, len> to describe the i/o buffer, and has various other
fields describing i/o parameters and state that needs to be maintained for
performing the i/o.

Notice that this representation means that a bio has no virtual address
mapping at all (unlike buffer heads).

::

  struct bio_vec {
       struct page     *bv_page;
       unsigned short  bv_len;
       unsigned short  bv_offset;
  };

  /*
   * main unit of I/O for the block layer and lower layers (ie drivers)
   */
  struct bio {
       struct bio          *bi_next;        /* request queue link */
       struct block_device *bi_bdev;        /* target device */
       unsigned long       bi_flags;        /* status, command, etc */
       unsigned long       bi_opf;          /* low bits: r/w, high: priority */

       unsigned int        bi_vcnt;         /* how many bio_vec's */
       struct bvec_iter    bi_iter;         /* current index into bio_vec array */

       unsigned int        bi_size;         /* total size in bytes */
       unsigned short      bi_hw_segments;  /* segments after DMA remapping */
       unsigned int        bi_max;          /* max bio_vecs we can hold
                                               used as index into pool */
       struct bio_vec      *bi_io_vec;      /* the actual vec list */
       bio_end_io_t        *bi_end_io;      /* bi_end_io (bio) */
       atomic_t            bi_cnt;          /* pin count: free when it hits zero */
       void                *bi_private;
  };

With this multipage bio design:

- Large i/os can be sent down in one go using a bio_vec list consisting
  of an array of <page, offset, len> fragments (similar to the way fragments
  are represented in the zero-copy network code)
- Splitting of an i/o request across multiple devices (as in the case of
  lvm or raid) is achieved by cloning the bio (where the clone points to
  the same bi_io_vec array, but with the index and size accordingly modified)
- A linked list of bios is used as before for unrelated merges [#]_ - this
  avoids reallocs and makes independent completions easier to handle.
- Code that traverses the req list can find all the segments of a bio
  by using rq_for_each_segment.  This handles the fact that a request
  has multiple bios, each of which can have multiple segments.
- Drivers which can't process a large bio in one shot can use the bi_iter
  field to keep track of the next bio_vec entry to process.
  (e.g. a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
  [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
  the bv_offset and bv_len fields]

.. [#]

	unrelated merges -- a request ends up containing two or more bios that
	didn't originate from the same place.

bi_end_io() i/o callback gets called on i/o completion of the entire bio.
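
The clone-based split and the segment traversal described in the list above
can be modelled in a few lines of userspace C. This is a sketch under
simplifying assumptions: the sketch_* types are hypothetical and much smaller
than the structures shown earlier, a bio is reduced to a window (idx, cnt)
onto a shared vector array, and splits happen only on bio_vec boundaries.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_page; /* opaque: the descriptor carries no virtual address */

struct sketch_bio_vec {
	struct sketch_page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Hypothetical, trimmed-down bio: a window [idx, idx + cnt) onto a vec array. */
struct sketch_bio {
	struct sketch_bio *bi_next;       /* next bio in the same request */
	struct sketch_bio_vec *bi_io_vec; /* possibly shared with clones */
	unsigned int idx;                 /* first bio_vec this bio covers */
	unsigned int cnt;                 /* number of bio_vecs it covers */
};

/* Total bytes described by one bio. */
unsigned int sketch_bio_bytes(const struct sketch_bio *bio)
{
	unsigned int i, bytes = 0;

	for (i = 0; i < bio->cnt; i++)
		bytes += bio->bi_io_vec[bio->idx + i].bv_len;
	return bytes;
}

/*
 * Split by cloning: `front` keeps the first n bio_vecs and `back` gets the
 * rest.  Both point at the same bi_io_vec array; only index and count differ.
 */
void sketch_bio_split(const struct sketch_bio *bio, unsigned int n,
		      struct sketch_bio *front, struct sketch_bio *back)
{
	assert(n <= bio->cnt);
	*front = *bio;
	front->cnt = n;
	*back = *bio;
	back->idx = bio->idx + n;
	back->cnt = bio->cnt - n;
}

/* Count every segment of every bio chained into a request. */
unsigned int sketch_rq_segments(const struct sketch_bio *first)
{
	unsigned int segs = 0;

	for (; first; first = first->bi_next)
		segs += first->cnt;
	return segs;
}
```

No data and no vector entries are copied by the split, which is why cloning
is cheap enough for layered drivers such as lvm or raid.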

At a lower level, drivers build a scatter gather list from the merged bios.
The scatter gather list is in the form of an array of <page, offset, len>
entries with their corresponding dma address mappings filled in at the
appropriate time. As an optimization, contiguous physical pages can be
covered by a single entry where <page> refers to the first page and <len>
covers the range of pages (up to 16 contiguous pages could be covered this
way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
the sg list.
*4882a593Smuzhiyun
*4882a593SmuzhiyunNote: Right now the only user of bios with more than one page is ll_rw_kio,
*4882a593Smuzhiyunwhich in turn means that only raw I/O uses it (direct i/o may not work
right now). The intent however is to make clustering of pages etc. possible.
The pagebuf abstraction layer from SGI also uses multi-page
*4882a593Smuzhiyunbios, but that is currently not included in the stock development kernels.
*4882a593SmuzhiyunThe same is true of Andrew Morton's work-in-progress multipage bio writeout
*4882a593Smuzhiyunand readahead patches.
*4882a593Smuzhiyun
*4882a593Smuzhiyun2.3 Changes in the Request Structure
*4882a593Smuzhiyun------------------------------------
*4882a593Smuzhiyun
The request structure is the structure that gets passed down to low level
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the driver's request_fn. The driver makes
use of the block layer helper routine elv_next_request to pull the next request
off the queue. Control or diagnostic functions might bypass the block layer and
directly invoke underlying driver entry points, passing in a specially
constructed request structure.
*4882a593Smuzhiyun
Only some relevant fields (mainly those which changed or may be referred
to in some of the discussion here) are listed below, not necessarily in
the order in which they occur in the structure (see include/linux/blkdev.h).
Refer to Documentation/block/request.rst for details about all the request
structure fields and a quick reference about the layers which are
supposed to use or modify those fields::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  struct request {
*4882a593Smuzhiyun	struct list_head queuelist;  /* Not meant to be directly accessed by
*4882a593Smuzhiyun					the driver.
*4882a593Smuzhiyun					Used by q->elv_next_request_fn
*4882a593Smuzhiyun					rq->queue is gone
*4882a593Smuzhiyun					*/
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	unsigned char cmd[16]; /* prebuilt command data block */
*4882a593Smuzhiyun	unsigned long flags;   /* also includes earlier rq->cmd settings */
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	sector_t sector; /* this field is now of type sector_t instead of int
*4882a593Smuzhiyun			    preparation for 64 bit sectors */
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun
*4882a593Smuzhiyun	/* Number of scatter-gather DMA addr+len pairs after
*4882a593Smuzhiyun	 * physical address coalescing is performed.
*4882a593Smuzhiyun	 */
*4882a593Smuzhiyun	unsigned short nr_phys_segments;
*4882a593Smuzhiyun
*4882a593Smuzhiyun	/* Number of scatter-gather addr+len pairs after
*4882a593Smuzhiyun	 * physical and DMA remapping hardware coalescing is performed.
*4882a593Smuzhiyun	 * This is the number of scatter-gather entries the driver
*4882a593Smuzhiyun	 * will actually have to deal with after DMA mapping is done.
*4882a593Smuzhiyun	 */
*4882a593Smuzhiyun	unsigned short nr_hw_segments;
*4882a593Smuzhiyun
*4882a593Smuzhiyun	/* Various sector counts */
*4882a593Smuzhiyun	unsigned long nr_sectors;  /* no. of sectors left: driver modifiable */
*4882a593Smuzhiyun	unsigned long hard_nr_sectors;  /* block internal copy of above */
*4882a593Smuzhiyun	unsigned int current_nr_sectors; /* no. of sectors left in the
*4882a593Smuzhiyun					   current segment:driver modifiable */
*4882a593Smuzhiyun	unsigned long hard_cur_sectors; /* block internal copy of the above */
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	int tag;	/* command tag associated with request */
*4882a593Smuzhiyun	void *special;  /* same as before */
*4882a593Smuzhiyun	char *buffer;   /* valid only for low memory buffers up to
*4882a593Smuzhiyun			 current_nr_sectors */
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	.
*4882a593Smuzhiyun	struct bio *bio, *biotail;  /* bio list instead of bh */
*4882a593Smuzhiyun	struct request_list *rl;
*4882a593Smuzhiyun  }
*4882a593Smuzhiyun
*4882a593SmuzhiyunSee the req_ops and req_flag_bits definitions for an explanation of the various
*4882a593Smuzhiyunflags available. Some bits are used by the block layer or i/o scheduler.
*4882a593Smuzhiyun
The behaviour of the various sector counts is almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
to the number of sectors in the current segment being processed, which could
be one of the many segments in the current bio (i.e. the i/o completion unit).
The nr_sectors value refers to the total number of sectors in the whole
request that remain to be transferred (no change). The purpose of the
hard_xxx values is for block to remember these counts every time it hands
over the request to the driver. These values are updated by block on
end_that_request_first, i.e. every time the driver completes a part of the
transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
hard_xxx values and the number of bytes transferred) and updates them on
every transfer that invokes end_that_request_first. It does the same for the
buffer, bio, bio->bi_iter fields too.
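
The bookkeeping described above can be sketched in userspace as follows. This
is only a model under simplifying assumptions: segments are represented by
their sector counts alone, and struct toy_request, toy_request_init and
toy_end_request_first are hypothetical names, not the kernel's struct request
or end_that_request_first. The return convention assumed here is 1 while
sectors remain and 0 once the request is finished.

```c
#include <assert.h>

#define TOY_MAX_SEGS 8

/* Hypothetical userspace model of the sector counters; not struct request. */
struct toy_request {
	unsigned int seg_sectors[TOY_MAX_SEGS];	/* sectors in each segment */
	unsigned int nr_segs;
	unsigned int cur_seg;			/* index of the current segment */

	/* Block-internal copies: only toy_end_request_first() writes these. */
	unsigned long hard_nr_sectors;
	unsigned int hard_cur_sectors;

	/* Working copies that a driver may read and modify. */
	unsigned long nr_sectors;
	unsigned int current_nr_sectors;
};

static void toy_request_init(struct toy_request *rq,
			     const unsigned int *segs, unsigned int n)
{
	unsigned int i;

	assert(n > 0 && n <= TOY_MAX_SEGS);
	rq->nr_segs = n;
	rq->cur_seg = 0;
	rq->hard_nr_sectors = 0;
	for (i = 0; i < n; i++) {
		rq->seg_sectors[i] = segs[i];
		rq->hard_nr_sectors += segs[i];
	}
	rq->hard_cur_sectors = segs[0];
	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = rq->hard_cur_sectors;
}

/*
 * Account for 'done' completed sectors. Returns 1 while sectors remain
 * and 0 once the whole request has been transferred.
 */
static int toy_end_request_first(struct toy_request *rq, unsigned int done)
{
	assert(done <= rq->hard_nr_sectors);
	rq->hard_nr_sectors -= done;

	/* Consume whole segments first, crossing boundaries as needed. */
	while (rq->cur_seg < rq->nr_segs && done >= rq->hard_cur_sectors) {
		done -= rq->hard_cur_sectors;
		rq->cur_seg++;
		rq->hard_cur_sectors = rq->cur_seg < rq->nr_segs ?
				       rq->seg_sectors[rq->cur_seg] : 0;
	}
	rq->hard_cur_sectors -= done;	/* partial progress in current segment */

	/* Resync the driver-visible copies from the block-internal ones. */
	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = rq->hard_cur_sectors;
	return rq->hard_nr_sectors != 0;
}
```

Because the working copies are rebuilt from the hard_xxx values on every call,
whatever the driver wrote into nr_sectors or current_nr_sectors in between is
overwritten rather than trusted.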
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe buffer field is just a virtual address mapping of the current segment
*4882a593Smuzhiyunof the i/o buffer in cases where the buffer resides in low-memory. For high
*4882a593Smuzhiyunmemory i/o, this field is not valid and must not be used by drivers.
*4882a593Smuzhiyun
*4882a593SmuzhiyunCode that sets up its own request structures and passes them down to
*4882a593Smuzhiyuna driver needs to be careful about interoperation with the block layer helper
*4882a593Smuzhiyunfunctions which the driver uses. (Section 1.3)
*4882a593Smuzhiyun
*4882a593Smuzhiyun3. Using bios
*4882a593Smuzhiyun=============
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.1 Setup/Teardown
*4882a593Smuzhiyun------------------
*4882a593Smuzhiyun
There are routines for managing the allocation, reference counting, and
freeing of bios (bio_alloc, bio_get, bio_put).
*4882a593Smuzhiyun
This makes use of Ingo Molnar's mempool implementation, which enables
subsystems like bio to maintain their own reserve memory pools for guaranteed
deadlock-free allocations during extreme VM load. For example, the VM
subsystem makes use of the block layer to write out dirty pages in order to be
able to free up memory space, a case which needs careful handling. The
allocation logic draws from the preallocated emergency reserve in situations
where it cannot allocate through normal means. If the pool is empty and it
can wait, then it would trigger action that would help free up memory or
replenish the pool (without deadlocking) and wait for availability in the pool.
If it is in IRQ context, and hence not in a position to do this, allocation
could fail if the pool is empty. In general mempool always first tries to
perform allocation without having to wait, even if it means digging into the
pool as long as it is not less than 50% full.
*4882a593Smuzhiyun
*4882a593SmuzhiyunOn a free, memory is released to the pool or directly freed depending on
*4882a593Smuzhiyunthe current availability in the pool. The mempool interface lets the
*4882a593Smuzhiyunsubsystem specify the routines to be used for normal alloc and free. In the
*4882a593Smuzhiyuncase of bio, these routines make use of the standard slab allocator.
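
The reserve-pool behaviour described above can be modelled in userspace roughly
as follows. This is a sketch, not the kernel mempool API: struct toy_pool and
the toy_pool_* functions are hypothetical, the caller is assumed to be unable
to sleep (so an empty pool simply returns NULL), and the 50% heuristic
mentioned above is not modelled. The toy_memory_pressure flag is an
illustrative hook for simulating failure of the normal allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_POOL_MIN 4	/* size of the emergency reserve */

/* Hypothetical model of a reserve pool; not the kernel mempool API. */
struct toy_pool {
	void *reserve[TOY_POOL_MIN];
	int curr_nr;			/* elements currently in reserve */
	void *(*alloc_fn)(size_t);	/* normal allocation routine */
	void (*free_fn)(void *);	/* normal free routine */
	size_t elem_size;
};

/* Illustration hook: non-zero makes the normal allocator fail. */
static int toy_memory_pressure;

static void *toy_normal_alloc(size_t size)
{
	return toy_memory_pressure ? NULL : malloc(size);
}

/* Release whatever is left in the reserve. */
static void toy_pool_destroy(struct toy_pool *p)
{
	while (p->curr_nr > 0)
		p->free_fn(p->reserve[--p->curr_nr]);
}

/* Pre-fill the reserve. Returns 0 on success, -1 on failure. */
static int toy_pool_init(struct toy_pool *p, size_t elem_size,
			 void *(*alloc_fn)(size_t), void (*free_fn)(void *))
{
	p->curr_nr = 0;
	p->alloc_fn = alloc_fn;
	p->free_fn = free_fn;
	p->elem_size = elem_size;
	while (p->curr_nr < TOY_POOL_MIN) {
		void *e = alloc_fn(elem_size);

		if (!e) {
			toy_pool_destroy(p);
			return -1;
		}
		p->reserve[p->curr_nr++] = e;
	}
	return 0;
}

/* Try the normal allocator first; dip into the reserve only if it fails. */
static void *toy_pool_alloc(struct toy_pool *p)
{
	void *e = p->alloc_fn(p->elem_size);

	if (e)
		return e;
	if (p->curr_nr > 0)
		return p->reserve[--p->curr_nr];
	return NULL;	/* a caller that can sleep would wait here instead */
}

/* Refill the reserve first; hand any surplus back to the normal free. */
static void toy_pool_free(struct toy_pool *p, void *e)
{
	assert(e != NULL);
	if (p->curr_nr < TOY_POOL_MIN)
		p->reserve[p->curr_nr++] = e;
	else
		p->free_fn(e);
}
```

The free path is what makes the guarantee work: as long as every holder
eventually frees its element, the reserve refills and a waiting allocator can
make progress.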
*4882a593Smuzhiyun
The caller of bio_alloc is expected to take certain steps to avoid
*4882a593Smuzhiyundeadlocks, e.g. avoid trying to allocate more memory from the pool while
*4882a593Smuzhiyunalready holding memory obtained from the pool.
*4882a593Smuzhiyun
*4882a593Smuzhiyun::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  [TBD: This is a potential issue, though a rare possibility
*4882a593Smuzhiyun   in the bounce bio allocation that happens in the current code, since
*4882a593Smuzhiyun   it ends up allocating a second bio from the same pool while
*4882a593Smuzhiyun   holding the original bio ]
*4882a593Smuzhiyun
*4882a593SmuzhiyunMemory allocated from the pool should be released back within a limited
*4882a593Smuzhiyunamount of time (in the case of bio, that would be after the i/o is completed).
*4882a593SmuzhiyunThis ensures that if part of the pool has been used up, some work (in this
*4882a593Smuzhiyuncase i/o) must already be in progress and memory would be available when it
*4882a593Smuzhiyunis over. If allocating from multiple pools in the same code path, the order
*4882a593Smuzhiyunor hierarchy of allocation needs to be consistent, just the way one deals
*4882a593Smuzhiyunwith multiple locks.
*4882a593Smuzhiyun
The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
for a non-clone bio. There are 6 pools set up for different sized biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe bio_get() routine may be used to hold an extra reference on a bio prior
*4882a593Smuzhiyunto i/o submission, if the bio fields are likely to be accessed after the
*4882a593Smuzhiyuni/o is issued (since the bio may otherwise get freed in case i/o completion
*4882a593Smuzhiyunhappens in the meantime).
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe bio_clone_fast() routine may be used to duplicate a bio, where the clone
*4882a593Smuzhiyunshares the bio_vec_list with the original bio (i.e. both point to the
*4882a593Smuzhiyunsame bio_vec_list). This would typically be used for splitting i/o requests
*4882a593Smuzhiyunin lvm or md.
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.2 Generic bio helper Routines
*4882a593Smuzhiyun-------------------------------
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.2.1 Traversing segments and completion units in a request
*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe macro rq_for_each_segment() should be used for traversing the bios
*4882a593Smuzhiyunin the request list (drivers should avoid directly trying to do it
*4882a593Smuzhiyunthemselves). Using these helpers should also make it easier to cope
*4882a593Smuzhiyunwith block changes in the future.
*4882a593Smuzhiyun
*4882a593Smuzhiyun::
*4882a593Smuzhiyun
	struct req_iterator iter;
	struct bio_vec bvec;

	rq_for_each_segment(bvec, rq, iter)
		/* bvec is now the current segment */
*4882a593Smuzhiyun
*4882a593SmuzhiyunI/O completion callbacks are per-bio rather than per-segment, so drivers
*4882a593Smuzhiyunthat traverse bio chains on completion need to keep that in mind. Drivers
*4882a593Smuzhiyunwhich don't make a distinction between segments and completion units would
*4882a593Smuzhiyunneed to be reorganized to support multi-segment bios.
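
To make the distinction between segments and completion units concrete, here
is a userspace sketch of a request holding a chain of bios, each with its own
segments. All names (struct toy_seg, struct toy_chain_bio, struct toy_chain_rq,
toy_rq_for_each_segment and the helpers) are hypothetical; the macro only
imitates the shape of rq_for_each_segment and is not its implementation.

```c
#include <assert.h>
#include <stddef.h>

struct toy_seg {
	unsigned int len;		/* bytes in this segment */
};

struct toy_chain_bio {
	struct toy_chain_bio *next;	/* next bio merged into the request */
	const struct toy_seg *segs;
	unsigned int nr_segs;
	int completed;			/* set once by the completion step */
};

struct toy_chain_rq {
	struct toy_chain_bio *bio;	/* head of the bio chain */
};

/* Outer loop walks the bios, inner loop walks each bio's segments. */
#define toy_rq_for_each_segment(seg, rq, b, i)				\
	for ((b) = (rq)->bio; (b); (b) = (b)->next)			\
		for ((i) = 0;						\
		     (i) < (b)->nr_segs && ((seg) = &(b)->segs[(i)], 1);	\
		     (i)++)

static unsigned int toy_rq_nr_segs(struct toy_chain_rq *rq)
{
	const struct toy_seg *seg;
	struct toy_chain_bio *b;
	unsigned int i, n = 0;

	toy_rq_for_each_segment(seg, rq, b, i) {
		(void)seg;	/* only counting, the segment itself is unused */
		n++;
	}
	return n;
}

static unsigned int toy_rq_bytes(struct toy_chain_rq *rq)
{
	const struct toy_seg *seg;
	struct toy_chain_bio *b;
	unsigned int i, bytes = 0;

	toy_rq_for_each_segment(seg, rq, b, i)
		bytes += seg->len;
	return bytes;
}

/* Completion is signalled once per bio, not once per segment. */
static unsigned int toy_rq_complete(struct toy_chain_rq *rq)
{
	struct toy_chain_bio *b;
	unsigned int calls = 0;

	for (b = rq->bio; b; b = b->next) {
		assert(!b->completed);	/* each bio completes exactly once */
		b->completed = 1;
		calls++;
	}
	return calls;
}
```

A request built from a three-segment bio and a one-segment bio is traversed as
four segments but completes as two units.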
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.2.2 Setting up DMA scatterlists
*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*4882a593Smuzhiyun
The blk_rq_map_sg() helper routine would be used for setting up scatter
gather lists from a request, so a driver need not do it on its own::

	nr_segments = blk_rq_map_sg(q, rq, scatterlist);
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe helper routine provides a level of abstraction which makes it easier
*4882a593Smuzhiyunto modify the internals of request to scatterlist conversion down the line
*4882a593Smuzhiyunwithout breaking drivers. The blk_rq_map_sg routine takes care of several
*4882a593Smuzhiyunthings like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
*4882a593Smuzhiyunis set) and correct segment accounting to avoid exceeding the limits which
*4882a593Smuzhiyunthe i/o hardware can handle, based on various queue properties.
*4882a593Smuzhiyun
*4882a593Smuzhiyun- Prevents a clustered segment from crossing a 4GB mem boundary
*4882a593Smuzhiyun- Avoids building segments that would exceed the number of physical
*4882a593Smuzhiyun  memory segments that the driver can handle (phys_segments) and the
*4882a593Smuzhiyun  number that the underlying hardware can handle at once, accounting for
*4882a593Smuzhiyun  DMA remapping (hw_segments)  (i.e. IOMMU aware limits).
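
The clustering and limit checks listed above can be sketched in userspace as
follows. This is a simplified model, not blk_rq_map_sg itself: struct toy_sg,
toy_within_4gb and toy_map_sg are hypothetical names, a single max_segs limit
stands in for both the phys_segments and hw_segments limits, and exceeding the
limit is reported with -1 purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scatter-gather entry: a physical address range. */
struct toy_sg {
	uint64_t addr;	/* physical start address */
	uint64_t len;	/* length in bytes */
};

#define TOY_4GB_MASK 0xffffffffULL

/* True if [addr, addr + len) stays inside one 4GB-aligned window. */
static int toy_within_4gb(uint64_t addr, uint64_t len)
{
	return (addr | TOY_4GB_MASK) == ((addr + len - 1) | TOY_4GB_MASK);
}

/*
 * Map 'n' input segments into 'out', merging neighbours that are
 * physically contiguous when 'cluster' is non-zero. Returns the number
 * of entries written, or -1 if more than 'max_segs' would be needed.
 */
static int toy_map_sg(const struct toy_sg *in, int n, struct toy_sg *out,
		      int max_segs, int cluster)
{
	int i, nsegs = 0;

	assert(n >= 0 && max_segs >= 0);
	for (i = 0; i < n; i++) {
		if (cluster && nsegs > 0) {
			struct toy_sg *prev = &out[nsegs - 1];

			/* Merge only if contiguous and not crossing 4GB. */
			if (prev->addr + prev->len == in[i].addr &&
			    toy_within_4gb(prev->addr,
					   prev->len + in[i].len)) {
				prev->len += in[i].len;
				continue;
			}
		}
		if (nsegs == max_segs)
			return -1;
		out[nsegs++] = in[i];
	}
	return nsegs;
}
```

With clustering enabled, two adjacent pages collapse into one entry, while two
pages that are adjacent but straddle the 4GB line stay separate.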
*4882a593Smuzhiyun
*4882a593SmuzhiyunRoutines which the low level driver can use to set up the segment limits:
*4882a593Smuzhiyun
blk_queue_max_hw_segments() : Sets an upper limit on the maximum number of
*4882a593Smuzhiyunhw data segments in a request (i.e. the maximum number of address/length
*4882a593Smuzhiyunpairs the host adapter can actually hand to the device at once)
*4882a593Smuzhiyun
*4882a593Smuzhiyunblk_queue_max_phys_segments() : Sets an upper limit on the maximum number
*4882a593Smuzhiyunof physical data segments in a request (i.e. the largest sized scatter list
*4882a593Smuzhiyuna driver could handle)
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.2.3 I/O completion
*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe existing generic block layer helper routines end_request,
*4882a593Smuzhiyunend_that_request_first and end_that_request_last can be used for i/o
*4882a593Smuzhiyuncompletion (and setting things up so the rest of the i/o or the next
request can be kicked off) as before. With the introduction of multi-page
*4882a593Smuzhiyunbio support, end_that_request_first requires an additional argument indicating
*4882a593Smuzhiyunthe number of sectors completed.
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.2.4 Implications for drivers that do not interpret bios
*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*4882a593Smuzhiyun
*4882a593Smuzhiyun(don't handle multiple segments)
*4882a593Smuzhiyun
Drivers that do not interpret bios, e.g. those which do not handle multiple
segments and do not support i/o into high memory addresses (require bounce
buffers) and expect only virtually mapped buffers, can access the rq->buffer
field. As before the driver should use current_nr_sectors to determine the
size of remaining data in the current segment (that is the maximum it can
transfer in one go unless it interprets segments), and rely on the block layer
end_request, or end_that_request_first/last to take care of all accounting
and transparent mapping of the next bio segment when a segment boundary
is crossed on completion of a transfer. (The end*request* functions should
be used if and only if the request has come down from the block/bio path, not
for direct access requests which only specify rq->buffer without a valid
rq->bio.)
*4882a593Smuzhiyun
*4882a593Smuzhiyun3.3 I/O Submission
*4882a593Smuzhiyun------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe routine submit_bio() is used to submit a single io. Higher level i/o
*4882a593Smuzhiyunroutines make use of this:
*4882a593Smuzhiyun
*4882a593Smuzhiyun(a) Buffered i/o:
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe routine submit_bh() invokes submit_bio() on a bio corresponding to the
*4882a593Smuzhiyunbh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
*4882a593Smuzhiyun
*4882a593Smuzhiyun(b) Kiobuf i/o (for raw/direct i/o):
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
*4882a593Smuzhiyunmaps the array to one or more multi-page bios, issuing submit_bio() to
*4882a593Smuzhiyunperform the i/o on each of these.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe embedded bh array in the kiobuf structure has been removed and no
*4882a593Smuzhiyunpreallocation of bios is done for kiobufs. [The intent is to remove the
*4882a593Smuzhiyunblocks array as well, but it's currently in there to kludge around direct i/o.]
*4882a593SmuzhiyunThus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
*4882a593Smuzhiyun
*4882a593SmuzhiyunTodo/Observation:
*4882a593Smuzhiyun
*4882a593Smuzhiyun A single kiobuf structure is assumed to correspond to a contiguous range
*4882a593Smuzhiyun of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
*4882a593Smuzhiyun So right now it wouldn't work for direct i/o on non-contiguous blocks.
*4882a593Smuzhiyun This is to be resolved.  The eventual direction is to replace kiobuf
*4882a593Smuzhiyun by kvec's.
*4882a593Smuzhiyun
*4882a593Smuzhiyun Badari Pulavarty has a patch to implement direct i/o correctly using
*4882a593Smuzhiyun bio and kvec.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun(c) Page i/o:
*4882a593Smuzhiyun
*4882a593SmuzhiyunTodo/Under discussion:
*4882a593Smuzhiyun
*4882a593Smuzhiyun Andrew Morton's multi-page bio patches attempt to issue multi-page
*4882a593Smuzhiyun writeouts (and reads) from the page cache, by directly building up
*4882a593Smuzhiyun large bios for submission completely bypassing the usage of buffer
*4882a593Smuzhiyun heads. This work is still in progress.
*4882a593Smuzhiyun
*4882a593Smuzhiyun Christoph Hellwig had some code that uses bios for page-io (rather than
*4882a593Smuzhiyun bh). This isn't included in bio as yet. Christoph was also working on a
*4882a593Smuzhiyun design for representing virtual/real extents as an entity and modifying
*4882a593Smuzhiyun some of the address space ops interfaces to utilize this abstraction rather
*4882a593Smuzhiyun than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
*4882a593Smuzhiyun abstraction, but intended to be as lightweight as possible).
*4882a593Smuzhiyun
*4882a593Smuzhiyun(d) Direct access i/o:
*4882a593Smuzhiyun
*4882a593SmuzhiyunDirect access requests that do not contain bios would be submitted differently
*4882a593Smuzhiyunas discussed earlier in section 1.3.
*4882a593Smuzhiyun
*4882a593SmuzhiyunAside:
*4882a593Smuzhiyun
*4882a593Smuzhiyun  Kvec i/o:
*4882a593Smuzhiyun
*4882a593Smuzhiyun  Ben LaHaise's aio code uses a slightly different structure instead
*4882a593Smuzhiyun  of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
*4882a593Smuzhiyun  tuples (very much like the networking code), together with a callback function
*4882a593Smuzhiyun  and data pointer. This is embedded into a brw_cb structure when passed
*4882a593Smuzhiyun  to brw_kvec_async().
*4882a593Smuzhiyun
*4882a593Smuzhiyun  Now it should be possible to directly map these kvecs to a bio. Just as while
*4882a593Smuzhiyun  cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
*4882a593Smuzhiyun  array pointer to point to the veclet array in kvecs.
*4882a593Smuzhiyun
*4882a593Smuzhiyun  TBD: In order for this to work, some changes are needed in the way multi-page
*4882a593Smuzhiyun  bios are handled today. The values of the tuples in such a vector passed in
*4882a593Smuzhiyun  from higher level code should not be modified by the block layer in the course
*4882a593Smuzhiyun  of its request processing, since that would make it hard for the higher layer
  to continue to use the vector descriptor (kvec) after i/o completes. Instead,
  all such transient state should be maintained in the request structure
  and passed on in some way to the endio completion routine.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun4. The I/O scheduler
*4882a593Smuzhiyun====================
*4882a593Smuzhiyun
The I/O scheduler, a.k.a. the elevator, is implemented in two layers: a generic
dispatch queue and specific I/O schedulers.  Unless stated otherwise, elevator
is used to refer to both parts and I/O scheduler to specific I/O schedulers.
*4882a593Smuzhiyun
The block layer implements the generic dispatch queue in `block/*.c`.
*4882a593SmuzhiyunThe generic dispatch queue is responsible for requeueing, handling non-fs
*4882a593Smuzhiyunrequests and all other subtleties.
*4882a593Smuzhiyun
Specific I/O schedulers are responsible for ordering normal filesystem
requests.  They can also choose to delay certain requests to improve
throughput or for some other purpose.  As the plural form indicates, there are
multiple I/O schedulers.  They can be built as modules but at least one should
be built into the kernel.  Each queue can choose a different one and can also
change to another one dynamically.
*4882a593Smuzhiyun
*4882a593SmuzhiyunA block layer call to the i/o scheduler follows the convention elv_xxx(). This
*4882a593Smuzhiyuncalls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
*4882a593Smuzhiyunand xxx might not match exactly, but use your imagination. If an elevator
doesn't implement a function, the switch does nothing or does some minimal
housekeeping work.
*4882a593Smuzhiyun
*4882a593Smuzhiyun4.1. I/O scheduler API
*4882a593Smuzhiyun----------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe functions an elevator may implement are: (* are mandatory)
*4882a593Smuzhiyun
*4882a593Smuzhiyun=============================== ================================================
*4882a593Smuzhiyunelevator_merge_fn		called to query requests for merge with a bio
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_merge_req_fn		called when two requests get merged. the one
*4882a593Smuzhiyun				which gets merged into the other one will be
*4882a593Smuzhiyun				never seen by I/O scheduler again. IOW, after
*4882a593Smuzhiyun				being merged, the request is gone.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_merged_fn		called when a request in the scheduler has been
*4882a593Smuzhiyun				involved in a merge. It is used in the deadline
*4882a593Smuzhiyun				scheduler for example, to reposition the request
*4882a593Smuzhiyun				if its sorting order has changed.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_allow_merge_fn		called whenever the block layer determines
*4882a593Smuzhiyun				that a bio can be merged into an existing
*4882a593Smuzhiyun				request safely. The io scheduler may still
*4882a593Smuzhiyun				want to stop a merge at this point if it
*4882a593Smuzhiyun				results in some sort of conflict internally,
*4882a593Smuzhiyun				this hook allows it to do that. Note however
*4882a593Smuzhiyun				that two *requests* can still be merged at later
*4882a593Smuzhiyun				time. Currently the io scheduler has no way to
*4882a593Smuzhiyun				prevent that. It can only learn about the fact
*4882a593Smuzhiyun				from elevator_merge_req_fn callback.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_dispatch_fn*		fills the dispatch queue with ready requests.
*4882a593Smuzhiyun				I/O schedulers are free to postpone requests by
*4882a593Smuzhiyun				not filling the dispatch queue unless @force
*4882a593Smuzhiyun				is non-zero.  Once dispatched, I/O schedulers
*4882a593Smuzhiyun				are not allowed to manipulate the requests -
*4882a593Smuzhiyun				they belong to generic dispatch queue.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_add_req_fn*		called to add a new request into the scheduler
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_former_req_fn
*4882a593Smuzhiyunelevator_latter_req_fn		These return the request before or after the
*4882a593Smuzhiyun				one specified in disk sort order. Used by the
*4882a593Smuzhiyun				block layer to find merge possibilities.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_completed_req_fn	called when a request is completed.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_set_req_fn
*4882a593Smuzhiyunelevator_put_req_fn		Must be used to allocate and free any elevator
*4882a593Smuzhiyun				specific storage for a request.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_activate_req_fn	Called when device driver first sees a request.
*4882a593Smuzhiyun				I/O schedulers can use this callback to
*4882a593Smuzhiyun				determine when actual execution of a request
*4882a593Smuzhiyun				starts.
*4882a593Smuzhiyunelevator_deactivate_req_fn	Called when device driver decides to delay
*4882a593Smuzhiyun				a request by requeueing it.
*4882a593Smuzhiyun
*4882a593Smuzhiyunelevator_init_fn*
*4882a593Smuzhiyunelevator_exit_fn		Allocate and free any elevator specific storage
*4882a593Smuzhiyun				for a queue.
*4882a593Smuzhiyun=============================== ================================================
*4882a593Smuzhiyun
*4882a593Smuzhiyun4.2 Request flows seen by I/O schedulers
*4882a593Smuzhiyun----------------------------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunAll requests seen by I/O schedulers strictly follow one of the following three
*4882a593Smuzhiyunflows.
*4882a593Smuzhiyun
*4882a593Smuzhiyun set_req_fn ->
*4882a593Smuzhiyun
*4882a593Smuzhiyun i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
*4882a593Smuzhiyun      (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
*4882a593Smuzhiyun ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
*4882a593Smuzhiyun iii. [none]
*4882a593Smuzhiyun
*4882a593Smuzhiyun -> put_req_fn
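
The three flows above can be read as a small state machine. The sketch below
checks whether a sequence of callbacks, seen for one request, follows one of
them. It is purely illustrative: the enum names, toy_step and toy_flow_valid
are hypothetical and do not exist in the elevator code.

```c
#include <assert.h>

/* Callbacks as events; names mirror the flow diagram above. */
enum toy_ev {
	EV_SET_REQ, EV_ADD_REQ, EV_MERGED, EV_MERGE_REQ, EV_DISPATCH,
	EV_ACTIVATE, EV_DEACTIVATE, EV_COMPLETED, EV_PUT_REQ
};

enum toy_st {
	ST_START, ST_SET, ST_QUEUED, ST_DISPATCHED, ST_ACTIVE,
	ST_FINISHED, ST_PUT, ST_INVALID
};

/* Advance one step; any callback not allowed in 'st' is invalid. */
static enum toy_st toy_step(enum toy_st st, enum toy_ev ev)
{
	switch (st) {
	case ST_START:
		return ev == EV_SET_REQ ? ST_SET : ST_INVALID;
	case ST_SET:
		if (ev == EV_ADD_REQ)
			return ST_QUEUED;
		return ev == EV_PUT_REQ ? ST_PUT : ST_INVALID;	/* flow iii */
	case ST_QUEUED:
		if (ev == EV_MERGED)
			return ST_QUEUED;
		if (ev == EV_DISPATCH)
			return ST_DISPATCHED;
		/* flow ii: merged into another request */
		return ev == EV_MERGE_REQ ? ST_FINISHED : ST_INVALID;
	case ST_DISPATCHED:	/* also re-entered after a requeue */
		return ev == EV_ACTIVATE ? ST_ACTIVE : ST_INVALID;
	case ST_ACTIVE:
		if (ev == EV_DEACTIVATE)
			return ST_DISPATCHED;
		return ev == EV_COMPLETED ? ST_FINISHED : ST_INVALID;
	case ST_FINISHED:
		return ev == EV_PUT_REQ ? ST_PUT : ST_INVALID;
	default:	/* nothing may follow put_req_fn or an invalid step */
		return ST_INVALID;
	}
}

/* Returns 1 if the sequence is a complete, valid flow, 0 otherwise. */
static int toy_flow_valid(const enum toy_ev *seq, int n)
{
	enum toy_st st = ST_START;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		st = toy_step(st, seq[i]);
	return st == ST_PUT;
}
```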
*4882a593Smuzhiyun
*4882a593Smuzhiyun4.3 I/O scheduler implementation
*4882a593Smuzhiyun--------------------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe generic i/o scheduler algorithm attempts to sort/merge/batch requests for
*4882a593Smuzhiyunoptimal disk scan and request servicing performance (based on generic
*4882a593Smuzhiyunprinciples and device capabilities), optimized for:
*4882a593Smuzhiyun
*4882a593Smuzhiyuni.   improved throughput
*4882a593Smuzhiyunii.  improved latency
*4882a593Smuzhiyuniii. better utilization of h/w & CPU time
*4882a593Smuzhiyun
*4882a593SmuzhiyunCharacteristics:
*4882a593Smuzhiyun
i. Binary tree

AS and deadline i/o schedulers use red-black binary trees for disk position
*4882a593Smuzhiyunsorting and searching, and a fifo linked list for time-based searching. This
*4882a593Smuzhiyungives good scalability and good availability of information. Requests are
*4882a593Smuzhiyunalmost always dispatched in disk sort order, so a cache is kept of the next
*4882a593Smuzhiyunrequest in sort order to prevent binary tree lookups.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThis arrangement is not a generic block layer characteristic however, so
*4882a593Smuzhiyunelevators may implement queues as they please.
*4882a593Smuzhiyun
ii. Merge hash

AS and deadline use a hash table indexed by the last sector of a request. This
*4882a593Smuzhiyunenables merging code to quickly look up "back merge" candidates, even when
*4882a593Smuzhiyunmultiple I/O streams are being performed at once on one disk.
*4882a593Smuzhiyun
*4882a593Smuzhiyun"Front merges", a new request being merged at the front of an existing request,
*4882a593Smuzhiyunare far less common than "back merges" due to the nature of most I/O patterns.
*4882a593SmuzhiyunFront merges are handled by the binary trees in AS and deadline schedulers.
*4882a593Smuzhiyun
*4882a593Smuzhiyuniii. Plugging the queue to batch requests in anticipation of opportunities for
*4882a593Smuzhiyun     merge/sort optimizations
*4882a593Smuzhiyun
Plugging is an approach that the current i/o scheduling algorithm resorts to so
that it collects enough requests in the queue to be able to take
advantage of the sorting/merging logic in the elevator. If the
queue is empty when a request comes in, then it plugs the request queue
(sort of like plugging a bath tub to get fluid to build up)
till it fills up with a few more requests, before starting to service
the requests. This provides an opportunity to merge/sort the requests before
passing them down to the device. There are various conditions under which the
queue is unplugged (to open up the flow again), either through a scheduled task
or on demand. For example, wait_on_buffer triggers the unplugging
through sync_buffer() running blk_run_address_space(mapping). Or the caller
can do it explicitly through blk_unplug(bdev). So in the read case,
the queue gets explicitly unplugged as part of waiting for completion on that
buffer.
*4882a593Smuzhiyun
*4882a593SmuzhiyunAside:
*4882a593Smuzhiyun  This is kind of controversial territory, as it's not clear if plugging is
*4882a593Smuzhiyun  always the right thing to do. Devices typically have their own queues,
*4882a593Smuzhiyun  and allowing a big queue to build up in software, while letting the device be
*4882a593Smuzhiyun  idle for a while may not always make sense. The trick is to handle the fine
*4882a593Smuzhiyun  balance between when to plug and when to open up. Also now that we have
*4882a593Smuzhiyun  multi-page bios being queued in one shot, we may not need to wait to merge
*4882a593Smuzhiyun  a big request from the broken up pieces coming by.
*4882a593Smuzhiyun
*4882a593Smuzhiyun4.4 I/O contexts
*4882a593Smuzhiyun----------------
*4882a593Smuzhiyun
I/O contexts provide a dynamically allocated per process data area. They may
be used in I/O schedulers, and in the block layer (could be used for IO
statistics or priorities, for example). See `*io_context` in
block/ll_rw_blk.c, and as-iosched.c for an example of usage in an i/o scheduler.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun5. Scalability related changes
*4882a593Smuzhiyun==============================
*4882a593Smuzhiyun
*4882a593Smuzhiyun5.1 Granular Locking: io_request_lock replaced by a per-queue lock
*4882a593Smuzhiyun------------------------------------------------------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe global io_request_lock has been removed as of 2.5, to avoid
*4882a593Smuzhiyunthe scalability bottleneck it was causing, and has been replaced by more
*4882a593Smuzhiyungranular locking. The request queue structure has a pointer to the
*4882a593Smuzhiyunlock to be used for that queue. As a result, locking can now be
*4882a593Smuzhiyunper-queue, with a provision for sharing a lock across queues if
necessary (e.g. the scsi layer sets the queue lock pointers to the
corresponding adapter lock, which results in a per host locking
granularity). The locking semantics are the same, i.e. locking is
still imposed by the block layer, grabbing the lock before
request_fn execution, which means that lots of older drivers
*4882a593Smuzhiyunshould still be SMP safe. Drivers are free to drop the queue
*4882a593Smuzhiyunlock themselves, if required. Drivers that explicitly used the
*4882a593Smuzhiyunio_request_lock for serialization need to be modified accordingly.
*4882a593SmuzhiyunUsually it's as easy as adding a global lock::
*4882a593Smuzhiyun
*4882a593Smuzhiyun	static DEFINE_SPINLOCK(my_driver_lock);
*4882a593Smuzhiyun
*4882a593Smuzhiyunand passing the address to that lock to blk_init_queue().
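
As a rough userspace illustration of the lock-pointer idea described above
(not kernel code), the sketch below models a queue that stores a pointer to
its lock, so that several queues can share one lock while another queue uses
its own. All names here (toy_spinlock, toy_queue, toy_init_queue,
toy_run_queue) are hypothetical, and the C11 atomic_flag spinlock merely
stands in for the kernel's spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's spinlock_t (assumption). */
struct toy_spinlock {
	atomic_flag held;
};

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void toy_spin_lock(struct toy_spinlock *lock)
{
	/* Spin until the flag was previously clear, i.e. we took the lock. */
	while (atomic_flag_test_and_set_explicit(&lock->held,
						 memory_order_acquire))
		;
}

static void toy_spin_unlock(struct toy_spinlock *lock)
{
	atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/* Hypothetical queue: it stores a pointer to a lock, not the lock itself. */
struct toy_queue {
	struct toy_spinlock *queue_lock;
	unsigned int nr_dispatched;
};

/* Mirrors the idea of handing the lock address over at queue init time. */
static void toy_init_queue(struct toy_queue *q, struct toy_spinlock *lock)
{
	assert(q != NULL && lock != NULL);
	q->queue_lock = lock;
	q->nr_dispatched = 0;
}

/* The "block layer" takes the queue's lock before calling the driver. */
static void toy_run_queue(struct toy_queue *q,
			  void (*request_fn)(struct toy_queue *))
{
	toy_spin_lock(q->queue_lock);
	request_fn(q);
	toy_spin_unlock(q->queue_lock);
}

/* Trivial driver strategy routine: it runs with the queue lock held. */
static void toy_request_fn(struct toy_queue *q)
{
	q->nr_dispatched++;
}

/* One lock shared per "host", and one private lock for a single disk. */
static struct toy_spinlock toy_host_lock = TOY_SPINLOCK_INIT;
static struct toy_spinlock toy_disk_lock = TOY_SPINLOCK_INIT;
```

Initializing two queues with ``&toy_host_lock`` gives them per-host
serialization, whereas a queue initialized with ``&toy_disk_lock`` is
serialized independently of them.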
*4882a593Smuzhiyun
5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
----------------------------------------------------------------

The sector number used in the bio structure has been changed to sector_t,
which can be defined as a 64-bit type in preparation for 64-bit sector support.
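
To make the motivation concrete, here is a small userspace sketch (not
kernel code) of what goes wrong when a sector number is held in a 32-bit
type. It assumes 512-byte sectors; the toy_* types and helpers are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9	/* 512-byte sectors, assumed in this sketch */

typedef uint32_t toy_sector32_t;	/* hypothetical 32-bit sector type */
typedef uint64_t toy_sector64_t;	/* hypothetical 64-bit sector type */

/* Convert a byte offset on the device into a sector number. */
static toy_sector64_t toy_offset_to_sector(uint64_t byte_offset)
{
	return byte_offset >> TOY_SECTOR_SHIFT;
}

/* Report whether the sector for this offset survives a 32-bit type. */
static int toy_fits_in_32bit(uint64_t byte_offset)
{
	toy_sector64_t sector = toy_offset_to_sector(byte_offset);

	return sector == (toy_sector64_t)(toy_sector32_t)sector;
}

/* Narrow to 32 bits; the caller must have checked that the value fits. */
static toy_sector32_t toy_narrow_sector(toy_sector64_t sector)
{
	assert(sector <= UINT32_MAX);
	return (toy_sector32_t)sector;
}

/* Largest capacity in bytes addressable with 32-bit sector numbers. */
static uint64_t toy_max_bytes_32bit(void)
{
	return ((uint64_t)UINT32_MAX + 1) << TOY_SECTOR_SHIFT;
}
```

Under these assumptions a 32-bit sector number can address at most 2 TiB,
and an offset of 3 TiB maps to a sector number that no longer fits in
32 bits.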
*4882a593Smuzhiyun
6. Other Changes/Implications
=============================

6.1 Partition re-mapping handled by the generic block layer
-----------------------------------------------------------

In 2.5 some of the gendisk/partition related code has been reorganized.
The generic block layer now performs partition remapping early and thus
provides drivers with a sector number relative to the whole device, so a
driver no longer has to take the partition number into account in order to
arrive at the true sector number. The routine blk_partition_remap() is
invoked by submit_bio_noacct even before invoking the queue specific
->submit_bio, so the i/o scheduler also gets to operate on whole disk sector
numbers. This should typically not require changes to block drivers; a
driver simply never gets to invoke its own partition sector offset
calculations, since all bios sent are offset from the beginning of the device.
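
The remapping step can be pictured with the following userspace sketch. It
is only an illustration of the idea and not the kernel's
blk_partition_remap() implementation; the toy_partition and toy_bio
structures and the toy_partition_remap() helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_sector_t;

/* Hypothetical partition table entry: extent on the whole disk, in sectors. */
struct toy_partition {
	toy_sector_t start_sect;
	toy_sector_t nr_sects;
};

/* Hypothetical bio: partno 0 means "whole disk". */
struct toy_bio {
	unsigned int partno;
	toy_sector_t sector;
};

/*
 * Rewrite a partition-relative bio into a whole-disk bio.
 * parts[i] describes partition number i + 1.
 * Returns 0 on success, -1 if the bio falls outside its partition.
 */
static int toy_partition_remap(struct toy_bio *bio,
			       const struct toy_partition *parts,
			       size_t nr_parts)
{
	const struct toy_partition *p;

	assert(bio != NULL && parts != NULL);

	if (bio->partno == 0)
		return 0;		/* already relative to the whole disk */
	if (bio->partno > nr_parts)
		return -1;		/* no such partition */

	p = &parts[bio->partno - 1];
	if (bio->sector >= p->nr_sects)
		return -1;		/* beyond the end of the partition */

	bio->sector += p->start_sect;	/* now an offset from start of disk */
	bio->partno = 0;
	return 0;
}
```

Because the remap happens before the bio reaches the queue, everything
downstream of it (scheduler and driver alike) only ever sees partno 0 and
whole-disk sector numbers in this model.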
*4882a593Smuzhiyun
*4882a593Smuzhiyun
7. A Few Tips on Migration of older drivers
===========================================

Old-style drivers that just use CURRENT and ignore clustered requests
may not need much change. The generic layer will automatically handle
clustered requests, multi-page bios, etc. for the driver.

For a low performance driver, or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.

The following are some points to keep in mind when converting old drivers
to bio.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

end_that_request_first now takes an additional number_of_sectors argument.
It used to always handle just the first buffer_head in a request; now
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.
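
The "loop over as many sectors as specified, at segment granularity"
behaviour can be sketched in userspace as follows. This is a simplified
model, not the kernel routine: the toy_segment and toy_request structures
are hypothetical, sectors that do not cover a whole segment are simply left
pending, and the return convention (1 while work remains, 0 when the request
is finished) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment of a request, sized in sectors. */
struct toy_segment {
	unsigned int nr_sectors;
	int completed;
};

/* Hypothetical request: an array of segments completed front to back. */
struct toy_request {
	struct toy_segment *segs;
	size_t nr_segs;
	size_t next;		/* index of the first uncompleted segment */
};

/*
 * Complete up to nr_sectors from the front of the request, one whole
 * segment at a time. Sectors that do not cover a whole segment are left
 * pending (a simplification made for this sketch).
 * Returns 1 while segments remain pending, 0 once the request is done.
 */
static int toy_end_request_first(struct toy_request *rq,
				 unsigned int nr_sectors)
{
	assert(rq != NULL);

	while (rq->next < rq->nr_segs &&
	       nr_sectors >= rq->segs[rq->next].nr_sectors) {
		nr_sectors -= rq->segs[rq->next].nr_sectors;
		rq->segs[rq->next].completed = 1;
		rq->next++;
	}
	return rq->next < rq->nr_segs;
}
```

A caller in this model keeps invoking the helper as the hardware reports
progress, and performs its final per-request cleanup once it returns 0.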
*4882a593Smuzhiyun
bh->b_end_io is now replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size
etc. per queue now. Drivers that used to define their own merge functions
to handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the
correct absolute location; this is done by the block layer. So
where a driver received a request like this before::

	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
	rq->sector = 0;			/* first sector on hda5 */

it will now see::

	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
	rq->sector = 123128;		/* offset from start of disk */

As mentioned, there is no virtual mapping of a bio. For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (dma_map_page for a single segment or
dma_map_sg for scatter gather) to be able to ship it to the device. For
PIO drivers (or drivers that need to revert to PIO transfer once in a
while, IDE for example), where the CPU is doing the actual data
transfer, a virtual mapping is needed. If the driver supports highmem I/O
(Sec 1.1, (ii)), it needs to use kmap_atomic or similar to temporarily map
a bio into the virtual address space.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
8. Prior/Related/Impacted patches
=================================

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
-----------------------------------------------------

- original kiobuf & raw i/o patches (now in 2.4 tree)
- direct kiobuf based i/o to devices (no intermediate bh's)
- page i/o using kiobuf
- kiobuf splitting for lvm (mkp)
- elevator support for kiobuf request merging (axboe)

8.2. Zero-copy networking (Dave Miller)
---------------------------------------

8.3. SGI XFS - pagebuf patches - use of kiobufs
-----------------------------------------------

8.4. Multi-page pioent patch for bio (Christoph Hellwig)
--------------------------------------------------------

8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
--------------------------------------------------------------------

8.6. Async i/o implementation patch (Ben LaHaise)
-------------------------------------------------

8.7. EVMS layering design (IBM EVMS team)
-----------------------------------------

8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
-------------------------------------------------------------------------------------

    => larger contiguous physical memory buffers

8.9. VM reservations patch (Ben LaHaise)
----------------------------------------

8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
----------------------------------------------------------

8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
---------------------------------------------------------------------------

8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
-------------------------------------------------------------------------------

8.13. Priority based i/o scheduler - prepatches (Arjan van de Ven)
------------------------------------------------------------------

8.14. IDE Taskfile i/o patch (Andre Hedrick)
--------------------------------------------

8.15. Multi-page writeout and readahead patches (Andrew Morton)
---------------------------------------------------------------

8.16. Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)
-----------------------------------------------------------------------
*4882a593Smuzhiyun
9. Other References
===================

9.1 The Splice I/O Model
------------------------

Larry McVoy (and subsequent discussions on lkml, and Linus' comments - Jan 2001)

9.2 Discussions about kiobuf and bh design
------------------------------------------

On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
initial thoughts that led to bio were brought up in this discussion thread)

9.3 Discussions on mempool on lkml - Dec 2001.
----------------------------------------------