=====================================================
Notes on the Generic Block Layer Rewrite in Linux 2.5
=====================================================
Notes written by:
  - Jens Axboe <jens.axboe@oracle.com>
  - Suparna Bhattacharya <suparna@in.ibm.com>

I/O scheduler portions updated by:
  - Nick Piggin <npiggin@kernel.dk>
These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite.
Credits:
---------

2.5 bio rewrite:
  - Jens Axboe <jens.axboe@oracle.com>

Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.

The following people helped with review comments and inputs for this
document:

  - Christoph Hellwig <hch@infradead.org>
  - Arjan van de Ven <arjanv@redhat.com>
  - Randy Dunlap <rdunlap@xenotime.net>
  - Andre Hedrick <andre@linux-ide.org>

The following people helped with fixes/contributions to the bio patches
while it was still work-in-progress:

  - David S. Miller <davem@redhat.com>
Description of Contents:
-------------------------

1.1 Tuning based on low level device / driver capabilities
    - Per-queue parameters
    - Highmem I/O support
    - I/O scheduler modularization
1.3.1 Pre-built commands
2.2 The bio struct in detail (multi-page io unit)
6.1 Partition re-mapping handled by the generic block layer
Let us discuss the changes in the context of how some overall goals for the
block layer are addressed.

1. Scope for tuning the generic logic to satisfy various requirements
======================================================================

The block layer design supports adaptable abstractions to handle common
processing, with the ability to tune the logic to an appropriate extent
depending on the nature of the device and the requirements of the caller.
1.1 Tuning based on low level device / driver capabilities
-----------------------------------------------------------

Sophisticated devices with large built-in caches, intelligent i/o scheduling
optimizations and high memory DMA support differ widely from simpler devices
in what processing they need from the generic layer. Knowledge of such
device/driver capabilities is therefore exported to, and
used at the generic block layer to take the right decisions on
behalf of the driver.
Tuning at a per-queue level:

i. Per-queue limits/values exported to the generic layer by the driver

Various parameters that the generic i/o scheduler logic uses are set at
a per-queue level (e.g maximum request size, maximum number of segments in
a scatter-gather list, logical block size).
Some of these values could move into the block device structure in the
future. Some characteristics of the device/driver are exported as request
queue limits and queue flags, for example:

- The request queue's max_sectors, which is a soft size limit in
  units of 512 byte sectors that the core kernel may vary dynamically.

- The request queue's max_hw_sectors, which is a hard limit on the
  largest request the driver/hardware can handle, also in units of
  512 byte sectors.

- QUEUE_FLAG_CLUSTER (see 3.2.2)

- QUEUE_FLAG_QUEUED (see 3.2.4)

A driver sets such limits up when it initializes its queue, as sketched
below.
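The following is a minimal sketch of a driver exporting its per-queue limits
at queue setup time. It assumes the 2.6-era blk_queue_* helpers (several of
these were later renamed, e.g. blk_queue_hardsect_size() became
blk_queue_logical_block_size()); the mydev_* identifiers and the particular
limit values are hypothetical::

    #include <linux/blkdev.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(mydev_lock);
    static struct request_queue *mydev_queue;

    static void mydev_request_fn(struct request_queue *q)
    {
            /* pull requests off q and service them; see section 3.2.3 */
    }

    static int mydev_setup_queue(void)
    {
            mydev_queue = blk_init_queue(mydev_request_fn, &mydev_lock);
            if (!mydev_queue)
                    return -ENOMEM;

            /* soft limit, in 512 byte sectors (128KB per request here) */
            blk_queue_max_sectors(mydev_queue, 256);
            /* scatter-gather list length limits */
            blk_queue_max_phys_segments(mydev_queue, 64);
            blk_queue_max_hw_segments(mydev_queue, 64);
            /* logical block size of the device */
            blk_queue_hardsect_size(mydev_queue, 512);
            return 0;
    }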
ii. High-mem i/o capabilities are now considered the default

The generic bounce buffer logic, present in 2.4, where the block layer would
by default copyin/out i/o requests on high-memory buffers to low-memory buffers
assuming that the driver wouldn't be able to handle it directly, has been
changed in 2.5. The bounce logic is now applied only for memory ranges
for which the device cannot handle i/o.

In order to enable high-memory i/o where the device is capable of supporting
it, the pci dma mapping routines and associated data structures have now been
modified to accomplish a direct page -> bus translation, without requiring
a virtual address mapping (unlike the earlier scheme of virtual address
-> bus translation). So this works uniformly for high-memory pages (which
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.

Note: Please refer to :doc:`/core-api/dma-api-howto` for a discussion
on PCI high mem DMA aspects and mapping of scatter gather lists, and support
for 64 bit PCI.
It is also possible that a bounce buffer may be allocated from high-memory.
A driver describes the memory it can address to the block layer so that
bouncing is applied only when actually needed, as sketched below.
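A minimal sketch, assuming the 2.6-era blk_queue_bounce_limit() helper and
its BLK_BOUNCE_* constants (the helper was removed in later kernels once
block layer bouncing became largely obsolete); mydev_set_bounce() is a
hypothetical driver function::

    #include <linux/blkdev.h>

    static void mydev_set_bounce(struct request_queue *q, int highmem_dma_ok)
    {
            if (highmem_dma_ok)
                    /* device can address all of memory: never bounce */
                    blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
            else
                    /* device can only reach low memory: bounce highmem pages */
                    blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
    }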
iii. I/O scheduler modularization

The i/o scheduler is now a per-queue, modular entity, which decouples the
internals of the i/o scheduler from block drivers.
1.2 Tuning based on high level requirements/capabilities
---------------------------------------------------------

1.2.1 Request Priority/Latency

This comes from some of the high-performance database/middleware
requirements, where the priority or latency needs of an i/o request matter
to the submitter. What kind of support exists at the generic block layer
for this? Such settings would have to be passed down with the bio/request,
or via an i/o context or some
other upper level mechanism to communicate such settings to block.
1.3 Direct access/bypass to lower layers for special operations
----------------------------------------------------------------

There are situations where high-level code needs to have direct access to
the low level device capabilities or state, e.g. for diagnostics or special
device commands, bypassing the normal i/o path.

If higher level code supplies extra information needed to interpret
the command, then such information is associated with the request->special
field (rather than misuse the request->buffer field which is meant for the
virtual address mapping of the i/o buffer). This way the issuing path either
does not deal with
bio segments or uses the block layer end*request* functions for i/o
completion. Alternatively one could directly use the request->buffer field to
hold the virtual address of the data, and have the driver update the
request->buffer, request->sector and request->nr_sectors or
request->current_nr_sectors fields itself rather than using the block layer
end*request* helpers.

(See 2.3 or Documentation/block/request.rst for a brief explanation of
the request structure fields.)
1.3.1 Pre-built Commands

A request can be created with a pre-built custom command to be sent directly
to the device. The cmd block in the request structure has room for filling
in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)

It can help to pre-build device commands for requests in advance.
Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
prior to handing the request to the driver.

Pre-building could possibly even be done early, i.e before placing the
request on the queue, rather than constructing the command on the fly while
servicing the queue. One natural point for such early
pre-building would be to do it whenever we fail to merge on a request, since
at that point the request is not going to change any further, and
the pre-builder hook can be invoked there. A sketch of a prepare function
follows.
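Below is a minimal sketch of how a driver might hook in such a prepare
function, assuming the 2.6-era blk_queue_prep_rq() helper and request fields
(rq->sector and rq->nr_sectors were later replaced by accessor macros); the
10-byte READ(10)/WRITE(10)-style command layout and the mydev_* names are
only for illustration::

    #include <linux/blkdev.h>
    #include <linux/string.h>

    /* pre-build the device command before the driver's request_fn sees it */
    static int mydev_prep_rq(struct request_queue *q, struct request *rq)
    {
            sector_t lba = rq->sector;              /* start sector, 512b units */
            unsigned int nblocks = rq->nr_sectors;  /* total sectors requested */

            memset(rq->cmd, 0, sizeof(rq->cmd));    /* rq->cmd is 16 bytes */
            rq->cmd[0] = rq_data_dir(rq) == READ ? 0x28 : 0x2a;
            rq->cmd[2] = (lba >> 24) & 0xff;
            rq->cmd[3] = (lba >> 16) & 0xff;
            rq->cmd[4] = (lba >> 8) & 0xff;
            rq->cmd[5] = lba & 0xff;
            rq->cmd[7] = (nblocks >> 8) & 0xff;
            rq->cmd[8] = nblocks & 0xff;
            rq->cmd_len = 10;

            return BLKPREP_OK;                      /* command is ready */
    }

    static void mydev_setup_prep(struct request_queue *q)
    {
            blk_queue_prep_rq(q, mydev_prep_rq);    /* called per request */
    }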
2. New flexible and generic but minimalist i/o structure (bio)
===============================================================

Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
buffer heads for a contiguous i/o request. This forced large i/o and
readv/writev style requests to be broken up into small chunks before being
passed on to the generic block layer, only to be merged by the i/o scheduler
again where the underlying device was capable of handling the i/o in one
shot.

The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.
1. Should be appropriate as a descriptor for both raw and buffered i/o -
   avoid cache related fields which are irrelevant in the direct/page i/o path,
   or filesystem block size alignment restrictions which may not be relevant
   for raw i/o.

2. Ability to represent high-memory buffers (which do not have a virtual
   address mapping in kernel address space).

3. Ability to represent an i/o involving multiple physical memory segments
   (including non-page aligned page fragments, as specified via readv/writev)
   without unnecessarily breaking it up.
The solution was to define a new structure (bio) for the block layer,
instead of using the buffer head as the unit of i/o. The new structure
is uniformly used for all i/o at the block layer ; it forms a part of the
request structure as well, so a request is essentially a list of bios.
2.2 The bio struct
------------------

The bio is the main unit of I/O for the block layer and lower layers (ie
drivers): it describes an i/o as a vector of <page, offset, len> segments,
plus the target device, starting sector and completion callback, as sketched
below.
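A condensed sketch of the bio and bio_vec structures of that era follows
(abbreviated from include/linux/bio.h; several bookkeeping fields are
omitted, and later kernels folded bi_sector/bi_idx/bi_size into the bi_iter
member mentioned elsewhere in this document)::

    /* one contiguous chunk of a page: the vector element of a bio */
    struct bio_vec {
            struct page     *bv_page;       /* page holding the data */
            unsigned int    bv_len;         /* length of the segment in bytes */
            unsigned int    bv_offset;      /* offset of the data in the page */
    };

    /* main unit of I/O for the block layer and lower layers (ie drivers) */
    struct bio {
            sector_t                bi_sector;    /* device address, 512b sectors */
            struct bio              *bi_next;     /* chain of bios in a request */
            struct block_device     *bi_bdev;     /* target device */
            unsigned long           bi_flags;     /* status flags */
            unsigned long           bi_rw;        /* read/write plus hints */
            unsigned short          bi_vcnt;      /* number of bio_vec's */
            unsigned short          bi_idx;       /* current index into bi_io_vec */
            unsigned int            bi_size;      /* residual i/o count in bytes */
            struct bio_vec          *bi_io_vec;   /* the actual segment vector */
            bio_end_io_t            *bi_end_io;   /* completion callback */
            void                    *bi_private;  /* owner-private cookie */
    };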
With this multipage bio design:

- Large i/os can be sent down in one go using a bio_vec list consisting
  of an array of <page, offset, len> fragments (similar to the way fragments
  are represented in the zero-copy network code)
- Splitting of an i/o request across multiple devices (as in the case of
  lvm or raid) is achieved by cloning the bio (where the clone points to
  the same bi_io_vec array, with its own index and size)
- A linked list of bios is used as before for unrelated merges [#]_ - this
  avoids reallocs and makes independent completions easier to handle
- Code that traverses the req list can find all the segments of a bio
  by using the provided iterators (bio_for_each_segment and friends), whether
  the request holds one bio or many
- Drivers which can't process a large bio in one shot can use the bi_iter
  state to keep track of the current segment being processed (see the
  traversal sketch below)

.. [#] unrelated merges -- a request ends up containing two or more bios that
   did not originate from the same higher level i/o
With this in place, larger and more efficient i/o transfers at the block layer
become possible. The pagebuf abstraction layer from SGI also uses multi-page
bios.
The same is true of Andrew Morton's work-in-progress multipage bio writeout
and readahead patches.
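A minimal sketch of a driver walking every segment of a request by way of
these iterators; the form shown assumes the bi_iter based macros (kernels
that predate bi_iter pass a bio_vec pointer and an integer index instead),
and mydev_walk_request() is a hypothetical driver helper::

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/highmem.h>

    /*
     * Walk every <page, offset, len> segment of every bio in a request.
     * kmap_atomic() is needed because the pages may live in highmem and
     * thus have no permanent kernel mapping.
     */
    static void mydev_walk_request(struct request *rq)
    {
            struct req_iterator iter;
            struct bio_vec bvec;

            rq_for_each_segment(bvec, rq, iter) {
                    void *buf = kmap_atomic(bvec.bv_page);

                    /* transfer bvec.bv_len bytes at buf + bvec.bv_offset */

                    kunmap_atomic(buf);
            }
    }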
2.3 Changes in the Request Structure
------------------------------------

The request structure is the structure that gets passed down to low level
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the driver's request function; drivers make
use of block layer helper routine elv_next_request to pull the next request
off the queue. Control or diagnostic functions might bypass block and directly
invoke the underlying driver entry points, passing in a specially constructed
request structure.

Refer to Documentation/block/request.rst for details about all the request
structure fields; only a few of the changed/notable fields are excerpted
here::

    struct request {
            struct list_head queuelist;  /* Not meant to be directly
                                            accessed by the driver.
                                            Used by q->elv_next_request_fn
                                            rq->queue is gone */
            ...
            unsigned char cmd[16];  /* prebuilt command data block */
            unsigned long flags;    /* also includes earlier rq->cmd settings */
            ...
            unsigned short nr_phys_segments; /* Number of scatter-gather DMA
                                                addr+len pairs after physical
                                                address coalescing */
            unsigned short nr_hw_segments;   /* Number of scatter-gather addr+len
                                                pairs after DMA remapping hardware
                                                coalescing. This is the number of
                                                scatter-gather entries the driver
                                                will actually have to deal with */
            ...
            unsigned long nr_sectors;           /* sectors left: driver modifiable */
            unsigned long hard_nr_sectors;      /* block internal copy of above */
            unsigned short current_nr_sectors;  /* sectors left in current segment */
            unsigned long hard_cur_sectors;     /* block internal copy of the above */
            ...
    };
The flags field carries the request type and modifier bits; there are REQ_*
flags available. Some bits are used by the block layer or i/o scheduler.

The behaviour of the various sector counts is almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
to the number of sectors in the current segment being processed, which could
be one of many segments in the current bio.

The purpose of the
hard_xxx values is for block to remember these counts every time it hands
over the request to the driver. These values are updated by block on
end_that_request_first, i.e. every time the driver completes part of the
transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (from the corresponding hard_xxx
values) and updates the
buffer, bio, bio->bi_iter fields too.

The buffer field is a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
memory i/o it is not valid and must not be used. When modifying these fields
directly,
a driver needs to be careful about interoperation with the block layer helper
routines it may still be calling.
3.1 Setup/Teardown
------------------

There are routines for setting up and tearing down bios (bio_alloc, bio_put
and friends). bio allocation is backed by mempools in order to ensure
deadlock-free allocations during extreme VM load. For example, the VM
subsystem makes use of the block layer to writeout dirty pages in order to be
able to free up memory, so i/o submission must be able to make progress even
when normal allocations fail.

The biovec (the bio_vec array) is allocated separately
for a non-clone bio. There are 6 pools setup for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will pick a biovec of a suitable size from
one of these pools. A sketch of typical allocation and submission follows.
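The following is a minimal sketch of allocating, filling and submitting a
multi-page bio and waiting for its completion. It assumes the 2.6-era
interfaces (two-argument submit_bio(), bio->bi_sector, a (bio, error)
completion callback); all of these have changed shape in later kernels, and
the mydev_* names are hypothetical::

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/completion.h>

    /* bi_end_io callback: called once for the whole (multi-page) bio */
    static void mydev_end_io(struct bio *bio, int error)
    {
            complete(bio->bi_private);      /* wake up the submitter */
            bio_put(bio);
    }

    /* synchronously write nr_pages pages starting at 'sector' of 'bdev' */
    static int mydev_write_pages(struct block_device *bdev, sector_t sector,
                                 struct page **pages, int nr_pages)
    {
            DECLARE_COMPLETION_ONSTACK(done);
            struct bio *bio;
            int i;

            bio = bio_alloc(GFP_NOIO, nr_pages);    /* mempool backed */
            bio->bi_bdev = bdev;
            bio->bi_sector = sector;
            bio->bi_end_io = mydev_end_io;
            bio->bi_private = &done;

            for (i = 0; i < nr_pages; i++)
                    if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                            break;          /* hit a queue limit; stop here */

            submit_bio(WRITE, bio);
            wait_for_completion(&done);
            return 0;
    }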
3.2 Generic bio helper Routines
-------------------------------

3.2.1 Traversing segments and completion units in a request

Drivers should use the provided helpers/macros when walking the bios and
segments of a request rather than open-coding the traversal; this also makes
it easier to cope
with block changes in the future.

I/O completion callbacks are per-bio rather than per-segment, so drivers
that relied on per-segment (per buffer head) completion in 2.4 may
need to be reorganized to support multi-segment bios.
3.2.2 Setting up DMA scatterlists

The blk_rq_map_sg() helper routine is used for setting up scatter-gather
lists from a request, so a driver need not do this on its own. When
QUEUE_FLAG_CLUSTER is set it also coalesces physically adjacent segments,
subject to the queue limits; in particular the clustering logic:

- Prevents a clustered segment from crossing a 4GB mem boundary
- Avoids building segments that would exceed the number of physical
  segments, or the maximum segment size, that the driver has advertised

A sketch of mapping a request for DMA follows.
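A minimal sketch of how a driver might use it, assuming the 2.6-era
blk_rq_map_sg() and the generic DMA mapping API (sg_init_table() appeared
slightly later; older code simply zeroed the array). MYDEV_MAX_SEGMENTS and
the mydev_map_request() helper are hypothetical::

    #include <linux/blkdev.h>
    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    #define MYDEV_MAX_SEGMENTS 64   /* must match blk_queue_max_*_segments() */

    /*
     * Turn the bio segments of a request into a DMA scatterlist.
     * blk_rq_map_sg() performs the physical coalescing; dma_map_sg()
     * then maps the result for the device (an IOMMU may merge further).
     */
    static int mydev_map_request(struct device *dev, struct request *rq,
                                 struct scatterlist *sg)
    {
            int nents, mapped;

            sg_init_table(sg, MYDEV_MAX_SEGMENTS);
            nents = blk_rq_map_sg(rq->q, rq, sg);   /* <= nr_phys_segments */

            mapped = dma_map_sg(dev, sg, nents,
                                rq_data_dir(rq) == READ ?
                                        DMA_FROM_DEVICE : DMA_TO_DEVICE);

            /* program the controller with 'mapped' addr+len pairs ... */
            return mapped;
    }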
3.2.3 I/O completion

The existing generic block layer helper routines end_request,
end_that_request_first and end_that_request_last can be used for i/o
completion (and setting things up so that the rest of the i/o or the next
request can be kicked off) as before. With the introduction of multi-page
bio support, end_that_request_first takes an additional argument indicating
the number of sectors completed.

3.2.4 Implications for drivers that do not interpret bios

Drivers that do not interpret bios (don't handle multiple segments or highmem
buffers) and expect only virtually mapped buffers can access the rq->buffer
field as before, transferring one segment at a time (such a driver can't
complete the entire
transfer in one go unless it interprets segments), and rely on the block layer
end*request helpers to do the accounting and map the next segment's buffer.
(Note that this mechanism can
be used only if the request has come down from the block/bio path, not for
direct access requests which only specify rq->buffer without a valid rq->bio.)
A sketch of an old-style request handling loop built on these helpers follows.
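Below is a minimal sketch of such a simple, one-segment-at-a-time request
function, using the elv_next_request()/end_request() helpers of that era
(both have long since been removed in favour of newer interfaces); the
mydev_xfer() transfer routine is hypothetical::

    #include <linux/blkdev.h>

    /* hypothetical PIO transfer of one low-memory segment */
    static int mydev_xfer(char *buf, sector_t sector, unsigned int nsect,
                          int write)
    {
            /* ... move nsect * 512 bytes between buf and the hardware ... */
            return 0;
    }

    /*
     * Classic request_fn loop: rq->buffer is the virtual address of the
     * current segment, rq->current_nr_sectors its length; end_request()
     * completes that much and maps the next segment (or the next request).
     */
    static void mydev_request_fn(struct request_queue *q)
    {
            struct request *rq;

            while ((rq = elv_next_request(q)) != NULL) {
                    int uptodate;

                    if (!blk_fs_request(rq)) {
                            end_request(rq, 0);     /* fail non-fs requests */
                            continue;
                    }

                    uptodate = !mydev_xfer(rq->buffer, rq->sector,
                                           rq->current_nr_sectors,
                                           rq_data_dir(rq) == WRITE);
                    end_request(rq, uptodate);
            }
    }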
3.3 I/O submission
------------------

The routine submit_bio() is used to submit a single io; higher level i/o
routines build on it. The raw/direct i/o path, for example, gathers the user
pages for a transfer and
maps the array to one or more multi-page bios, issuing submit_bio() to
perform the i/o on each of these. A single such mapping is assumed to
correspond to a contiguous range of data,
so right now it wouldn't work for direct i/o on non-contiguous blocks.

Andrew Morton's multi-page bio patches attempt to issue multi-page
writeouts (and reads) directly from the page cache, building large bios
without going through buffer heads.
Christoph Hellwig had some code that uses bios for page-io (rather than
bh), which is not included in this patch set.
TBD: In order for this to work, some changes are needed in the way multi-page
bios are handled across the stack: the bio contents set up
from higher level code should not be modified by the block layer in the course
of its processing.
4. The I/O scheduler
====================

Block layer implements generic dispatch queue in `block/*.c`.
The generic dispatch queue is responsible for requeueing, handling non-fs
requests and all other subtleties.

A block layer call to the i/o scheduler follows the convention elv_xxx(). This
calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
and xxx might not match exactly, but use your imagination. If an elevator
doesn't implement a function, the switch does nothing or some minimal house
keeping work.

4.1. I/O scheduler API
----------------------

Some of the functions an elevator may implement:

elevator_allow_merge_fn
    called whenever the block layer determines
    that a bio can be merged into an existing request safely. The i/o
    scheduler may still want to stop a merge at this point if it would
    result in some sort of internal conflict; this hook allows it to do that.

elevator_dispatch_fn
    fills the dispatch queue with ready requests. I/O schedulers are free
    to postpone requests by not filling the dispatch queue unless @force
    is non-zero. Once dispatched, I/O schedulers
    are not allowed to manipulate the requests -
    they belong to the generic dispatch queue.

elevator_former_req_fn, elevator_latter_req_fn
    return the request before or after the one specified, in disk sort
    order. Used by the
    block layer to find merge possibilities.
4.2 Request flows seen by I/O schedulers
----------------------------------------

All requests seen by I/O schedulers strictly follow one of these flows::

    set_req_fn ->

    i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
         (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
    ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
    iii. [none]

    -> put_req_fn
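For reference, a heavily abridged sketch of how an I/O scheduler of that era
plugged the corresponding hooks into the elevator switch. It mirrors the
shape of the simple noop scheduler; the ops table and registration interface
were reworked several times since, so the field names are only indicative,
and a real scheduler keeps per-queue data allocated through its init hook
rather than the single global FIFO used here for brevity::

    #include <linux/blkdev.h>
    #include <linux/elevator.h>
    #include <linux/module.h>

    static LIST_HEAD(myiosched_fifo);   /* one global FIFO, for brevity only */

    /* add_req_fn: queue the request in the scheduler's own structure */
    static void myiosched_add_request(struct request_queue *q,
                                      struct request *rq)
    {
            list_add_tail(&rq->queuelist, &myiosched_fifo);
    }

    /* dispatch_fn: move a request to the generic dispatch queue */
    static int myiosched_dispatch(struct request_queue *q, int force)
    {
            struct request *rq;

            if (list_empty(&myiosched_fifo))
                    return 0;

            rq = list_entry(myiosched_fifo.next, struct request, queuelist);
            list_del_init(&rq->queuelist);
            elv_dispatch_sort(q, rq);   /* now owned by the dispatch queue */
            return 1;
    }

    static struct elevator_type myiosched = {
            .ops = {
                    .elevator_add_req_fn  = myiosched_add_request,
                    .elevator_dispatch_fn = myiosched_dispatch,
            },
            .elevator_name  = "myiosched",
            .elevator_owner = THIS_MODULE,
    };

    static int __init myiosched_init(void)
    {
            elv_register(&myiosched);   /* plug into the elevator switch */
            return 0;
    }
    module_init(myiosched_init);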
4.3 I/O scheduler implementation
--------------------------------

Typically an i/o scheduler keeps one data structure for sector-based
sorting and searching, and a fifo linked list for time-based searching. This
arrangement is not a generic block layer characteristic however, so
elevators may implement their queues as they please.

Also, with
multi-page bios being queued in one shot, we may not need to wait to merge
small requests as often as before.
4.4 I/O contexts
----------------

I/O contexts provide a dynamically allocated per-process data area. They may
be used in I/O schedulers, and in the block layer (could be used for IO stats,
priorities for example). See `*io_context` in block/ll_rw_blk.c, and
as-iosched.c for an example of usage in an i/o scheduler.
5.1 Granular Locking: io_request_lock replaced by a per-queue lock
------------------------------------------------------------------

The global io_request_lock has been removed as of 2.5, to avoid the
scalability bottleneck it was causing, and has been replaced by more granular
locking. The request queue structure holds a pointer to the lock to be used
for that queue, so locking can now be done on a
per-queue basis, with a provision for sharing a lock across queues if
necessary (e.g. when one controller/driver services several queues). The
locking discipline around the request function is
still imposed by the block layer, grabbing the lock before
invoking the driver's request_fn, as before. A sketch follows.
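A minimal sketch of a driver supplying its own lock (shared here between the
queues of one controller), assuming the 2.6-era blk_init_queue() interface;
the myctrl_* names are hypothetical::

    #include <linux/blkdev.h>
    #include <linux/spinlock.h>

    /* one lock shared by all queues of this controller */
    static DEFINE_SPINLOCK(myctrl_lock);

    static struct request_queue *myctrl_alloc_queue(request_fn_proc *rfn)
    {
            /* the block layer takes myctrl_lock before calling rfn */
            return blk_init_queue(rfn, &myctrl_lock);
    }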
5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
----------------------------------------------------------------
6.1 Partition re-mapping handled by the generic block layer
------------------------------------------------------------

Now the generic block layer performs partition-remapping early and thus
provides drivers with a sector number relative to the whole device, rather
than leaving it to them to account for the partition start. The remapping is
done in
submit_bio_noacct even before invoking the queue specific ->submit_bio,
so the i/o scheduler also operates on absolute sector numbers. This
should typically not require changes to block drivers, it just never gets
to see a partition-relative sector number any more.
7. A few tips on migration of older drivers
===========================================

Old-style drivers that just use CURRENT and ignore clustered requests
may not need much change.  The generic layer will automatically handle
clustered requests, multi-page bios, etc for the driver.

For a low performance driver or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.
Drivers should use elv_next_request to pick up requests and no longer
walk the request list directly
(struct request->queue has been removed).

end_that_request_first now takes an additional number_of_sectors argument;
instead of handling just the first buffer head of a request,
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio) instead.

If the driver used to drop the io_request_lock around parts of its
request_fn, then it just needs to replace that with q->queue_lock instead.
As mentioned in 6.1, drivers no longer have to map a partition-relative
request to the
correct absolute location anymore, this is done by the block layer, so
where a driver used to see something like this::

    rq->rq_dev = mk_kdev(3, 5);    /* /dev/hda5 */
    rq->sector = 0;                /* first sector on hda5 */

it will now see::

    rq->rq_dev = mk_kdev(3, 0);    /* /dev/hda */
    rq->sector = 123128;           /* offset from start of disk */
8. A list of prior/related/impacted patches/ideas
=================================================

- orig kiobuf & raw i/o patches (now in 2.4 tree)
- direct kiobuf based i/o to devices (no intermediate bh's)
- page i/o using kiobuf
- kiobuf splitting for lvm (mkp)
- elevator support for kiobuf request merging (axboe)
8.2. Zero-copy networking (Dave Miller)
---------------------------------------

8.3. SGI XFS - pagebuf patches - use of kiobufs
-----------------------------------------------

8.4. Multi-page pioent patch for bio (Christoph Hellwig)
--------------------------------------------------------

8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
--------------------------------------------------------------------

8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
---------------------------------------------------------------------------

8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
-------------------------------------------------------------------------------

8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
------------------------------------------------------------------

8.15 Multi-page writeout and readahead patches (Andrew Morton)
---------------------------------------------------------------
9. Other References/Discussion Threads
======================================

9.1 The Splice I/O Model
------------------------

Larry McVoy (and subsequent discussions on lkml, and Linus' comments - Jan
2001)

9.2 Discussions about kiobuf and bh design
------------------------------------------

On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
initial thoughts that led to bio were brought up in this discussion thread)

9.3 Discussions on mempool on lkml - Dec 2001.
----------------------------------------------