==========================
BFQ (Budget Fair Queueing)
==========================

BFQ is a proportional-share I/O scheduler, with some extra
low-latency capabilities. In addition to cgroups support (blkio or io
controllers), BFQ's main features are:

- BFQ guarantees a high system and application responsiveness, and a
  low latency for time-sensitive applications, such as audio or video
  players;
- BFQ distributes bandwidth, and not just time, among processes or
  groups (switching back to time distribution when needed to keep
  throughput high).

In its default configuration, BFQ favors latency over
throughput. So, when needed for achieving a lower latency, BFQ builds
schedules that may lead to a lower throughput. If your main or only
goal, for a given device, is to achieve the maximum-possible
throughput at all times, then switch off all low-latency heuristics
for that device, by setting low_latency to 0. See Section 3 for
details on how to configure BFQ for the desired tradeoff between
latency and throughput, or on how to maximize throughput.

Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
processing. To give an idea of this overhead, the total,
single-lock-protected, per-request processing time of BFQ (i.e., the
sum of the execution times of the request insertion, dispatch and
completion hooks) is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(dated CPU for notebooks; time measured with simple code
instrumentation, and using the throughput-sync.sh script of the S
suite [1], in performance-profiling mode). To put this result into
context, the total, single-lock-protected, per-request execution time
of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).

Scheduling overhead further limits the maximum IOPS that a CPU can
process (already limited by the execution of the rest of the I/O
stack). To give an idea of these limits on slow or average CPUs, here
are, first, the limits of BFQ for three different CPUs, on,
respectively, an average laptop, an old desktop, and a cheap embedded
system, in case full hierarchical support is enabled (i.e.,
CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):

- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM Cortex-A53 Octa-core: 80 KIOPS

If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2). For BFQ, this leads to the following maximum
sustainable throughputs, on the same systems as above:

- Intel i7-4850HQ: 310 KIOPS
- AMD A8-3850: 200 KIOPS
- ARM Cortex-A53 Octa-core: 56 KIOPS

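To make the cost of CONFIG_BFQ_CGROUP_DEBUG easier to compare across
the three systems, the following minimal sketch computes the relative
throughput loss from the two sets of figures listed above. The
dictionary and function names are illustrative only; the numbers are
the ones reported in this document.

```python
# KIOPS figures copied from the two lists above
# (KIOPS = thousands of I/O operations per second).
KIOPS_NO_DEBUG = {
    "Intel i7-4850HQ": 400,
    "AMD A8-3850": 250,
    "ARM Cortex-A53 Octa-core": 80,
}
KIOPS_DEBUG = {
    "Intel i7-4850HQ": 310,
    "AMD A8-3850": 200,
    "ARM Cortex-A53 Octa-core": 56,
}


def debug_cost_percent(cpu):
    """Relative throughput loss, in percent, with CONFIG_BFQ_CGROUP_DEBUG set."""
    base = KIOPS_NO_DEBUG[cpu]
    return 100.0 * (base - KIOPS_DEBUG[cpu]) / base


for cpu in KIOPS_NO_DEBUG:
    print(f"{cpu}: {debug_cost_percent(cpu):.1f}% lower with debug statistics")
```

On these figures, the loss ranges from 20% to 30%, and it is largest
on the slowest CPU.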
BFQ works for multi-queue devices too.

.. The table of contents follows. Impatient readers can jump to Section 3.

.. CONTENTS

   1. When may BFQ be useful?
    1-1 Personal systems
    1-2 Server systems
   2. How does BFQ work?
   3. What are BFQ's tunables and how to properly configure BFQ?
   4. BFQ group scheduling
    4-1 Service guarantees provided
    4-2 Interface

1. When may BFQ be useful?
==========================

BFQ provides the following benefits on personal and server systems.

1-1 Personal systems
--------------------

Low latency for interactive applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it were idle. For example, even if one or more of the following
background workloads are being executed:

- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,

starting an application or loading a file from within an application
takes about the same time as if the storage device were idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).

Low latency for soft real-time applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Soft real-time applications, such as audio and video
players/streamers, also enjoy a low latency and a low drop rate,
regardless of the background I/O workload. As a consequence, these
applications suffer almost no glitches due to the background workload.

Higher speed for code-development tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If some additional workload happens to be executed in parallel, then
BFQ executes the I/O-related components of typical code-development
tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
NOOP or DEADLINE.

High throughput
^^^^^^^^^^^^^^^

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with all the
sequential workloads considered in our tests. With random workloads,
and with all the workloads on flash-based devices, BFQ achieves,
instead, about the same throughput as the other schedulers.

Strong fairness, bandwidth and delay guarantees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.

1-2 Server systems
------------------

Most benefits for server systems follow from the same service
properties as above. In particular, regardless of whether additional,
possibly heavy workloads are being served, BFQ guarantees:

* audio and video-streaming with zero or very low jitter and drop
  rate;

* fast retrieval of web pages and embedded objects;

* real-time recording of data in live-dumping applications (e.g.,
  packet logging);

* responsiveness in local and remote access to a server.


2. How does BFQ work?
=====================

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  `(bfq_)queue`.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices and on non-queueing flash-based
      devices, if processes do synchronous and sequential I/O. In
      addition, under BFQ, device idling is also instrumental in
      guaranteeing the desired throughput fraction to processes
      issuing sync requests (see the description of the slice_idle
      tunable in this document, or [1, 2], for more details).

      - With respect to idling for service guarantees, if several
        processes are competing for the device at the same time, but
        all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever
        idling the device. Throughput is thus as high as possible in
        this common scenario.

      - On flash-based storage with internal queueing of commands
        (typically NCQ), device idling happens to be always detrimental
        for throughput. So, with these devices, BFQ performs idling
        only when strictly needed for service guarantees, i.e., for
        guaranteeing low latency or fairness. In these cases, overall
        throughput may be sub-optimal. No solution currently exists to
        provide both strong service guarantees and optimal throughput
        on devices with internal queueing.

  - If low-latency mode is enabled (default configuration), BFQ
    executes some special heuristics to detect interactive and soft
    real-time applications (e.g., video or audio players/streamers),
    and to reduce their latency. The most important action taken to
    achieve this goal is to give the queues associated with these
    applications more than their fair share of the device
    throughput. For brevity, we simply call "weight-raising" the whole
    set of actions taken by BFQ to privilege these queues. In
    particular, BFQ provides a milder form of weight-raising for
    interactive applications, and a stronger form for soft real-time
    applications.

  - BFQ automatically deactivates idling for queues born in a burst of
    queue creations. In fact, these queues are usually associated with
    the processes of applications and services that benefit mostly
    from a high throughput. Examples are systemd during boot, or git
    grep.

  - Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
    performing random I/O that becomes mostly sequential if
    merged. Differently from CFQ, BFQ achieves this goal with a more
    reactive mechanism, called Early Queue Merge (EQM). EQM is so
    responsive in detecting interleaved I/O (cooperating processes),
    that it enables BFQ to achieve a high throughput, by queue
    merging, even for queues for which CFQ needs a different
    mechanism, preemption, to get a high throughput. As such, EQM is a
    unified mechanism to achieve a high throughput with interleaved
    I/O.

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling, details in Section 4.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
    counterintuitive at first) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        they get access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues.  Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to
  the Idle class, to prevent it from starving.


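The budget-independence property described above can be illustrated
with a toy simulation. The sketch below is *not* B-WF2Q+: it is a
deliberately simplified virtual-finish-time scheduler, written under
the assumptions that every queue is always backlogged, always consumes
its whole budget before expiring, and that no idling, timeout or
eligibility check is involved. The ``Queue`` class and the
``simulate`` function are hypothetical names introduced only for this
example.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    weight: int
    budget: int           # sectors served each time the queue is in service
    finish: float = 0.0   # virtual finish time of the queue's next budget
    served: int = 0       # total sectors served so far


def simulate(queues, total_sectors):
    """Serve always-backlogged queues until total_sectors are dispatched."""
    for q in queues:
        q.finish = q.budget / q.weight
    dispatched = 0
    while dispatched < total_sectors:
        # Pick the queue whose current budget finishes first in virtual time.
        q = min(queues, key=lambda x: x.finish)
        q.served += q.budget          # the queue exhausts its budget and expires
        dispatched += q.budget
        q.finish += q.budget / q.weight   # timestamp of its next budget
    return {q.name: q.served for q in queues}


# B has twice the weight of A, but a budget five times smaller.
result = simulate([Queue("A", 1, 1000), Queue("B", 2, 200)], 300000)
print(result, "B/A ratio:", result["B"] / result["A"])
```

In this toy model, the queue with the higher weight receives about
twice the service even though its budget is much smaller, because a
smaller budget only makes the queue be served more often, not less in
total.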
3. What are BFQ's tunables and how to properly configure BFQ?
=============================================================

Most BFQ tunables affect service guarantees (basically latency and
fairness) and throughput. For full details on how to choose the
desired tradeoff between service guarantees and throughput, see the
parameters slice_idle, strict_guarantees and low_latency. For details
on how to maximise throughput, see slice_idle, timeout_sync and
max_budget. The other performance-related parameters have been
inherited from, and have been preserved mostly for compatibility with
CFQ. So far, no performance improvement has been reported after
changing the latter parameters in BFQ.

In particular, the tunables back_seek_max, back_seek_penalty,
fifo_expire_async and fifo_expire_sync below are the same as in
CFQ. Their description is just copied from that for CFQ. Some
considerations in the description of slice_idle are copied from CFQ
too.

per-process ioprio and weight
-----------------------------

Unless the cgroups interface is used (see "4. BFQ group scheduling"),
weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = (IOPRIO_BE_NR - ioprio) * 10.

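As a worked example of the relation above, the following sketch
tabulates the weight obtained for each best-effort I/O priority. It
assumes IOPRIO_BE_NR is 8 (priorities 0 to 7), which is the value used
by kernels of this generation; check your kernel headers if in doubt.
The function name is illustrative.

```python
IOPRIO_BE_NR = 8  # assumption: number of best-effort priority levels (0..7)


def ioprio_to_weight(ioprio):
    """Apply weight = (IOPRIO_BE_NR - ioprio) * 10 from the text above."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError(f"ioprio must be in 0..{IOPRIO_BE_NR - 1}")
    return (IOPRIO_BE_NR - ioprio) * 10


for ioprio in range(IOPRIO_BE_NR):
    print(f"ioprio {ioprio} -> weight {ioprio_to_weight(ioprio)}")
```

Under this assumption, the highest priority (0) maps to weight 80 and
the lowest (7) to weight 10, so two always-busy processes at these two
priorities would share the throughput roughly 8 to 1.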
Beware that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.

slice_idle
----------

This parameter specifies how long BFQ should idle, waiting for the
next I/O request, when certain sync BFQ queues become empty. By default
slice_idle is a non-zero value. Idling has a double purpose: boosting
throughput and making sure that the desired throughput distribution is
respected (see the description of how BFQ works, and, if needed, the
papers referred to there).

As for throughput, idling can be very helpful on highly seeky media
like single spindle SATA/SAS disks, where it cuts down the overall
number of seeks and thus improves throughput.

Setting slice_idle to 0 will remove all the idling on queues and one
should see an overall improved throughput on faster storage devices
like multiple SATA/SAS disks in hardware RAID configuration, as well
as flash-based storage with internal command queueing (and
parallelism).

So depending on storage and workload, it might be useful to set
slice_idle=0.  In general, for SATA/SAS disks and software RAID of
SATA/SAS disks, keeping slice_idle enabled should be useful. For any
configurations where there are multiple spindles behind a single LUN
(host-based hardware RAID controller or storage arrays), or with
flash-based fast storage, setting slice_idle=0 might end up in better
throughput and acceptable latencies.

Idling is however necessary to have service guarantees enforced in
case of differentiated weights or differentiated I/O-request lengths.
To see why, suppose that a given BFQ queue A must get several I/O
requests served for each request served for another queue B. Idling
ensures that, if A makes a new I/O request slightly after becoming
empty, then no request of B is dispatched in the middle, and thus A
does not lose the possibility to get more than one request dispatched
before the next request of B is dispatched. Note that idling
guarantees the desired differentiated treatment of queues only in
terms of I/O-request dispatches. To guarantee that the actual service
order then corresponds to the dispatch order, the strict_guarantees
tunable must be set too.

There is an important flipside for idling: apart from the above cases
where it is beneficial also for throughput, idling can severely impact
throughput. One important case is random workload. Because of this
issue, BFQ tends to avoid idling as much as possible, when it is not
beneficial also for throughput (as detailed in Section 2). As a
consequence of this behavior, and of further issues described for the
strict_guarantees tunable, short-term service guarantees may be
occasionally violated. And, in some cases, these guarantees may be
more important than guaranteeing maximum throughput. For example, in
video playing/streaming, a very low drop rate may be more important
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.

slice_idle_us
-------------

Controls the same tuning parameter as slice_idle, but in microseconds.
Either tunable can be used to set idling behavior.  Afterwards, the
other tunable will reflect the newly set value in sysfs.

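The tunables described in this section are exposed as files in sysfs.
The helpers below are a minimal sketch of how they can be read and
written from a script, assuming the usual per-device layout
``/sys/block/<device>/queue/iosched/`` and that BFQ is the active
scheduler for the device. The helper names and the ``sysfs_root``
parameter (handy for dry runs against a scratch directory) are
introduced for this example only. Writing real tunables requires root
privileges.

```python
import os


def tunable_path(device, name, sysfs_root="/sys/block"):
    """Path of an I/O-scheduler tunable for the given block device."""
    return os.path.join(sysfs_root, device, "queue", "iosched", name)


def read_tunable(device, name, sysfs_root="/sys/block"):
    """Return the current integer value of a tunable."""
    with open(tunable_path(device, name, sysfs_root)) as f:
        return int(f.read().strip())


def write_tunable(device, name, value, sysfs_root="/sys/block"):
    """Set a tunable to an integer value."""
    with open(tunable_path(device, name, sysfs_root), "w") as f:
        f.write(str(value))


# Example usage, not executed here; "sda" is a placeholder device name:
#   write_tunable("sda", "slice_idle_us", 4000)
#   print(read_tunable("sda", "slice_idle"))   # should mirror the new value
```

After a write to slice_idle_us on a real device, reading slice_idle
back is a quick way to observe the mirroring between the two files
described above.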
strict_guarantees
-----------------

If this parameter is set (default: unset), then BFQ

- always performs idling when the in-service queue becomes empty;

- forces the device to serve one I/O request at a time, by dispatching a
  new request only if there is no outstanding request.

In the presence of differentiated weights or I/O-request sizes, both
the above conditions are needed to guarantee that every BFQ queue
receives its allotted share of the bandwidth. The first condition is
needed for the reasons explained in the description of the slice_idle
tunable.  The second condition is needed because all modern storage
devices reorder internally-queued requests, which may trivially break
the service guarantees enforced by the I/O scheduler.

Setting strict_guarantees may evidently affect throughput.

back_seek_max
-------------

This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.

back_seek_penalty
-----------------

This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the distance
of a "front" request, then the seeking cost of the two requests is
considered equivalent.

In that case, the scheduler will not bias toward one or the other request
(otherwise the scheduler will bias toward the front request). The default
value of back_seek_penalty is 2.

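The interplay of back_seek_max and back_seek_penalty can be sketched
as a cost function. This is only a model of the description above, not
the kernel's actual code: positions and distances are plain integers
in the same unit, and a request too far behind the head simply gets no
cost (``None``) here. The function name and the sample positions are
made up for the example.

```python
def seek_cost(request_pos, head_pos, back_seek_max, back_seek_penalty=2):
    """Cost of seeking to request_pos, or None if too far behind the head."""
    if request_pos >= head_pos:
        # "Front" request: the cost is the plain forward distance.
        return request_pos - head_pos
    distance = head_pos - request_pos
    if distance <= back_seek_max:
        # Backward request within range: the distance is penalized.
        return distance * back_seek_penalty
    return None  # beyond back_seek_max: not treated as a "next" candidate


head = 1000
front = seek_cost(1100, head, back_seek_max=16384)  # 100 ahead of the head
back = seek_cost(950, head, back_seek_max=16384)    # 50 behind the head
print(front, back)
```

With a penalty of 2, a request 50 units behind the head costs as much
as one 100 units ahead, which is the equivalence stated above.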
fifo_expire_async
-----------------

This parameter is used to set the timeout of asynchronous requests. The
default value is 248ms.

fifo_expire_sync
----------------

This parameter is used to set the timeout of synchronous requests. The
default value is 124ms. To favor synchronous requests over asynchronous
ones, this value should be decreased relative to fifo_expire_async.

low_latency
-----------

This parameter is used to enable/disable BFQ's low latency mode. By
default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.

DISABLE this mode if you need full control on bandwidth
distribution. In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee a lower latency to them.

In addition, as already highlighted at the beginning of this document,
DISABLE this mode if your only goal is to achieve a high throughput.
In fact, privileging the I/O of some application over the rest may
entail a lower throughput. To achieve the highest-possible throughput
on a non-rotational device, setting slice_idle to 0 may be needed too
(at the cost of giving up any strong guarantee on fairness and low
latency).

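The throughput-oriented advice above can be summarized as a small
settings table. The sketch below only builds the settings and the
corresponding shell commands as strings; it does not touch the system.
The function names are illustrative, the device name is a placeholder,
and the sysfs path assumes the usual
``/sys/block/<device>/queue/iosched/`` layout.

```python
def throughput_settings(rotational):
    """Tunables to change when maximum throughput is the only goal."""
    settings = {"low_latency": 0}   # switch off all low-latency heuristics
    if not rotational:
        # On non-rotational devices, disabling idling may be needed too.
        settings["slice_idle"] = 0
    return settings


def as_shell_commands(device, settings):
    """Render settings as echo commands for the device's iosched directory."""
    return [
        f"echo {value} > /sys/block/{device}/queue/iosched/{name}"
        for name, value in settings.items()
    ]


for command in as_shell_commands("sdX", throughput_settings(rotational=False)):
    print(command)
```

Keep in mind the tradeoff stated above: with slice_idle at 0, strong
guarantees on fairness and low latency are given up.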
timeout_sync
------------

Maximum amount of device time that can be given to a task (queue) once
it has been selected for service. On devices with costly seeks,
increasing this time usually increases maximum throughput. On the
other hand, increasing this time coarsens the granularity of the
short-term bandwidth and latency guarantees, especially if the
following parameter (max_budget) is set to zero.

max_budget
----------

Maximum amount of service, measured in sectors, that can be provided
to a BFQ queue once it is set in service (of course within the limits
of the above timeout). As explained in the description of the
algorithm, larger values increase the throughput in proportion to
the percentage of sequential I/O requests issued. The price of larger
values is that they coarsen the granularity of short-term bandwidth
and latency guarantees.

The default value is 0, which enables auto-tuning: BFQ sets max_budget
to the maximum number of sectors that can be served during
timeout_sync, according to the estimated peak rate.

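The relation behind auto-tuning can be sketched with simple
arithmetic. The Python function below is only an illustration of
"sectors that can be served during timeout_sync at the estimated peak
rate". Its name, its units and the sample numbers are assumptions of
this example, and the in-kernel computation uses its own internal
units and bounds.

```python
def autotuned_max_budget(peak_rate_sectors_per_s, timeout_sync_ms):
    """Sectors servable in timeout_sync_ms at the given peak rate.

    Illustrative only: not the kernel's actual auto-tuning code.
    """
    return peak_rate_sectors_per_s * timeout_sync_ms // 1000


# Assumed sample values: 400000 sectors/s estimated peak rate and a
# 125 ms timeout.
budget = autotuned_max_budget(400_000, 125)
print(budget)  # 50000
```

Doubling timeout_sync in this sketch doubles the resulting budget,
which is why raising timeout_sync is a way to get a larger budget
while leaving max_budget equal to 0.
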
For specific devices, some users have occasionally reported reaching
a higher throughput by setting max_budget explicitly, i.e., by
setting max_budget to a higher value than 0. In particular, they have
set max_budget to higher values than those to which BFQ would have set
it with auto-tuning. An alternative way to achieve this goal is to
just increase the value of timeout_sync, leaving max_budget equal to 0.

4. Group scheduling with BFQ
============================

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
blkio and io. In particular, BFQ supports weight-based proportional
share. To activate cgroups support, enable the kernel configuration
option CONFIG_BFQ_GROUP_IOSCHED.

4-1 Service guarantees provided
-------------------------------

With BFQ, proportional share means true proportional share of the
device bandwidth, according to group weights. For example, a group
with weight 200 gets twice the bandwidth, and not just twice the time,
of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth. Bandwidth is
distributed among groups and processes in the expected way: for each
group, the children of the group share the whole bandwidth of the
group in proportion to their weights. In particular, this implies
that, for each leaf group, every process of the group receives the
same share of the whole group bandwidth, unless the ioprio of the
process is modified.

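The hierarchical distribution described above can be sketched as a
short recursive computation. The Python code below is an idealized
model, not BFQ code: the function name bandwidth_shares and the
dictionary layout of the group tree are hypothetical, group names are
assumed to be unique, and the model ignores both the possible switch
to time-based sharing and any weight raising done by low_latency.

```python
from fractions import Fraction


def bandwidth_shares(group, bandwidth=Fraction(1)):
    """Return {leaf name: fraction of device bandwidth} for a group tree.

    Each group is a dict with "name", "weight" and optional "children".
    The children of a group split the bandwidth of that group in
    proportion to their weights.
    """
    children = group.get("children", [])
    if not children:
        return {group["name"]: bandwidth}

    total_weight = sum(child["weight"] for child in children)
    shares = {}
    for child in children:
        child_bandwidth = bandwidth * child["weight"] / total_weight
        shares.update(bandwidth_shares(child, child_bandwidth))
    return shares


# Hypothetical tree: A (weight 200) and B (weight 100) under the root;
# A has two children of equal weight.
tree = {
    "name": "root", "weight": 100, "children": [
        {"name": "A", "weight": 200, "children": [
            {"name": "A1", "weight": 100},
            {"name": "A2", "weight": 100},
        ]},
        {"name": "B", "weight": 100},
    ],
}
shares = bandwidth_shares(tree)
```

In this example A receives 2/3 of the device bandwidth and B receives
1/3, so A1, A2 and B each end up with 1/3.
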
The resource-sharing guarantee for a group may partially or totally
switch from bandwidth to time, if providing bandwidth guarantees to
the group lowers the throughput too much. This switch occurs on a
per-process basis: if serving a process of a leaf group in such a way
that it receives its share of the bandwidth causes throughput loss,
then BFQ switches back to just time-based proportional share for that
process.

4-2 Interface
-------------

To get proportional sharing of bandwidth with BFQ for a given device,
BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with
BFQ-specific cgroup parameters and stats begin with the "bfq."
prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
BFQ-specific files is "blkio.bfq." or "io.bfq.", respectively. For
example, the group parameter to set the weight of a group with BFQ is
blkio.bfq.weight or io.bfq.weight.

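The naming rule above can be sketched as a small Python helper that
builds the full path of a BFQ-specific cgroup file. The function name
bfq_cgroup_file is hypothetical, and the group directories used in
the example are assumptions about where the cgroup hierarchies are
mounted on a given system.

```python
from pathlib import PurePosixPath

# Full prefix of BFQ-specific files for each cgroups version.
BFQ_PREFIX = {1: "blkio.bfq.", 2: "io.bfq."}


def bfq_cgroup_file(group_dir, param, cgroup_version):
    """Return the path of a BFQ-specific file inside a group directory."""
    name = BFQ_PREFIX[cgroup_version] + param
    return str(PurePosixPath(group_dir) / name)


# Assumed mount points, for illustration only.
v1_path = bfq_cgroup_file("/sys/fs/cgroup/blkio/mygroup", "weight", 1)
v2_path = bfq_cgroup_file("/sys/fs/cgroup/mygroup", "weight", 2)
```

Writing a value in the documented range to the resulting file is then
how the weight of the group is set.
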
For cgroups-v1 (blkio controller), the exact set of stat files that
bfq creates and keeps up to date depends on whether
CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
the stat files documented in
Documentation/admin-guide/cgroup-v1/blkio-controller.rst. If, instead,
CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files::

  blkio.bfq.io_service_bytes
  blkio.bfq.io_service_bytes_recursive
  blkio.bfq.io_serviced
  blkio.bfq.io_serviced_recursive

The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.

Parameters to set
-----------------

For each group, there is only the following parameter to set.

weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
group inside its parent. Available values: 1..1000 (default 100). The
linear mapping between ioprio and weights, described at the beginning
of the tunable section, is still valid, but all weights higher than
IOPRIO_BE_NR*10 are mapped to ioprio 0.

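The mapping just described can be illustrated with the following
Python sketch. It assumes the linear relation weight =
(IOPRIO_BE_NR - ioprio) * 10 with IOPRIO_BE_NR equal to 8. The
function names, the rounding of weights that are not multiples of 10,
and the clamping of very low weights to ioprio 7 are assumptions of
this example, which is not the kernel's conversion code.

```python
IOPRIO_BE_NR = 8    # assumed number of best-effort ioprio levels (0..7)
WEIGHT_COEFF = 10   # factor of the linear ioprio/weight relation


def ioprio_to_weight(ioprio):
    """Linear mapping: weight = (IOPRIO_BE_NR - ioprio) * 10."""
    return (IOPRIO_BE_NR - ioprio) * WEIGHT_COEFF


def weight_to_ioprio(weight):
    """Illustrative inverse mapping, clamped to the range 0..7.

    Weights of IOPRIO_BE_NR * 10 or more all map to ioprio 0.
    """
    ioprio = IOPRIO_BE_NR - weight // WEIGHT_COEFF
    return max(0, min(IOPRIO_BE_NR - 1, ioprio))
```

With these assumptions, ioprio 0 corresponds to weight 80 and ioprio 7
to weight 10, while the default group weight of 100 and the maximum
weight of 1000 both fall on ioprio 0.
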
Recall that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.


[1]
    P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.

    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

[2]
    P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.

    Slightly extended version:

    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

[3]
    https://github.com/Algodev-github/S