==========================
BFQ (Budget Fair Queueing)
==========================

BFQ is a proportional-share I/O scheduler, with some extra
low-latency capabilities. In addition to cgroups support (blkio or io
controllers), BFQ's main features are:

- BFQ guarantees a high system and application responsiveness, and a
  low latency for time-sensitive applications, such as audio or video
  players;
- BFQ distributes bandwidth, and not just time, among processes or
  groups (switching back to time distribution when needed to keep
  throughput high).

In its default configuration, BFQ privileges latency over
throughput. So, when needed for achieving a lower latency, BFQ builds
schedules that may lead to a lower throughput. If your main or only
goal, for a given device, is to achieve the maximum-possible
throughput at all times, then do switch off all low-latency heuristics
for that device, by setting low_latency to 0. See Section 3 for
details on how to configure BFQ for the desired tradeoff between
latency and throughput, or on how to maximize throughput.

Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
processing.
To give an idea of this overhead, the total,
single-lock-protected, per-request processing time of BFQ (i.e., the
sum of the execution times of the request insertion, dispatch and
completion hooks) is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(dated CPU for notebooks; time measured with simple code
instrumentation, and using the throughput-sync.sh script of the S
suite [1], in performance-profiling mode). To put this result into
context, the total, single-lock-protected, per-request execution time
of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).

Scheduling overhead further limits the maximum IOPS that a CPU can
process (already limited by the execution of the rest of the I/O
stack). To give an idea of the limits with BFQ, on slow or average
CPUs, here are, first, the limits of BFQ for three different CPUs, on,
respectively, an average laptop, an old desktop, and a cheap embedded
system, in case full hierarchical support is enabled (i.e.,
CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):

- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM Cortex-A53 Octa-core: 80 KIOPS

If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2).
For BFQ, this leads to the following maximum
sustainable throughputs, on the same systems as above:

- Intel i7-4850HQ: 310 KIOPS
- AMD A8-3850: 200 KIOPS
- ARM Cortex-A53 Octa-core: 56 KIOPS

BFQ works for multi-queue devices too.

.. The table of contents follows. Impatient readers can just jump to Section 3.

.. CONTENTS

   1. When may BFQ be useful?
    1-1 Personal systems
    1-2 Server systems
   2. How does BFQ work?
   3. What are BFQ's tunables and how to properly configure BFQ?
   4. BFQ group scheduling
     4-1 Service guarantees provided
     4-2 Interface

1. When may BFQ be useful?
==========================

BFQ provides the following benefits on personal and server systems.

1-1 Personal systems
--------------------

Low latency for interactive applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it was idle.
For example, even if one or more of the following
background workloads are being executed:

- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,

starting an application or loading a file from within an application
takes about the same time as if the storage device was idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).

Low latency for soft real-time applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Soft real-time applications, such as audio and video
players/streamers, also enjoy a low latency and a low drop rate,
regardless of the background I/O workload. As a consequence, these
applications suffer almost no glitches due to the background workload.

Higher speed for code-development tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If some additional workload happens to be executed in parallel, then
BFQ executes the I/O-related components of typical code-development
tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
NOOP or DEADLINE.

High throughput
^^^^^^^^^^^^^^^

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with all the
sequential workloads considered in our tests. With random workloads,
and with all the workloads on flash-based devices, BFQ achieves,
instead, about the same throughput as the other schedulers.

Strong fairness, bandwidth and delay guarantees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.

1-2 Server systems
------------------

Most benefits for server systems follow from the same service
properties as above.
In particular, regardless of whether additional,
possibly heavy workloads are being served, BFQ guarantees:

* audio and video-streaming with zero or very low jitter and drop
  rate;

* fast retrieval of web pages and embedded objects;

* real-time recording of data in live-dumping applications (e.g.,
  packet logging);

* responsiveness in local and remote access to a server.


2. How does BFQ work?
=====================

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  `(bfq_)queue`.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices and on non-queueing flash-based
      devices, if processes do synchronous and sequential I/O. In
      addition, under BFQ, device idling is also instrumental in
      guaranteeing the desired throughput fraction to processes
      issuing sync requests (see the description of the slice_idle
      tunable in this document, or [1, 2], for more details).

      - With respect to idling for service guarantees, if several
        processes are competing for the device at the same time, but
        all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever
        idling the device. Throughput is thus as high as possible in
        this common scenario.

      - On flash-based storage with internal queueing of commands
        (typically NCQ), device idling happens to be always detrimental
        for throughput.
        So, with these devices, BFQ performs idling
        only when strictly needed for service guarantees, i.e., for
        guaranteeing low latency or fairness. In these cases, overall
        throughput may be sub-optimal. No solution currently exists to
        provide both strong service guarantees and optimal throughput
        on devices with internal queueing.

  - If low-latency mode is enabled (default configuration), BFQ
    executes some special heuristics to detect interactive and soft
    real-time applications (e.g., video or audio players/streamers),
    and to reduce their latency. The most important action taken to
    achieve this goal is to give the queues associated with these
    applications more than their fair share of the device
    throughput. For brevity, we simply call "weight-raising" the whole
    set of actions taken by BFQ to privilege these queues. In
    particular, BFQ provides a milder form of weight-raising for
    interactive applications, and a stronger form for soft real-time
    applications.

  - BFQ automatically deactivates idling for queues born in a burst of
    queue creations. In fact, these queues are usually associated with
    the processes of applications and services that benefit mostly
    from a high throughput. Examples are systemd during boot, or git
    grep.

  - Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
    performing random I/O that becomes mostly sequential if
    merged.
    Unlike CFQ, BFQ achieves this goal with a more
    reactive mechanism, called Early Queue Merge (EQM). EQM is so
    responsive in detecting interleaved I/O (cooperating processes),
    that it enables BFQ to achieve a high throughput, by queue
    merging, even for queues for which CFQ needs a different
    mechanism, preemption, to get a high throughput. As such, EQM is a
    unified mechanism to achieve a high throughput with interleaved
    I/O.

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling, details in Section 4.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues.
      As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        they get access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues. Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to
  the Idle class, to prevent it from starving.


3. What are BFQ's tunables and how to properly configure BFQ?
=============================================================

Most BFQ tunables affect service guarantees (basically latency and
fairness) and throughput. For full details on how to choose the
desired tradeoff between service guarantees and throughput, see the
parameters slice_idle, strict_guarantees and low_latency. For details
on how to maximise throughput, see slice_idle, timeout_sync and
max_budget. The other performance-related parameters have been
inherited from, and have been preserved mostly for compatibility with
CFQ. So far, no performance improvement has been reported after
changing the latter parameters in BFQ.

In particular, the tunables back_seek_max, back_seek_penalty,
fifo_expire_async and fifo_expire_sync below are the same as in
CFQ. Their description is just copied from that for CFQ. Some
considerations in the description of slice_idle are copied from CFQ
too.

per-process ioprio and weight
-----------------------------

Unless the cgroups interface is used (see "4. BFQ group scheduling"),
weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = (IOPRIO_BE_NR - ioprio) * 10.

Beware that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.

slice_idle
----------

This parameter specifies how long BFQ should idle, waiting for the
next I/O request, when certain sync BFQ queues become empty. By default
slice_idle is a non-zero value. Idling has a double purpose: boosting
throughput and making sure that the desired throughput distribution is
respected (see the description of how BFQ works, and, if needed, the
papers referred there).

As for throughput, idling can be very helpful on highly seeky media
like single spindle SATA/SAS disks where we can cut down on overall
number of seeks and see improved throughput.

Setting slice_idle to 0 will remove all the idling on queues and one
should see an overall improved throughput on faster storage devices
like multiple SATA/SAS disks in hardware RAID configuration, as well
as flash-based storage with internal command queueing (and
parallelism).

So depending on storage and workload, it might be useful to set
slice_idle=0. In general for SATA/SAS disks and software RAID of
SATA/SAS disks keeping slice_idle enabled should be useful. For any
configurations where there are multiple spindles behind single LUN
(Host based hardware RAID controller or for storage arrays), or with
flash-based fast storage, setting slice_idle=0 might end up in better
throughput and acceptable latencies.

Idling is however necessary to have service guarantees enforced in
case of differentiated weights or differentiated I/O-request lengths.
To see why, suppose that a given BFQ queue A must get several I/O
requests served for each request served for another queue B. Idling
ensures that, if A makes a new I/O request slightly after becoming
empty, then no request of B is dispatched in the middle, and thus A
does not lose the possibility to get more than one request dispatched
before the next request of B is dispatched. Note that idling
guarantees the desired differentiated treatment of queues only in
terms of I/O-request dispatches.
To guarantee that the actual service
order then corresponds to the dispatch order, the strict_guarantees
tunable must be set too.

There is an important flipside for idling: apart from the above cases
where it is beneficial also for throughput, idling can severely impact
throughput. One important case is random workload. Because of this
issue, BFQ tends to avoid idling as much as possible, when it is not
beneficial also for throughput (as detailed in Section 2). As a
consequence of this behavior, and of further issues described for the
strict_guarantees tunable, short-term service guarantees may be
occasionally violated. And, in some cases, these guarantees may be
more important than guaranteeing maximum throughput. For example, in
video playing/streaming, a very low drop rate may be more important
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.

slice_idle_us
-------------

Controls the same tuning parameter as slice_idle, but in microseconds.
Either tunable can be used to set idling behavior. Afterwards, the
other tunable will reflect the newly set value in sysfs.
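
As a minimal sketch of how the two idling tunables could be driven
from a script, the Python helpers below read and write integer values
in a scheduler tunables directory. The helper names are hypothetical,
and the directory is passed in explicitly: on a typical system it
would be the ``queue/iosched`` directory of the device in sysfs (for
example ``/sys/block/sda/queue/iosched``, where ``sda`` is only a
placeholder), and writing there requires root privileges.

```python
from pathlib import Path


def read_tunable(iosched_dir, name):
    """Return the integer value currently stored in one tunable file."""
    return int(Path(iosched_dir, name).read_text().strip())


def write_tunable(iosched_dir, name, value):
    """Write an integer value to one tunable file."""
    Path(iosched_dir, name).write_text(f"{int(value)}\n")


def disable_idling(iosched_dir):
    """Set slice_idle to 0, which removes all idling on queues."""
    write_tunable(iosched_dir, "slice_idle", 0)
```

On a real device, reading ``slice_idle_us`` after ``disable_idling()``
should also return 0, since both files expose the same parameter.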

strict_guarantees
-----------------

If this parameter is set (default: unset), then BFQ

- always performs idling when the in-service queue becomes empty;

- forces the device to serve one I/O request at a time, by dispatching a
  new request only if there is no outstanding request.

In the presence of differentiated weights or I/O-request sizes, both
the above conditions are needed to guarantee that every BFQ queue
receives its allotted share of the bandwidth. The first condition is
needed for the reasons explained in the description of the slice_idle
tunable. The second condition is needed because all modern storage
devices reorder internally-queued requests, which may trivially break
the service guarantees enforced by the I/O scheduler.

Setting strict_guarantees may evidently affect throughput.

back_seek_max
-------------

This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.
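
The distance check described above can be illustrated with a short
Python sketch. This is not the scheduler's code: the function name is
hypothetical, and the sketch assumes that positions are expressed in
512-byte sectors.

```python
SECTOR_BYTES = 512  # assumption: positions are counted in 512-byte sectors


def is_backward_candidate(head_sector, request_sector, back_seek_max_kib):
    """Return True if a request behind the head lies within back_seek_max."""
    if request_sector >= head_sector:
        return False  # the request is not in the backward direction
    distance_bytes = (head_sector - request_sector) * SECTOR_BYTES
    return distance_bytes <= back_seek_max_kib * 1024
```

With ``back_seek_max`` set to 16384 Kbytes, for instance, a request
located 32768 sectors behind the head is still a candidate, while one
located 32769 sectors behind is not.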

back_seek_penalty
-----------------

This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the distance
of a "front" request, then the seeking costs of the two requests are
considered equivalent.

In that case, the scheduler is not biased toward either request (otherwise,
the scheduler is biased toward the front request). The default value of
back_seek_penalty is 2.

fifo_expire_async
-----------------

This parameter is used to set the timeout of asynchronous requests. The
default value is 248ms.

fifo_expire_sync
----------------

This parameter is used to set the timeout of synchronous requests. The
default value is 124ms. To favor synchronous requests over asynchronous
ones, this value should be decreased relative to fifo_expire_async.

low_latency
-----------

This parameter is used to enable/disable BFQ's low latency mode. By
default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.

DISABLE this mode if you need full control on bandwidth
distribution.
In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee a lower latency to them.

In addition, as already highlighted at the beginning of this document,
DISABLE this mode if your only goal is to achieve a high throughput.
In fact, privileging the I/O of some application over the rest may
entail a lower throughput. To achieve the highest-possible throughput
on a non-rotational device, setting slice_idle to 0 may be needed too
(at the cost of giving up any strong guarantee on fairness and low
latency).

timeout_sync
------------

Maximum amount of device time that can be given to a task (queue) once
it has been selected for service. On devices with costly seeks,
increasing this time usually increases maximum throughput. On the
other hand, increasing this time coarsens the granularity of the
short-term bandwidth and latency guarantees, especially if the
following parameter is set to zero.

max_budget
----------

Maximum amount of service, measured in sectors, that can be provided
to a BFQ queue once it is set in service (of course within the limits
of the above timeout). According to what is said in the description of
the algorithm, larger values increase the throughput in proportion to
the percentage of sequential I/O requests issued.
The price of larger
values is that they coarsen the granularity of short-term bandwidth
and latency guarantees.

The default value is 0, which enables auto-tuning: BFQ sets max_budget
to the maximum number of sectors that can be served during
timeout_sync, according to the estimated peak rate.

For specific devices, some users have occasionally reported to have
reached a higher throughput by setting max_budget explicitly, i.e., by
setting max_budget to a higher value than 0. In particular, they have
set max_budget to higher values than those to which BFQ would have set
it with auto-tuning. An alternative way to achieve this goal is to
just increase the value of timeout_sync, leaving max_budget equal to 0.

4. Group scheduling with BFQ
============================

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
blkio and io. In particular, BFQ supports weight-based proportional
share. To activate cgroups support, set BFQ_GROUP_IOSCHED.

4-1 Service guarantees provided
-------------------------------

With BFQ, proportional share means true proportional share of the
device bandwidth, according to group weights. For example, a group
with weight 200 gets twice the bandwidth, and not just twice the time,
of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth.
Bandwidth is
distributed among groups and processes in the expected way: for each
group, the children of the group share the whole bandwidth of the
group in proportion to their weights. In particular, this implies
that, for each leaf group, every process of the group receives the
same share of the whole group bandwidth, unless the ioprio of the
process is modified.

The resource-sharing guarantee for a group may partially or totally
switch from bandwidth to time, if providing bandwidth guarantees to
the group lowers the throughput too much. This switch occurs on a
per-process basis: if serving a process of a leaf group so that it
receives its share of the bandwidth causes throughput loss, then
BFQ switches back to just time-based proportional share for that
process.

4-2 Interface
-------------

To get proportional sharing of bandwidth with BFQ for a given device,
BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with
BFQ-specific cgroup parameters and stats begin with the "bfq."
prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
BFQ-specific files is "blkio.bfq." or "io.bfq.", respectively. For
example, the group parameter to set the weight of a group with BFQ is
blkio.bfq.weight or io.bfq.weight.
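
The naming rule above can be illustrated with a minimal Python sketch.
The helper names bfq_weight_file and set_bfq_weight are hypothetical
and not part of any kernel or library interface; the only facts taken
from this document are the "blkio.bfq." and "io.bfq." prefixes and the
1..1000 weight range given in the "Parameters to set" section. The
sketch assumes that the caller passes the directory of an existing
group and has permission to write to it.

```python
from pathlib import Path

# Weight range as stated in the "Parameters to set" section.
BFQ_WEIGHT_MIN = 1
BFQ_WEIGHT_MAX = 1000

# Full prefix of BFQ-specific files for each cgroups version.
BFQ_PREFIXES = {1: "blkio.bfq.", 2: "io.bfq."}


def bfq_weight_file(group_dir, cgroup_version):
    """Return the path of the BFQ weight file inside a group directory."""
    if cgroup_version not in BFQ_PREFIXES:
        raise ValueError("cgroup_version must be 1 or 2")
    return Path(group_dir) / (BFQ_PREFIXES[cgroup_version] + "weight")


def set_bfq_weight(group_dir, cgroup_version, weight):
    """Check the weight range, then write the weight to the group's file."""
    if not BFQ_WEIGHT_MIN <= weight <= BFQ_WEIGHT_MAX:
        raise ValueError("weight must be in 1..1000")
    path = bfq_weight_file(group_dir, cgroup_version)
    path.write_text(f"{weight}\n")
    return path
```

For instance, on a cgroups-v2 system, a call such as
set_bfq_weight("/sys/fs/cgroup/mygroup", 2, 200) would write 200 to
io.bfq.weight in that group; the directory name here is only an
assumption chosen for illustration.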

As for cgroups-v1 (blkio controller), the exact set of stat files
created, and kept up-to-date by bfq, depends on whether
CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
the stat files documented in
Documentation/admin-guide/cgroup-v1/blkio-controller.rst. If, instead,
CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files::

  blkio.bfq.io_service_bytes
  blkio.bfq.io_service_bytes_recursive
  blkio.bfq.io_serviced
  blkio.bfq.io_serviced_recursive

The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.

Parameters to set
-----------------

For each group, there is only the following parameter to set.

weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
group inside its parent. Available values: 1..1000 (default 100). The
linear mapping between ioprio and weights, described at the beginning
of the tunable section, is still valid, but all weights higher than
IOPRIO_BE_NR*10 are mapped to ioprio 0.

Recall that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications.
Unset this tunable if you need/want to control weights.


[1]
    P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.

    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

[2]
    P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.

    Slightly extended version:

    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

[3]
    https://github.com/Algodev-github/S