===================
Block IO Controller
===================

Overview
========
The cgroup subsystem "blkio" implements the block IO controller. Various kinds
of IO control policies (such as proportional BW and max BW) are needed both at
leaf nodes and at intermediate nodes in a storage hierarchy. The plan is to use
the same cgroup based management interface for the blkio controller and to
switch IO policies in the background based on user options.

One IO control policy is the throttling policy, which can be used to
specify upper IO rate limits on devices. This policy is implemented in the
generic block layer and can be used on leaf nodes as well as on higher
level logical devices like device mapper.

HOWTO
=====
Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller::

        CONFIG_BLK_CGROUP=y

- Enable throttling in block layer::

        CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::

        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on a particular device for the root group.
  The format for policy is "<major>:<minor> <bytes_per_second>"::

        echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  The above puts a limit of 1MB/second on reads for the root group
  on the device with major/minor number 8:16.

- Run dd to read a file and see whether the rate is throttled to 1MB/s::

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

  Limits for writes can be set using the blkio.throttle.write_bps_device file.

Hierarchical Cgroups
====================

Throttling implements hierarchy support; however,
throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from the cgroup side, which currently is a development option and
not publicly available.

If somebody created a hierarchy as follows::

                        root
                        /  \
                    test1  test2
                      |
                    test3

Throttling with "sane_behavior" will handle the
hierarchy correctly. For throttling, all limits apply
to the whole subtree while all statistics are local to the IOs
directly generated by tasks in that cgroup.
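
As a rough illustration of "all limits apply to the whole subtree": IO issued
from test3 also has to pass the limits configured on test1 and on root, so its
ceiling can be no higher than the tightest limit on that path. The sketch below
is only a model of that reasoning, not kernel code. The helper name
``tightest_limit``, the sample values, and the convention that 0 means "no
limit configured" are assumptions made for this example.

```shell
#!/bin/sh
# Print the smallest non-zero limit among the arguments, or 0 if none is set.
# Assumption: 0 stands for "no limit configured" at that level.
tightest_limit() {
    min=0
    for limit in "$@"; do
        # Skip levels that have no limit configured.
        [ "$limit" -gt 0 ] || continue
        if [ "$min" -eq 0 ] || [ "$limit" -lt "$min" ]; then
            min="$limit"
        fi
    done
    echo "$min"
}

# Hypothetical read limits in bytes/s along the path root -> test1 -> test3:
# root has no limit, test1 is capped at 2 MB/s, test3 at 4 MB/s.
tightest_limit 0 2097152 4194304
```

With these sample values the ceiling for test3 is the 2 MB/s limit set on
test1, even though test3 itself allows 4 MB/s. Sibling groups sharing an
ancestor's limit may push the rate actually achieved lower still.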

Throttling without "sane_behavior" enabled from the cgroup side will
practically treat all groups as being at the same level, as if the
hierarchy looked like the following::

                                pivot
                             /  /   \  \
                        root  test1 test2  test3

Various user visible config options
===================================
CONFIG_BLK_CGROUP
        - Block IO controller.

CONFIG_BFQ_CGROUP_DEBUG
        - Debug help. Right now some additional stats files show up in cgroup
          if this option is enabled.

CONFIG_BLK_DEV_THROTTLING
        - Enable block device throttling support in block layer.

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
        - Specifies per cgroup weight. This is the default weight of the group
          on all the devices unless overridden by a per device rule
          (see blkio.weight_device).
          Currently allowed range of weights is from 10 to 1000.

- blkio.weight_device
        - One can specify per cgroup per device rules using this interface.
          These rules override the default value of group weight as specified
          by blkio.weight.

          Following is the format::

            # echo dev_maj:dev_minor weight > blkio.weight_device

          Configure weight=300 on /dev/sdb (8:16) in this cgroup::

            # echo 8:16 300 > blkio.weight_device
            # cat blkio.weight_device
            dev     weight
            8:16    300

          Configure weight=500 on /dev/sda (8:0) in this cgroup::

            # echo 8:0 500 > blkio.weight_device
            # cat blkio.weight_device
            dev     weight
            8:0     500
            8:16    300

          Remove specific weight for /dev/sda in this cgroup::

            # echo 8:0 0 > blkio.weight_device
            # cat blkio.weight_device
            dev     weight
            8:16    300

- blkio.time
        - disk time allocated to cgroup per device in milliseconds. First
          two fields specify the major and minor number of the device and
          third field specifies the disk time allocated to group in
          milliseconds.

- blkio.sectors
        - number of sectors transferred to/from disk by the group. First
          two fields specify the major and minor number of the device and
          third field specifies the number of sectors transferred by the
          group to/from the device.

- blkio.io_service_bytes
        - Number of bytes transferred to/from the disk by the group.
          These are further divided by the type of operation - read or write,
          sync or async. First two fields specify the major and minor number
          of the device, third field specifies the operation type and the
          fourth field specifies the number of bytes.

- blkio.io_serviced
        - Number of IOs (bio) issued to the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. First two fields specify the major and minor number of the
          device, third field specifies the operation type and the fourth field
          specifies the number of IOs.

- blkio.io_service_time
        - Total amount of time between request dispatch and request completion
          for the IOs done by this cgroup. This is in nanoseconds to make it
          meaningful for flash devices too. For devices with queue depth of 1,
          this time represents the actual service time. When queue_depth > 1,
          that is no longer true as requests may be served out of order. This
          may cause the service time for a given IO to include the service time
          of multiple IOs when served out of order which may result in total
          io_service_time > actual time elapsed. This time is further divided by
          the type of operation - read or write, sync or async. First two fields
          specify the major and minor number of the device, third field
          specifies the operation type and the fourth field specifies the
          io_service_time in ns.

- blkio.io_wait_time
        - Total amount of time the IOs for this cgroup spent waiting in the
          scheduler queues for service. This can be greater than the total time
          elapsed since it is cumulative io_wait_time for all IOs. It is not a
          measure of total time the cgroup spent waiting but rather a measure of
          the wait_time for its individual IOs. For devices with queue_depth > 1
          this metric does not include the time an IO spends waiting after it
          has been dispatched to the device and before it actually gets serviced
          (there might be a time lag here due to re-ordering of requests by the
          device). This is in nanoseconds to make it meaningful for flash
          devices too. This time is further divided by the type of operation -
          read or write, sync or async. First two fields specify the major and
          minor number of the device, third field specifies the operation type
          and the fourth field specifies the io_wait_time in ns.

- blkio.io_merged
        - Total number of bios/requests merged into requests belonging to this
          cgroup. This is further divided by the type of operation - read or
          write, sync or async.

- blkio.io_queued
        - Total number of requests queued up at any given instant for this
          cgroup. This is further divided by the type of operation - read or
          write, sync or async.

- blkio.avg_queue_size
        - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
          The average queue size for this cgroup over the entire time of this
          cgroup's existence. Queue size samples are taken each time one of the
          queues of this cgroup gets a timeslice.

- blkio.group_wait_time
        - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
          This is the amount of time the cgroup had to wait since it became busy
          (i.e., went from 0 to 1 request queued) to get a timeslice for one of
          its queues. This is different from the io_wait_time which is the
          cumulative total of the amount of time spent by each IO in that cgroup
          waiting in the scheduler queue. This is in nanoseconds. If this is
          read when the cgroup is in a waiting (for timeslice) state, the stat
          will only report the group_wait_time accumulated till the last time it
          got a timeslice and will not include the current delta.

- blkio.empty_time
        - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
          This is the amount of time a cgroup spends without any pending
          requests when not being served, i.e., it does not include any time
          spent idling for one of the queues of the cgroup. This is in
          nanoseconds. If this is read when the cgroup is in an empty state,
          the stat will only report the empty_time accumulated till the last
          time it had a pending request and will not include the current delta.

- blkio.idle_time
        - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
          This is the amount of time spent by the IO scheduler idling for a
          given cgroup in anticipation of a better request than the existing ones
          from other queues/cgroups. This is in nanoseconds. If this is read
          when the cgroup is in an idling state, the stat will only report the
          idle_time accumulated till the last idle period and will not include
          the current delta.

- blkio.dequeue
        - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
          gives statistics about how many times a group was dequeued
          from the service tree of the device. First two fields specify the major
          and minor number of the device and third field specifies the number
          of times a group was dequeued from a particular device.

- blkio.*_recursive
        - Recursive version of various stats. These files show the
          same information as their non-recursive counterparts but
          include stats from all the descendant cgroups.

Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
        - Specifies upper limit on READ rate from the device. IO rate is
          specified in bytes per second. Rules are per device. Following is
          the format::

            echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

- blkio.throttle.write_bps_device
        - Specifies upper limit on WRITE rate to the device.
          IO rate is specified in bytes per second. Rules are per device.
          Following is the format::

            echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

- blkio.throttle.read_iops_device
        - Specifies upper limit on READ rate from the device. IO rate is
          specified in IO per second. Rules are per device. Following is
          the format::

            echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

- blkio.throttle.write_iops_device
        - Specifies upper limit on WRITE rate to the device. IO rate is
          specified in IO per second. Rules are per device. Following is
          the format::

            echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Note: If both BW and IOPS rules are specified for a device, then IO is
      subjected to both the constraints.

- blkio.throttle.io_serviced
        - Number of IOs (bio) issued to the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. First two fields specify the major and minor number of the
          device, third field specifies the operation type and the fourth field
          specifies the number of IOs.

- blkio.throttle.io_service_bytes
        - Number of bytes transferred to/from the disk by the group.
          These are further divided by the type of operation - read or write,
          sync or async. First two fields specify the major and minor number
          of the device, third field specifies the operation type and the
          fourth field specifies the number of bytes.

Common files among various policies
-----------------------------------
- blkio.reset_stats
        - Writing an int to this file will result in resetting all the stats
          for that cgroup.
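
The per-operation stat files described above (for example
blkio.throttle.io_service_bytes) list the device, the operation type and a
value on each line. The following is a minimal sketch of totalling one
operation type across devices with awk. It assumes each per-device line has
the shape "<major>:<minor> <op> <value>" and skips any line with a different
field count. The helper names ``sum_bytes_for_op`` and ``sample_stats`` and
all sample values are hypothetical; real input would come from the cgroup
file itself.

```shell
#!/bin/sh
# Sum the values for one operation type across all devices, reading
# stat-file style input from stdin.
# Assumption: per-device lines look like "<major>:<minor> <op> <value>";
# lines with a different number of fields are ignored.
sum_bytes_for_op() {
    op="$1"
    awk -v op="$op" 'NF == 3 && $2 == op { total += $3 } END { print total + 0 }'
}

# Hypothetical sample data standing in for the contents of a stat file.
sample_stats() {
    printf '%s\n' \
        '8:16 Read 4194304' \
        '8:16 Write 1048576' \
        '8:0 Read 8192' \
        '8:0 Write 0' \
        'Total 5251072'
}

# Total bytes read across both sample devices.
sample_stats | sum_bytes_for_op Read
```

On a live system the same function could be fed from a file, for example
``sum_bytes_for_op Read < /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes``,
assuming the controller is mounted at the path used in the HOWTO.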