===================
Block IO Controller
===================

Overview
========
The cgroup subsystem "blkio" implements the block IO controller. Various
kinds of IO control policies (such as proportional bandwidth and maximum
bandwidth) are needed both at leaf nodes and at intermediate nodes in a
storage hierarchy. The plan is to use the same cgroup-based management
interface for the blkio controller and to switch IO policies in the
background based on user options.

One IO control policy is the throttling policy, which can be used to
specify upper IO rate limits on devices. This policy is implemented in the
generic block layer and can be used on leaf nodes as well as on higher-level
logical devices such as device mapper.

HOWTO
=====
Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller::

	CONFIG_BLK_CGROUP=y

- Enable throttling in block layer::

	CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::

        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on a particular device for the root group. The
  format for the policy is "<major>:<minor>  <bytes_per_second>"::

        echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  The above puts a limit of 1MB/second on reads by the root group on the
  device with major/minor number 8:16.

- Run dd to read a file and see whether the rate is throttled to 1MB/s::

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

  Limits for writes can be set using the blkio.throttle.write_bps_device file.
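
As a rough sanity check on the dd run above, the expected duration under a
byte-rate limit is simply the transfer size divided by the limit. The sketch
below only illustrates that arithmetic; the helper names ``throttle_rule`` and
``expected_seconds`` are hypothetical and are not part of any kernel interface.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Build a rule string in the "<major>:<minor>  <bytes_per_second>" format.
throttle_rule() {
    printf '%s:%s  %s\n' "$1" "$2" "$3"
}

# Lower bound, in whole seconds (rounded up), for moving $1 bytes
# at a limit of $2 bytes per second.
expected_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Rule for device 8:16 at 1048576 bytes/s.
throttle_rule 8 16 1048576

# The dd run above: bs=4K count=1024 is 4194304 bytes.
expected_seconds $((4096 * 1024)) 1048576
```

With the numbers from the example, the second call prints 4, which matches
the roughly four seconds that dd reports.
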

Hierarchical Cgroups
====================

Throttling implements hierarchy support; however, it is enabled only if
"sane_behavior" is enabled on the cgroup side, which is currently a
development option and not publicly available.

Suppose somebody creates a hierarchy like the following::

			root
			/  \
		     test1 test2
			|
		     test3

Throttling with "sane_behavior" handles the hierarchy correctly. All limits
apply to the whole subtree, while all statistics are local to the IOs
directly generated by tasks in that cgroup.

Throttling without "sane_behavior" enabled on the cgroup side treats all
groups as if they were at the same level, as in the following::

				pivot
			     /  /   \  \
			root  test1 test2  test3
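
One way to read "all limits apply to the whole subtree" is that IO from a
cgroup must fit within its own limit and within the limit of every ancestor,
so the tightest limit on the path from the root decides. The sketch below
shows only that reading; ``effective_limit`` is a hypothetical helper and the
``-`` placeholder for "no limit set" is a convention of this sketch, not
kernel syntax.

```shell
#!/bin/sh
# Hypothetical helper: effective ceiling for a cgroup when limits apply to
# the whole subtree. Arguments are the limits configured on each cgroup
# from the root down to the leaf; "-" marks a cgroup with no limit set.
effective_limit() {
    min=-
    for limit in "$@"; do
        # Skip cgroups that have no limit configured.
        [ "$limit" = "-" ] && continue
        if [ "$min" = "-" ] || [ "$limit" -lt "$min" ]; then
            min=$limit
        fi
    done
    echo "$min"
}

# Path root -> test1 -> test3: root has no limit, test1 is limited to
# 2097152 bytes/s, test3 to 4194304 bytes/s.
effective_limit - 2097152 4194304
```

In this example the call prints 2097152, because the limit on test1 is
tighter than the one on test3.
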

Various user visible config options
===================================
CONFIG_BLK_CGROUP
	- Block IO controller.

CONFIG_BFQ_CGROUP_DEBUG
	- Debug help. Right now some additional stats files show up in the
	  cgroup if this option is enabled.

CONFIG_BLK_DEV_THROTTLING
	- Enable block device throttling support in the block layer.

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
	- Specifies the per-cgroup weight. This is the default weight of the
	  group on all devices unless overridden by a per-device rule
	  (see blkio.weight_device).
	  The currently allowed range of weights is 10 to 1000.

- blkio.weight_device
	- Specifies per-cgroup, per-device rules. These rules override the
	  default group weight specified by blkio.weight.

	  The format is as follows::

	    # echo dev_maj:dev_minor weight > blkio.weight_device

	  Configure weight=300 on /dev/sdb (8:16) in this cgroup::

	    # echo 8:16 300 > blkio.weight_device
	    # cat blkio.weight_device
	    dev     weight
	    8:16    300

	  Configure weight=500 on /dev/sda (8:0) in this cgroup::

	    # echo 8:0 500 > blkio.weight_device
	    # cat blkio.weight_device
	    dev     weight
	    8:0     500
	    8:16    300

	  Remove the specific weight for /dev/sda in this cgroup::

	    # echo 8:0 0 > blkio.weight_device
	    # cat blkio.weight_device
	    dev     weight
	    8:16    300

- blkio.time
	- Disk time allocated to the cgroup per device, in milliseconds. The
	  first two fields specify the major and minor number of the device
	  and the third field specifies the disk time allocated to the group
	  in milliseconds.

- blkio.sectors
	- Number of sectors transferred to/from disk by the group. The first
	  two fields specify the major and minor number of the device and the
	  third field specifies the number of sectors transferred by the
	  group to/from the device.

- blkio.io_service_bytes
	- Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation: read or write, sync
	  or async. The first two fields specify the major and minor number
	  of the device, the third field specifies the operation type and the
	  fourth field specifies the number of bytes.

- blkio.io_serviced
	- Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation: read or write, sync
	  or async. The first two fields specify the major and minor number
	  of the device, the third field specifies the operation type and the
	  fourth field specifies the number of IOs.

- blkio.io_service_time
	- Total amount of time between request dispatch and request completion
	  for the IOs done by this cgroup. This is in nanoseconds to make it
	  meaningful for flash devices too. For devices with a queue depth of
	  1, this time represents the actual service time. When
	  queue_depth > 1, that is no longer true, as requests may be served
	  out of order. The service time for a given IO may then include the
	  service time of multiple IOs, which may result in a total
	  io_service_time greater than the actual time elapsed. This time is
	  further divided by the type of operation: read or write, sync or
	  async. The first two fields specify the major and minor number of
	  the device, the third field specifies the operation type and the
	  fourth field specifies the io_service_time in ns.

- blkio.io_wait_time
	- Total amount of time the IOs for this cgroup spent waiting in the
	  scheduler queues for service. This can be greater than the total time
	  elapsed since it is the cumulative io_wait_time for all IOs. It is
	  not a measure of the total time the cgroup spent waiting but rather a
	  measure of the wait_time for its individual IOs. For devices with
	  queue_depth > 1, this metric does not include the time an IO spends
	  waiting after it is dispatched to the device but before it is
	  actually serviced (there might be a time lag here due to re-ordering
	  of requests by the device). This is in nanoseconds to make it
	  meaningful for flash devices too. This time is further divided by
	  the type of operation: read or write, sync or async. The first two
	  fields specify the major and minor number of the device, the third
	  field specifies the operation type and the fourth field specifies
	  the io_wait_time in ns.

- blkio.io_merged
	- Total number of bios/requests merged into requests belonging to this
	  cgroup. This is further divided by the type of operation: read or
	  write, sync or async.

- blkio.io_queued
	- Total number of requests queued up at any given instant for this
	  cgroup. This is further divided by the type of operation: read or
	  write, sync or async.

- blkio.avg_queue_size
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  The average queue size for this cgroup over the entire time of this
	  cgroup's existence. Queue size samples are taken each time one of the
	  queues of this cgroup gets a timeslice.

- blkio.group_wait_time
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time the cgroup had to wait since it became busy
	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
	  its queues. This is different from the io_wait_time, which is the
	  cumulative total of the amount of time spent by each IO in that cgroup
	  waiting in the scheduler queue. This is in nanoseconds. If this is
	  read when the cgroup is in a waiting (for timeslice) state, the stat
	  will only report the group_wait_time accumulated until the last time
	  it got a timeslice and will not include the current delta.

- blkio.empty_time
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time a cgroup spends without any pending
	  requests when not being served, i.e., it does not include any time
	  spent idling for one of the queues of the cgroup. This is in
	  nanoseconds. If this is read when the cgroup is in an empty state,
	  the stat will only report the empty_time accumulated until the last
	  time it had a pending request and will not include the current delta.

- blkio.idle_time
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time spent by the IO scheduler idling for a
	  given cgroup in anticipation of a better request than the existing ones
	  from other queues/cgroups. This is in nanoseconds. If this is read
	  when the cgroup is in an idling state, the stat will only report the
	  idle_time accumulated until the last idle period and will not include
	  the current delta.

- blkio.dequeue
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
	  gives statistics about how many times a group was dequeued from
	  the service tree of the device. The first two fields specify the
	  major and minor number of the device and the third field specifies
	  the number of times a group was dequeued from a particular device.

- blkio.*_recursive
	- Recursive version of various stats. These files show the
	  same information as their non-recursive counterparts but
	  include stats from all the descendant cgroups.
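
The blkio.weight_device examples above can be wrapped in a small check before
writing a rule. This is a minimal sketch that assumes per-device weights use
the same 10 to 1000 range given for blkio.weight, and that 0 is accepted only
because the example above uses it to remove a rule; ``weight_rule`` is a
hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print a blkio.weight_device rule, or fail if the
# weight is outside the assumed range (10 to 1000). A weight of 0 is
# accepted because the examples above use it to remove a per-device rule.
weight_rule() {
    dev=$1
    weight=$2
    if [ "$weight" -ne 0 ] && { [ "$weight" -lt 10 ] || [ "$weight" -gt 1000 ]; }; then
        echo "weight $weight out of range (10-1000, or 0 to remove)" >&2
        return 1
    fi
    echo "$dev $weight"
}

weight_rule 8:16 300    # rule that sets weight 300 on device 8:16
weight_rule 8:0 0       # rule that removes the per-device weight on 8:0
```

The printed rule could then be redirected into blkio.weight_device in the
cgroup of interest, as in the examples above.
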

Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
	- Specifies the upper limit on the READ rate from the device. The IO
	  rate is specified in bytes per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

- blkio.throttle.write_bps_device
	- Specifies the upper limit on the WRITE rate to the device. The IO
	  rate is specified in bytes per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

- blkio.throttle.read_iops_device
	- Specifies the upper limit on the READ rate from the device. The IO
	  rate is specified in IOs per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

- blkio.throttle.write_iops_device
	- Specifies the upper limit on the WRITE rate to the device. The IO
	  rate is specified in IOs per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Note: If both BW and IOPS rules are specified for a device, then IO is
      subject to both constraints.
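
To illustrate the note above: when both a byte-rate rule and an IOPS rule are
set, a workload can finish no faster than the slower of the two bounds. The
sketch below only does that arithmetic; ``min_seconds`` is a hypothetical
helper and the workload numbers are made up.

```shell
#!/bin/sh
# Hypothetical helper: minimum whole seconds needed to issue $1 bytes as
# $2 IOs when both a byte-rate limit ($3 bytes/s) and an IOPS limit
# ($4 IOs/s) are set. Both constraints must hold, so the slower one wins.
min_seconds() {
    by_bps=$(( ($1 + $3 - 1) / $3 ))     # time bound from the bps rule
    by_iops=$(( ($2 + $4 - 1) / $4 ))    # time bound from the iops rule
    if [ "$by_bps" -gt "$by_iops" ]; then
        echo "$by_bps"
    else
        echo "$by_iops"
    fi
}

# 4194304 bytes issued as 1024 IOs, limited to 1048576 bytes/s and 100 IOPS.
min_seconds 4194304 1024 1048576 100
```

Here the byte-rate rule alone would allow the transfer in 4 seconds, but the
IOPS rule needs 11 seconds for 1024 IOs, so the call prints 11.
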

- blkio.throttle.io_serviced
	- Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation: read or write, sync
	  or async. The first two fields specify the major and minor number
	  of the device, the third field specifies the operation type and the
	  fourth field specifies the number of IOs.

- blkio.throttle.io_service_bytes
	- Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation: read or write, sync
	  or async. The first two fields specify the major and minor number
	  of the device, the third field specifies the operation type and the
	  fourth field specifies the number of bytes.
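
The per-operation stats described above can be summed across devices with a
short awk filter. This is a sketch under the assumption that each line has
the layout ``<major>:<minor> <operation> <value>``; ``sum_op`` is a
hypothetical helper and the sample data is made up, not real kernel output.

```shell
#!/bin/sh
# Hypothetical helper: sum the values for one operation type across all
# devices, reading "<major>:<minor> <operation> <value>" lines on stdin.
sum_op() {
    awk -v op="$1" '$2 == op { total += $3 } END { print total + 0 }'
}

# Made-up sample in the layout assumed above.
sample='8:0 Read 4096
8:0 Write 8192
8:16 Read 1024
8:16 Write 0'

printf '%s\n' "$sample" | sum_op Read     # 4096 + 1024
printf '%s\n' "$sample" | sum_op Write    # 8192 + 0
```

On a system with the controller mounted as in the HOWTO, the same filter
could read a stats file directly, for example
``sum_op Read < /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes``.
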

Common files among various policies
-----------------------------------
- blkio.reset_stats
	- Writing an int to this file resets all the stats for that cgroup.