=====================
CFS Bandwidth Control
=====================

[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]

CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a task group is allocated up to "quota"
microseconds of CPU time. That quota is assigned to per-cpu run queues in
slices as threads in the cgroup become runnable. Once all quota has been
assigned any additional requests for quota will result in those threads being
throttled. Throttled threads will not be able to run again until the next
period when the quota is replenished.
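
For example, a quota of 50ms with a period of 100ms entitles the group to half
of one CPU's worth of runtime each period, while a quota of 200ms with the same
period allows it to consume up to two CPUs' worth.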

A group's unassigned quota is globally tracked, being refreshed back to
cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

The default values are::

	cpu.cfs_period_us=100ms
	cpu.cfs_quota_us=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum value allowed for either quota or period is 1ms. There is also an
upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion; these are explained in
more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.
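
As an illustration (assuming a cgroup v1 "cpu" controller mounted at
/sys/fs/cgroup/cpu and a hypothetical group named "example"), a limit could be
set and later removed as follows::

	# mkdir /sys/fs/cgroup/cpu/example
	# echo 100000 > /sys/fs/cgroup/cpu/example/cpu.cfs_period_us /* period = 100ms */
	# echo 20000 > /sys/fs/cgroup/cpu/example/cpu.cfs_quota_us /* quota = 20ms */
	# echo -1 > /sys/fs/cgroup/cpu/example/cpu.cfs_quota_us /* remove the limit */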

Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.

System wide settings
--------------------
For efficiency run-time is transferred between the global pool and CPU local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs::

	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)

Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.
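
For example, the current slice can be inspected and adjusted at run time
(the 10ms value below is only an illustrative choice)::

	# cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us
	5000
	# echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us /* slice = 10ms */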

Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.

cpu.stat:

- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
  of the group have been throttled.

This interface is read-only.
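
For example, reading the file directly produces one "field value" pair per
line (the numbers below are purely illustrative)::

	# cat cpu.stat
	nr_periods 200
	nr_throttled 40
	throttled_time 10000000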

Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy:

  e.g. \Sum (c_i) may exceed C

[ Where C is the parent's bandwidth, and c_i its children ]

There are two ways in which a group may become throttled:

	a. it fully consumes its own quota within a period
	b. a parent's quota is fully consumed within its period
In case b) above, even though the child may have runtime remaining it will not
be allowed to run until the parent's runtime is refreshed.
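
A minimal sketch of such an over-subscribed hierarchy (using hypothetical
cgroup names, with periods left at their defaults)::

	# echo 100000 > parent/cpu.cfs_quota_us /* C = 100ms */
	# echo 80000 > parent/child1/cpu.cfs_quota_us /* c_1 = 80ms */
	# echo 80000 > parent/child2/cpu.cfs_quota_us /* c_2 = 80ms */

Here each child's quota is individually attainable (80ms <= 100ms), but their
sum (160ms) exceeds the parent's quota; a child that still has runtime left may
therefore be throttled once the parent's 100ms is exhausted, until the parent's
runtime is refreshed.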

CFS Bandwidth Quota Caveats
---------------------------
Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
the slice may be returned to the global pool if all threads on that cpu become
unrunnable. This is configured at compile time by the min_cfs_rq_runtime
variable. This is a performance tweak that helps prevent added contention on
the global lock.

The fact that cpu-local slices do not expire results in some interesting corner
cases that should be understood.

For cgroup cpu-constrained applications that are cpu-bound this is a
relatively moot point because they will naturally consume the entirety of their
quota as well as the entirety of each cpu-local slice in each period. As a
result it is expected that nr_periods roughly equals nr_throttled, and that
cpuacct.usage will increase roughly equal to cfs_quota_us in each period.

For highly-threaded, non-cpu-bound applications this non-expiration nuance
allows applications to briefly burst past their quota limits by the amount of
unused slice on each cpu that the task group is running on (typically at most
1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
applies if quota had been assigned to a cpu and then not fully used or returned
in previous periods. This burst amount will not be transferred between cores.
As a result, this mechanism still strictly limits the task group to quota
average usage, albeit over a longer time window than a single period.  This
also limits the burst ability to no more than 1ms per cpu.  This provides a
better, more predictable user experience for highly threaded applications with
small quota limits on high core count machines. It also eliminates the
propensity to throttle these applications while simultaneously using less than
quota amounts of cpu. Another way to say this is that by allowing the unused
portion of a slice to remain valid across periods we have decreased the
possibility of wastefully expiring quota on cpu-local silos that don't need a
full slice's amount of cpu time.
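
As a rough, purely illustrative calculation: a group with a 20ms quota per
100ms period whose threads run on 8 cpus could, in the worst case, retain up to
8 * 1ms = 8ms of leftover cpu-local runtime from earlier periods, and so
consume up to 28ms in a single later period. Its long-run average remains
capped at 20ms per period, since the burst is only ever runtime that went
unused previously.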

The interaction between cpu-bound and non-cpu-bound interactive applications
should also be considered, especially when single core usage hits 100%. If you
gave each of these applications half of a cpu-core and they both got scheduled
on the same CPU it is theoretically possible that the non-cpu-bound application
will use up to 1ms additional quota in some periods, thereby preventing the
cpu-bound application from fully using its quota by that same amount. In these
instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
decide which application is chosen to run, as they will both be runnable and
have remaining quota. This runtime discrepancy will be made up in the following
periods when the interactive application idles.

Examples
--------
1. Limit a group to 1 CPU worth of runtime.

   If period is 250ms and quota is also 250ms, the group will get
   1 CPU worth of runtime every 250ms::

	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
	# echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine

   With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
   runtime every 500ms::

	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
	# echo 500000 > cpu.cfs_period_us /* period = 500ms */

   The larger period here allows for increased burst capacity.

3. Limit a group to 20% of 1 CPU.

   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::

	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
	# echo 50000 > cpu.cfs_period_us /* period = 50ms */

   By using a small period here we are ensuring a consistent latency
   response at the expense of burst capacity.