=====================
CFS Bandwidth Control
=====================

[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]

CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a task group is allocated up to "quota"
microseconds of CPU time. That quota is assigned to per-cpu run queues in
slices as threads in the cgroup become runnable. Once all quota has been
assigned, any additional requests for quota will result in those threads being
throttled. Throttled threads will not be able to run again until the next
period, when the quota is replenished.

A group's unassigned quota is globally tracked, being refreshed back to
cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

The default values are::

    cpu.cfs_period_us=100ms
    cpu.cfs_quota_us=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum allowed value for either quota or period is 1ms. There is also an
upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion; these are explained in
more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.

Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.
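
As a minimal illustration of this interface (the mount point
/sys/fs/cgroup/cpu and the group name "example" below are assumptions about
the local setup, not part of the interface itself), a group can be constrained
and later returned to the unconstrained state::

    # mkdir /sys/fs/cgroup/cpu/example
    # echo 100000 > /sys/fs/cgroup/cpu/example/cpu.cfs_quota_us  /* quota = 100ms */
    # echo 100000 > /sys/fs/cgroup/cpu/example/cpu.cfs_period_us /* period = 100ms */
    # echo -1 > /sys/fs/cgroup/cpu/example/cpu.cfs_quota_us      /* unconstrained */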

System wide settings
--------------------
For efficiency, run-time is transferred between the global pool and CPU-local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs::

    /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)

Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.

Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.

cpu.stat:

- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
  of the group have been throttled.

This interface is read-only.

Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy:

  e.g. \Sum (c_i) may exceed C

[ Where C is the parent's bandwidth, and c_i its children ]

There are two ways in which a group may become throttled:

  a. it fully consumes its own quota within a period
  b. a parent's quota is fully consumed within its period

In case b) above, even though the child may have runtime remaining, it will
not be allowed to run until the parent's runtime is refreshed.

CFS Bandwidth Quota Caveats
---------------------------
Once a slice is assigned to a cpu it does not expire. However, all but 1ms of
the slice may be returned to the global pool if all threads on that cpu become
unrunnable. This is configured at compile time by the min_cfs_rq_runtime
variable. This is a performance tweak that helps prevent added contention on
the global lock.

The fact that cpu-local slices do not expire results in some interesting corner
cases that should be understood.

For cgroup-constrained applications that are cpu-bound, this is a relatively
moot point because they will naturally consume the entirety of their quota as
well as the entirety of each cpu-local slice in each period. As a result, it
is expected that nr_periods roughly equals nr_throttled, and that
cpuacct.usage will increase by roughly cfs_quota_us in each period.
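
For instance, reading cpu.stat for such a fully cpu-bound group might show
nr_periods and nr_throttled advancing in near lockstep (the values below are
illustrative only, not taken from a real system)::

    # cat cpu.stat
    nr_periods 1000
    nr_throttled 989
    throttled_time 52309170310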

For highly-threaded, non-cpu bound applications this non-expiration nuance
allows applications to briefly burst past their quota limits by the amount of
unused slice on each cpu that the task group is running on (typically at most
1ms per cpu, or as defined by min_cfs_rq_runtime). This slight burst only
applies if quota had been assigned to a cpu and then not fully used or returned
in previous periods. This burst amount will not be transferred between cores.
As a result, this mechanism still strictly limits the task group to quota
average usage, albeit over a longer time window than a single period. This
also limits the burst ability to no more than 1ms per cpu (a worked example
follows at the end of this section). This provides a better, more predictable
user experience for highly threaded applications with small quota limits on
high core count machines. It also eliminates the propensity to throttle these
applications while simultaneously using less than quota amounts of cpu.
Another way to say this is that by allowing the unused portion of a slice to
remain valid across periods we have decreased the possibility of wastefully
expiring quota on cpu-local silos that do not need a full slice's amount of
cpu time.

The interaction between cpu-bound and non-cpu-bound interactive applications
should also be considered, especially when single core usage hits 100%. If you
gave each of these applications half of a cpu-core and they both got scheduled
on the same CPU, it is theoretically possible that the non-cpu bound
application will use up to 1ms additional quota in some periods, thereby
preventing the cpu-bound application from fully using its quota by that same
amount. In these instances it will be up to the CFS algorithm (see
sched-design-CFS.rst) to decide which application is chosen to run, as they
will both be runnable and have remaining quota. This runtime discrepancy will
be made up in the following periods when the interactive application idles.
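
As a worked illustration of the burst bound discussed above (the numbers here
are chosen for the example, not mandated by the interface): a group with a
20ms quota per 100ms period whose threads run across 8 cpus could, in the
worst case, consume up to 20ms + 8 * 1ms = 28ms of cpu time in a single
period. Because the extra 8ms is quota that was assigned but left unused in
earlier periods, average usage over the longer window remains capped at 20ms
per period.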

Examples
--------
1. Limit a group to 1 CPU worth of runtime::

    If period is 250ms and quota is also 250ms, the group will get
    1 CPU worth of runtime every 250ms.

    # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
    # echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.

   With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
   runtime every 500ms::

    # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
    # echo 500000 > cpu.cfs_period_us /* period = 500ms */

   The larger period here allows for increased burst capacity.

3. Limit a group to 20% of 1 CPU.

   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::

    # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
    # echo 50000 > cpu.cfs_period_us /* period = 50ms */

   By using a small period here we are ensuring a consistent latency
   response at the expense of burst capacity.
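
In general, the CPU share granted by a configuration is simply quota / period:
n full CPUs worth of runtime requires quota = n * period (examples 1 and 2
above), and fractional shares follow the same arithmetic (example 3:
10ms / 50ms = 0.2 CPU). The period then only controls the granularity over
which that share is enforced, trading burst capacity against latency as the
examples illustrate.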