================
Control Group v2
================

:Date: October, 2015
:Author: Tejun Heo <tj@kernel.org>

This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.

.. CONTENTS

   1. Introduction
     1-1. Terminology
     1-2. What is cgroup?
   2. Basic Operations
     2-1. Mounting
     2-2. Organizing Processes and Threads
       2-2-1. Processes
       2-2-2. Threads
     2-3. [Un]populated Notification
     2-4. Controlling Controllers
       2-4-1. Enabling and Disabling
       2-4-2. Top-down Constraint
       2-4-3. No Internal Process Constraint
     2-5. Delegation
       2-5-1. Model of Delegation
       2-5-2. Delegation Containment
     2-6. Guidelines
       2-6-1. Organize Once and Control
       2-6-2. Avoid Name Collisions
   3. Resource Distribution Models
     3-1. Weights
     3-2. Limits
     3-3. Protections
     3-4. Allocations
   4. Interface Files
     4-1. Format
     4-2. Conventions
     4-3. Core Interface Files
   5. Controllers
     5-1. CPU
       5-1-1. CPU Interface Files
     5-2. Memory
       5-2-1. Memory Interface Files
       5-2-2. Usage Guidelines
       5-2-3. Memory Ownership
     5-3. IO
       5-3-1. IO Interface Files
       5-3-2. Writeback
       5-3-3. IO Latency
         5-3-3-1. How IO Latency Throttling Works
         5-3-3-2. IO Latency Interface Files
       5-3-4. IO Priority
     5-4. PID
       5-4-1. PID Interface Files
     5-5. Cpuset
       5-5-1. Cpuset Interface Files
     5-6. Device
     5-7. RDMA
       5-7-1. RDMA Interface Files
     5-8. HugeTLB
       5-8-1. HugeTLB Interface Files
     5-9. Misc
       5-9-1. perf_event
     5-N. Non-normative information
       5-N-1. CPU controller root cgroup process behaviour
       5-N-2. IO controller root cgroup process behaviour
   6. Namespace
     6-1. Basics
     6-2. The Root and Views
     6-3. Migration and setns(2)
     6-4. Interaction with Other Namespaces
   P. Information on Kernel Programming
     P-1. Filesystem Support for Writeback
   D. Deprecated v1 Core Features
   R. Issues with v1 and Rationales for v2
     R-1. Multiple Hierarchies
     R-2. Thread Granularity
     R-3. Competition Between Inner Nodes and Threads
     R-4. Other Interface Issues
     R-5. Controller Issues and Remedies
       R-5-1. Memory


Introduction
============

Terminology
-----------

"cgroup" stands for "control group" and is never capitalized. The
singular form is used to designate the whole feature and also as a
qualifier as in "cgroup controllers". When explicitly referring to
multiple individual control groups, the plural form "cgroups" is used.


What is cgroup?
---------------

cgroup is a mechanism to organize processes hierarchically and
distribute system resources along the hierarchy in a controlled and
configurable manner.

cgroup is largely composed of two parts - the core and controllers.
cgroup core is primarily responsible for hierarchically organizing
processes. A cgroup controller is usually responsible for
distributing a specific type of system resource along the hierarchy,
although there are utility controllers which serve purposes other than
resource distribution.

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup. All threads of a process belong to the
same cgroup. On creation, all processes are put in the cgroup that
the parent process belongs to at the time. A process can be migrated
to another cgroup. Migration of a process doesn't affect already
existing descendant processes.

Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup. All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups comprising the inclusive
sub-hierarchy of the cgroup. When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further. The
restrictions set closer to the root in the hierarchy cannot be
overridden from further away.


Basic Operations
================

Mounting
--------

Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2
hierarchy can be mounted with the following mount command::

  # mount -t cgroup2 none $MOUNT_POINT

cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All
controllers which support v2 and are not bound to a v1 hierarchy are
automatically bound to the v2 hierarchy and show up at the root.
Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies. This allows mixing the v2 hierarchy with
the legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the v2 hierarchy after the final umount of the previous hierarchy.
Similarly, a controller should be fully disabled to be moved out of
the unified hierarchy and it may take some time for the disabled
controller to become available for other hierarchies; furthermore, due
to inter-controller dependencies, other controllers may need to be
disabled too.

While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers after system boot.

During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible. To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and makes them always available in v2.

cgroup v2 currently supports the following mount options.

  nsdelegate
        Consider cgroup namespaces as delegation boundaries. This
        option is system wide and can only be set on mount or modified
        through remount from the init namespace. The mount option is
        ignored on non-init namespace mounts. Please refer to the
        Delegation section for details.

  memory_localevents
        Only populate memory.events with data for the current cgroup,
        and not any subtrees. This is legacy behaviour; the default
        behaviour without this option is to include subtree counts.
        This option is system wide and can only be set on mount or
        modified through remount from the init namespace.
        The mount option is ignored on non-init namespace mounts.

  memory_recursiveprot
        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups. This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees. This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).


Organizing Processes and Threads
--------------------------------

Processes
~~~~~~~~~

Initially, only the root cgroup exists, to which all processes belong.
A child cgroup can be created by creating a sub-directory::

  # mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree
structure. Each cgroup has a read-writable interface file
"cgroup.procs". When read, it lists the PIDs of all processes which
belong to the cgroup one-per-line. The PIDs are not ordered and the
same PID may show up more than once if the process got moved to
another cgroup and then back or the PID got recycled while reading.

A process can be migrated into a cgroup by writing its PID to the
target cgroup's "cgroup.procs" file. Only one process can be migrated
on a single write(2) call. If a process is composed of multiple
threads, writing the PID of any thread migrates all threads of the
process.

When a process forks a child process, the new process is born into the
cgroup that the forking process belongs to at the time of the
operation. After exit, a process stays associated with the cgroup
that it belonged to at the time of exit until it's reaped; however, a
zombie process does not appear in "cgroup.procs" and thus can't be
moved to another cgroup.

A cgroup which doesn't have any children or live processes can be
destroyed by removing the directory. Note that a cgroup which doesn't
have any children and is associated only with zombie processes is
considered empty and can be removed::

  # rmdir $CGROUP_NAME

"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
cgroup is in use in the system, this file may contain multiple lines,
one for each hierarchy.
The entry for cgroup v2 is always in the format "0::$PATH"::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested

If the process becomes a zombie and the cgroup it was associated with
is removed subsequently, " (deleted)" is appended to the path::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested (deleted)


Threads
~~~~~~~

cgroup v2 supports thread granularity for a subset of controllers to
support use cases requiring hierarchical resource distribution across
the threads of a group of processes. By default, all threads of a
process belong to the same cgroup, which also serves as the resource
domain to host resource consumptions which are not specific to a
process or thread. The thread mode allows threads to be spread across
a subtree while still maintaining the common resource domain for them.

Controllers which support thread mode are called threaded controllers.
The ones which don't are called domain controllers.

Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup. The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy. The root
of a threaded subtree, that is, the nearest ancestor which is not
threaded, is called the threaded domain or thread root interchangeably
and serves as the resource domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in
different cgroups and are not subject to the no internal process
constraint - threaded controllers can be enabled on non-leaf cgroups
whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have populated child cgroups which aren't threaded. Because the
root cgroup is not subject to the no internal process constraint, it
can serve both as a threaded domain and a parent to domain cgroups.

The current operation mode or type of the cgroup is shown in the
"cgroup.type" file, which indicates whether the cgroup is a normal
domain, a domain which is serving as the domain of a threaded subtree,
or a threaded cgroup.

On creation, a cgroup is always a domain cgroup and can be made
threaded by writing "threaded" to the "cgroup.type" file.
The operation is one way::

  # echo threaded > cgroup.type

Once threaded, the cgroup can't be made a domain again. To enable the
thread mode, the following conditions must be met.

- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.

- When the parent is an unthreaded domain, it must not have any domain
  controllers enabled or populated domain children. The root is
  exempt from this requirement.

Topology-wise, a cgroup can be in an invalid state. Please consider
the following topology::

  A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can
host child domains. C can't be used until it is turned into a
threaded cgroup. The "cgroup.type" file will report "domain invalid"
in these cases. Operations which fail due to invalid topology use
EOPNOTSUPP as the errno.

A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.
A threaded domain reverts to a normal domain when the conditions
clear.

When read, "cgroup.threads" contains the list of the thread IDs of all
threads in the cgroup. Except that the operations are per-thread
instead of per-process, "cgroup.threads" has the same format and
behaves the same way as "cgroup.procs". While "cgroup.threads" can be
written to in any cgroup, as it can only move threads inside the same
threaded domain, its operations are confined inside each threaded
subtree.

The threaded domain cgroup serves as the resource domain for the whole
subtree, and, while the threads can be scattered across the subtree,
all the processes are considered to be in the threaded domain cgroup.
"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
processes in the subtree and is not readable in the subtree proper.
However, "cgroup.procs" can be written to from anywhere in the subtree
to migrate all threads of the matching process to the cgroup.

Only threaded controllers can be enabled in a threaded subtree. When
a threaded controller is enabled inside a threaded subtree, it only
accounts for and controls resource consumptions associated with the
threads in the cgroup and its descendants. All consumptions which
aren't tied to a specific thread belong to the threaded domain cgroup.
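
For example, the following sketch creates a threaded cgroup under the
current cgroup and moves one thread into it. The name "tgrp" and $TID
are illustrative, and the sketch assumes that no domain controllers
are enabled in the current cgroup's "cgroup.subtree_control" and that
the thread's process already resides in this resource domain::

  # mkdir tgrp
  # echo threaded > tgrp/cgroup.type
  # cat cgroup.type
  domain threaded
  # echo $TID > tgrp/cgroup.threads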

Because a threaded subtree is exempt from the no internal process
constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.


[Un]populated Notification
--------------------------

Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. This can be used, for
example, to start a clean-up operation after all processes of a given
sub-hierarchy have exited. The populated state updates and
notifications are recursive. Consider the following sub-hierarchy
where the numbers in the parentheses represent the numbers of processes
in each cgroup::

  A(4) - B(0) - C(1)
              \ D(0)

A, B and C's "populated" fields would be 1 while D's is 0. After the
one process in C exits, B and C's "populated" fields would flip to "0"
and file modified events will be generated on the "cgroup.events"
files of both cgroups.


Controlling Controllers
-----------------------

Enabling and Disabling
~~~~~~~~~~~~~~~~~~~~~~

Each cgroup has a "cgroup.controllers" file which lists all
controllers available for the cgroup to enable::

  # cat cgroup.controllers
  cpu io memory

No controller is enabled by default. Controllers can be enabled and
disabled by writing to the "cgroup.subtree_control" file::

  # echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be
enabled. When multiple operations are specified as above, either they
all succeed or all fail. If multiple operations on the same
controller are specified, the last one is effective.

Enabling a controller in a cgroup indicates that the distribution of
the target resource across its immediate children will be controlled.
Consider the following sub-hierarchy.
The enabled controllers are listed in parentheses::

  A(cpu,memory) - B(memory) - C()
                            \ D()

As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B. As B has
"memory" enabled but not "cpu", C and D will compete freely on CPU
cycles but their division of memory available to B will be controlled.

As a controller regulates the distribution of the target resource to
the cgroup's children, enabling it creates the controller's interface
files in the child cgroups. In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D. Likewise, disabling "memory" from B would remove the "memory."
prefixed controller interface files from C and D. This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.


Top-down Constraint
~~~~~~~~~~~~~~~~~~~

Resources are distributed top-down and a cgroup can further distribute
a resource only if the resource has been distributed to it from the
parent. This means that all non-root "cgroup.subtree_control" files
can only contain controllers which are enabled in the parent's
"cgroup.subtree_control" file. A controller can be enabled only if
the parent has the controller enabled and a controller can't be
disabled if one or more children have it enabled.


No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Non-root cgroups can distribute domain resources to their children
only when they don't have any processes of their own. In other words,
only domain cgroups which don't contain any processes can have domain
controllers enabled in their "cgroup.subtree_control" files.

This guarantees that, when a domain controller is looking at the part
of the hierarchy which has it enabled, processes are always only on
the leaves. This rules out situations where child cgroups compete
against internal processes of the parent.

The root cgroup is exempt from this restriction. Root contains
processes and anonymous resource consumption which can't be associated
with any other cgroups and requires special treatment from most
controllers. How resource consumption in the root cgroup is governed
is up to each controller (for more information on this topic please
refer to the Non-normative information section in the Controllers
chapter).
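
Because of this, enabling a domain controller on a populated cgroup
requires moving its processes into a child first, as described in the
note below. A minimal sketch of the sequence, with the child name
"leaf" and the controller chosen purely for illustration::

  # mkdir leaf
  # for pid in $(cat cgroup.procs); do echo $pid > leaf/cgroup.procs; done
  # echo "+memory" > cgroup.subtree_control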

Note that the restriction doesn't get in the way if there is no
enabled controller in the cgroup's "cgroup.subtree_control". This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its processes to the
children before enabling controllers in its "cgroup.subtree_control"
file.


Delegation
----------

Model of Delegation
~~~~~~~~~~~~~~~~~~~

A cgroup can be delegated in two ways. First, to a less privileged
user by granting the user write access to the directory and its
"cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" files.
Second, if the "nsdelegate" mount option is set, automatically to a
cgroup namespace on namespace creation.

Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.

The end results are equivalent for both delegation types. Once
delegated, the user can build a sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent. The limits and other settings
of all resource controllers are hierarchical and regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may be limited explicitly in the future.


Delegation Containment
~~~~~~~~~~~~~~~~~~~~~~

A delegated sub-hierarchy is contained in the sense that processes
can't be moved into or out of the sub-hierarchy by the delegatee.

For delegations to a less privileged user, this is achieved by
requiring the following conditions for a process with a non-root euid
to migrate a target process into a cgroup by writing its PID to the
"cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file of the
  common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy, it can't pull
in from or push out to outside the sub-hierarchy.

As an example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs". U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" files and thus the write
will be denied with -EACCES.

For delegations to namespaces, containment is achieved by requiring
that both the source and destination cgroups are reachable from the
namespace of the process which is attempting the migration. If either
is not reachable, the migration is rejected with -ENOENT.


Guidelines
----------

Organize Once and Control
~~~~~~~~~~~~~~~~~~~~~~~~~

Migrating a process across cgroups is a relatively expensive operation
and stateful resources such as memory are not moved together with the
process. This is an explicit design decision as there often exist
inherent trade-offs between migration and various hot paths in terms
of synchronization cost.

As such, migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged. A workload
should be assigned to a cgroup according to the system's logical and
resource structure once on start-up. Dynamic adjustments to resource
distribution can be made by changing controller configuration through
the interface files.


Avoid Name Collisions
~~~~~~~~~~~~~~~~~~~~~

Interface files for a cgroup and its child cgroups occupy the same
directory and it is possible to create child cgroups which collide
with interface files.

All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot. A controller's name is composed of lowercase letters and
underscores ('_') but never begins with an underscore, so it can be
used as the prefix character for collision avoidance. Also, interface
file names won't start or end with terms which are often used in
categorizing workloads such as job, service, slice, unit or workload.

cgroup doesn't do anything to prevent name collisions and it's the
user's responsibility to avoid them.


Resource Distribution Models
============================

cgroup controllers implement several resource distribution schemes
depending on the resource type and expected use cases. This section
describes major schemes in use along with their expected behaviors.


Weights
-------

A parent's resource is distributed by adding up the weights of all
active children and giving each the fraction matching the ratio of its
weight against the sum. As only children which can make use of the
resource at the moment participate in the distribution, this is
work-conserving. Due to the dynamic nature, this model is usually
used for stateless resources.

All weights are in the range [1, 10000] with the default at 100. This
allows symmetric multiplicative biases in both directions at fine
enough granularity while staying in the intuitive range.

As long as the weight is in range, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"cpu.weight" proportionally distributes CPU cycles to active children
and is an example of this type.


Limits
------

A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.

Limits are in the range [0, max] and default to "max", which is a
no-op.

As limits can be over-committed, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.
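
For example, with the "io" controller enabled, read bandwidth can be
capped at 2M bytes per second and write IOPS at 120 for a device; the
8:16 device numbers below are illustrative::

  # echo "8:16 rbps=2097152 wiops=120" > io.max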


Protections
-----------

A cgroup is protected up to the configured amount of the resource
as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
only up to the amount available to the parent is protected among
children.

Protections are in the range [0, max] and default to 0, which is a
no-op.

As protections can be over-committed, all configuration combinations
are valid and there is no reason to reject configuration changes or
process migrations.

"memory.low" implements best-effort memory protection and is an
example of this type.


Allocations
-----------

A cgroup is exclusively allocated a certain amount of a finite
resource. Allocations can't be over-committed - the sum of the
allocations of children cannot exceed the amount of resource
available to the parent.

Allocations are in the range [0, max] and default to 0, which is no
resource.

As allocations can't be over-committed, some configuration
combinations are invalid and should be rejected. Also, if the
resource is mandatory for execution of processes, process migrations
may be rejected.

"cpu.rt.max" hard-allocates realtime slices and is an example of this
type.


Interface Files
===============

Format
------

All interface files should be in one of the following formats whenever
possible::

  New-line separated values
  (when only one value can be written at once)

        VAL0\n
        VAL1\n
        ...

  Space separated values
  (when read-only or multiple values can be written at once)

        VAL0 VAL1 ...\n

  Flat keyed

        KEY0 VAL0\n
        KEY1 VAL1\n
        ...

  Nested keyed

        KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
        KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
        ...
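
As a concrete illustration, "memory.events" (mentioned earlier with
the "memory_localevents" mount option) is a flat keyed file; the
values below are made up::

  # cat memory.events
  low 0
  high 12
  max 4
  oom 0
  oom_kill 0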

For a writable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time. For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.


Conventions
-----------

- Settings for a single feature should be contained in a single file.

- The root cgroup should be exempt from resource control and thus
  shouldn't have resource control interface files.

- The default time unit is microseconds. If a different unit is ever
  used, an explicit unit suffix must be present.

- A parts-per quantity should use a percentage decimal with at least
  a two-digit fractional part - e.g. 13.40.

- If a controller implements weight based resource distribution, its
  interface file should be named "weight" and have the range [1,
  10000] with 100 as the default. The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the interface files should be named "min" and "max"
  respectively. If a controller implements a best effort resource
  guarantee and/or limit, the interface files should be named "low"
  and "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has a configurable default value and keyed specific
  overrides, the default entry should be keyed with "default" and
  appear as the first entry in the file.

  The default value can be updated by writing either "default $VAL" or
  "$VAL".

  When writing to update a specific override, "default" can be used as
  the value to indicate removal of the override. Override entries
  with "default" as the value must not appear when read.

  For example, a setting which is keyed by major:minor device numbers
  with integer values may look like the following::

    # cat cgroup-example-interface-file
    default 150
    8:0 300

  The default value can be updated by::

    # echo 125 > cgroup-example-interface-file

  or::

    # echo "default 125" > cgroup-example-interface-file

  An override can be set by::

    # echo "8:16 170" > cgroup-example-interface-file

  and cleared by::

    # echo "8:0 default" > cgroup-example-interface-file
    # cat cgroup-example-interface-file
    default 125
    8:16 170

- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
  Whenever a notifiable event happens, a file modified event should be
  generated on the file.


Core Interface Files
--------------------

All cgroup core files are prefixed with "cgroup."

  cgroup.type
        A read-write single value file which exists on non-root
        cgroups.

        When read, it indicates the current type of the cgroup, which
        can be one of the following values.

        - "domain" : A normal valid domain cgroup.

        - "domain threaded" : A threaded domain cgroup which is
          serving as the root of a threaded subtree.

        - "domain invalid" : A cgroup which is in an invalid state.
          It can't be populated or have controllers enabled. It may
          be allowed to become a threaded cgroup.

        - "threaded" : A threaded cgroup which is a member of a
          threaded subtree.

        A cgroup can be turned into a threaded cgroup by writing
        "threaded" to this file.

  cgroup.procs
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the PIDs of all processes which belong to
        the cgroup one-per-line. The PIDs are not ordered and the
        same PID may show up more than once if the process got moved
        to another cgroup and then back or the PID got recycled while
        reading.

        A PID can be written to migrate the process associated with
        the PID to the cgroup.
        The writer should match all of the following conditions.

        - It must have write access to the "cgroup.procs" file.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

        In a threaded cgroup, reading this file fails with EOPNOTSUPP
        as all the processes belong to the thread root. Writing is
        supported and moves every thread of the process to the cgroup.

  cgroup.threads
        A read-write new-line separated values file which exists on
        all cgroups.

        When read, it lists the TIDs of all threads which belong to
        the cgroup one-per-line. The TIDs are not ordered and the
        same TID may show up more than once if the thread got moved to
        another cgroup and then back or the TID got recycled while
        reading.

        A TID can be written to migrate the thread associated with the
        TID to the cgroup. The writer should match all of the
        following conditions.

        - It must have write access to the "cgroup.threads" file.

        - The cgroup that the thread is currently in must be in the
          same resource domain as the destination cgroup.

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

  cgroup.controllers
        A read-only space separated values file which exists on all
        cgroups.

        It shows a space separated list of all controllers available
        to the cgroup. The controllers are not ordered.

  cgroup.subtree_control
        A read-write space separated values file which exists on all
        cgroups. Starts out empty.

        When read, it shows a space separated list of the controllers
        which are enabled to control resource distribution from the
        cgroup to its children.

        A space separated list of controllers prefixed with '+' or '-'
        can be written to enable or disable controllers. A controller
        name prefixed with '+' enables the controller and one prefixed
        with '-' disables it. If a controller appears more than once
        on the list, the last one is effective. When multiple enable
        and disable operations are specified, either all succeed or
        all fail.
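
        For example, since later operations on the same controller
        override earlier ones, the following write ends up enabling
        only "memory" (a sketch which assumes both controllers are
        available)::

          # echo "+cpu -cpu +memory" > cgroup.subtree_control
          # cat cgroup.subtree_control
          memory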

  cgroup.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined. Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          populated
                1 if the cgroup or its descendants contains any live
                processes; otherwise, 0.
          frozen
                1 if the cgroup is frozen; otherwise, 0.

  cgroup.max.descendants
        A read-write single value file. The default is "max".

        Maximum allowed number of descendant cgroups.
        If the actual number of descendants is equal to or larger,
        an attempt to create a new cgroup in the hierarchy will fail.

  cgroup.max.depth
        A read-write single value file. The default is "max".

        Maximum allowed descent depth below the current cgroup.
        If the actual descent depth is equal to or larger,
        an attempt to create a new child cgroup will fail.

  cgroup.stat
        A read-only flat-keyed file with the following entries:

          nr_descendants
                Total number of visible descendant cgroups.

          nr_dying_descendants
                Total number of dying descendant cgroups. A cgroup
                becomes dying after being deleted by a user. The
                cgroup will remain in the dying state for some
                undefined time (which can depend on system load)
                before being completely destroyed.

                A process can't enter a dying cgroup under any
                circumstances, and a dying cgroup can't revive.

                A dying cgroup can consume system resources not
                exceeding limits, which were active at the moment of
                cgroup deletion.

  cgroup.freeze
        A read-write single value file which exists on non-root
        cgroups. Allowed values are "0" and "1". The default is "0".

        Writing "1" to the file causes freezing of the cgroup and all
        descendant cgroups. This means that all belonging processes
        will be stopped and will not run until the cgroup is
        explicitly unfrozen. Freezing of the cgroup may take some
        time; when this action is completed, the "frozen" value in the
        cgroup.events control file will be updated to "1" and the
        corresponding notification will be issued.

        A cgroup can be frozen either by its own settings, or by
        settings of any ancestor cgroups. If any ancestor cgroup is
        frozen, the cgroup will remain frozen.

        Processes in the frozen cgroup can be killed by a fatal
        signal.
946*4882a593Smuzhiyun They also can enter and leave a frozen cgroup: either by an explicit 947*4882a593Smuzhiyun move by a user, or if freezing of the cgroup races with fork(). 948*4882a593Smuzhiyun If a process is moved to a frozen cgroup, it stops. If a process is 949*4882a593Smuzhiyun moved out of a frozen cgroup, it becomes running. 950*4882a593Smuzhiyun 951*4882a593Smuzhiyun Frozen status of a cgroup doesn't affect any cgroup tree operations: 952*4882a593Smuzhiyun it's possible to delete a frozen (and empty) cgroup, as well as 953*4882a593Smuzhiyun create new sub-cgroups. 954*4882a593Smuzhiyun 955*4882a593SmuzhiyunControllers 956*4882a593Smuzhiyun=========== 957*4882a593Smuzhiyun 958*4882a593SmuzhiyunCPU 959*4882a593Smuzhiyun--- 960*4882a593Smuzhiyun 961*4882a593SmuzhiyunThe "cpu" controllers regulates distribution of CPU cycles. This 962*4882a593Smuzhiyuncontroller implements weight and absolute bandwidth limit models for 963*4882a593Smuzhiyunnormal scheduling policy and absolute bandwidth allocation model for 964*4882a593Smuzhiyunrealtime scheduling policy. 965*4882a593Smuzhiyun 966*4882a593SmuzhiyunIn all the above models, cycles distribution is defined only on a temporal 967*4882a593Smuzhiyunbase and it does not account for the frequency at which tasks are executed. 968*4882a593SmuzhiyunThe (optional) utilization clamping support allows to hint the schedutil 969*4882a593Smuzhiyuncpufreq governor about the minimum desired frequency which should always be 970*4882a593Smuzhiyunprovided by a CPU, as well as the maximum desired frequency, which should not 971*4882a593Smuzhiyunbe exceeded by a CPU. 972*4882a593Smuzhiyun 973*4882a593SmuzhiyunWARNING: cgroup2 doesn't yet support control of realtime processes and 974*4882a593Smuzhiyunthe cpu controller can only be enabled when all RT processes are in 975*4882a593Smuzhiyunthe root cgroup. Be aware that system management software may already 976*4882a593Smuzhiyunhave placed RT processes into nonroot cgroups during the system boot 977*4882a593Smuzhiyunprocess, and these processes may need to be moved to the root cgroup 978*4882a593Smuzhiyunbefore the cpu controller can be enabled. 979*4882a593Smuzhiyun 980*4882a593Smuzhiyun 981*4882a593SmuzhiyunCPU Interface Files 982*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~ 983*4882a593Smuzhiyun 984*4882a593SmuzhiyunAll time durations are in microseconds. 985*4882a593Smuzhiyun 986*4882a593Smuzhiyun cpu.stat 987*4882a593Smuzhiyun A read-only flat-keyed file. 988*4882a593Smuzhiyun This file exists whether the controller is enabled or not. 989*4882a593Smuzhiyun 990*4882a593Smuzhiyun It always reports the following three stats: 991*4882a593Smuzhiyun 992*4882a593Smuzhiyun - usage_usec 993*4882a593Smuzhiyun - user_usec 994*4882a593Smuzhiyun - system_usec 995*4882a593Smuzhiyun 996*4882a593Smuzhiyun and the following three when the controller is enabled: 997*4882a593Smuzhiyun 998*4882a593Smuzhiyun - nr_periods 999*4882a593Smuzhiyun - nr_throttled 1000*4882a593Smuzhiyun - throttled_usec 1001*4882a593Smuzhiyun 1002*4882a593Smuzhiyun cpu.weight 1003*4882a593Smuzhiyun A read-write single value file which exists on non-root 1004*4882a593Smuzhiyun cgroups. The default is "100". 1005*4882a593Smuzhiyun 1006*4882a593Smuzhiyun The weight in the range [1, 10000]. 1007*4882a593Smuzhiyun 1008*4882a593Smuzhiyun cpu.weight.nice 1009*4882a593Smuzhiyun A read-write single value file which exists on non-root 1010*4882a593Smuzhiyun cgroups. The default is "0". 
    The nice value is in the range [-20, 19].

    This interface file is an alternative interface for
    "cpu.weight" and allows reading and setting weight using the
    same values used by nice(2).  Because the range is smaller
    and granularity is coarser for the nice values, the read
    value is the closest approximation of the current weight.

  cpu.max
    A read-write two value file which exists on non-root
    cgroups.  The default is "max 100000".

    The maximum bandwidth limit.  It's in the following format::

      $MAX $PERIOD

    which indicates that the group may consume up to $MAX in
    each $PERIOD duration.  "max" for $MAX indicates no limit.
    If only one number is written, $MAX is updated.

  cpu.pressure
    A read-only nested-keyed file which exists on non-root
    cgroups.

    Shows pressure stall information for CPU.  See
    :ref:`Documentation/accounting/psi.rst <psi>` for details.

  cpu.uclamp.min
    A read-write single value file which exists on non-root
    cgroups.  The default is "0", i.e. no utilization boosting.

    The requested minimum utilization (protection) as a
    percentage rational number, e.g. 12.34 for 12.34%.

    This interface allows reading and setting minimum
    utilization clamp values similar to sched_setattr(2).  This
    minimum utilization value is used to clamp the task specific
    minimum utilization clamp.

    The requested minimum utilization (protection) is always
    capped by the current value for the maximum utilization
    (limit), i.e. `cpu.uclamp.max`.

  cpu.uclamp.max
    A read-write single value file which exists on non-root
    cgroups.  The default is "max", i.e. no utilization capping.

    The requested maximum utilization (limit) as a percentage
    rational number, e.g. 98.76 for 98.76%.

    This interface allows reading and setting maximum
    utilization clamp values similar to sched_setattr(2).  This
    maximum utilization value is used to clamp the task specific
    maximum utilization clamp.
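For example, to limit a cgroup to at most 50ms of CPU time in every
100ms period while doubling its relative weight (the cgroup path is
hypothetical)::

  # echo "50000 100000" > /sys/fs/cgroup/workload/cpu.max
  # echo 200 > /sys/fs/cgroup/workload/cpu.weight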
Memory
------

The "memory" controller regulates the distribution of memory.
Memory is stateful and implements both limit and protection models.
Due to the intertwining between memory usage and reclaim pressure
and the stateful nature of memory, the distribution model is
relatively complex.

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be
accounted and controlled to a reasonable extent.  Currently, the
following types of memory usages are tracked.

- Userland memory - page cache and anonymous memory.

- Kernel data structures such as dentries and inodes.

- TCP socket buffers.

The above list may expand in the future for better coverage.


Memory Interface Files
~~~~~~~~~~~~~~~~~~~~~~

All memory amounts are in bytes.  If a value which is not aligned to
PAGE_SIZE is written, the value may be rounded up to the closest
PAGE_SIZE multiple when read back.

  memory.current
    A read-only single value file which exists on non-root
    cgroups.

    The total amount of memory currently being used by the
    cgroup and its descendants.

  memory.min
    A read-write single value file which exists on non-root
    cgroups.  The default is "0".

    Hard memory protection.  If the memory usage of a cgroup is
    within its effective min boundary, the cgroup's memory won't
    be reclaimed under any conditions.  If there is no
    unprotected reclaimable memory available, the OOM killer is
    invoked.  Above the effective min boundary (or effective low
    boundary if it is higher), pages are reclaimed
    proportionally to the overage, reducing reclaim pressure for
    smaller overages.

    The effective min boundary is limited by the memory.min
    values of all ancestor cgroups.  If there is memory.min
    overcommitment (child cgroups are requiring more protected
    memory than the parent will allow), then each child cgroup
    will get the part of the parent's protection proportional to
    its actual memory usage below memory.min.

    Putting more memory than generally available under this
    protection is discouraged and may lead to constant OOMs.

    If a memory cgroup is not populated with processes, its
    memory.min is ignored.
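    For example, a sketch of protecting roughly 256MiB for a
    cgroup (the path is hypothetical)::

      # echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/workload/memory.min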
  memory.low
    A read-write single value file which exists on non-root
    cgroups.  The default is "0".

    Best-effort memory protection.  If the memory usage of a
    cgroup is within its effective low boundary, the cgroup's
    memory won't be reclaimed unless there is no reclaimable
    memory available in unprotected cgroups.  Above the
    effective low boundary (or effective min boundary if it is
    higher), pages are reclaimed proportionally to the overage,
    reducing reclaim pressure for smaller overages.

    The effective low boundary is limited by the memory.low
    values of all ancestor cgroups.  If there is memory.low
    overcommitment (child cgroups are requiring more protected
    memory than the parent will allow), then each child cgroup
    will get the part of the parent's protection proportional to
    its actual memory usage below memory.low.

    Putting more memory than generally available under this
    protection is discouraged.

  memory.high
    A read-write single value file which exists on non-root
    cgroups.  The default is "max".

    Memory usage throttle limit.  This is the main mechanism to
    control the memory usage of a cgroup.  If a cgroup's usage
    goes over the high boundary, the processes of the cgroup are
    throttled and put under heavy reclaim pressure.

    Going over the high limit never invokes the OOM killer and
    under extreme conditions the limit may be breached.

  memory.max
    A read-write single value file which exists on non-root
    cgroups.  The default is "max".

    Memory usage hard limit.  This is the final protection
    mechanism.  If a cgroup's memory usage reaches this limit
    and can't be reduced, the OOM killer is invoked in the
    cgroup.  Under certain circumstances, the usage may go over
    the limit temporarily.

    In the default configuration, regular 0-order allocations
    always succeed unless the OOM killer chooses the current
    task as a victim.

    Some kinds of allocations don't invoke the OOM killer.  The
    caller may retry them differently, return -ENOMEM to
    userspace, or silently ignore the failure in cases like disk
    readahead.

    This is the ultimate protection mechanism.  As long as the
    high limit is used and monitored properly, this limit's
    utility is limited to providing the final safety net.
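    A common pattern is to pair "memory.high" with a hard cap
    some distance above it, for example (path and sizes are
    hypothetical)::

      # echo 1G > /sys/fs/cgroup/workload/memory.high
      # echo 2G > /sys/fs/cgroup/workload/memory.max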
  memory.oom.group
    A read-write single value file which exists on non-root
    cgroups.  The default value is "0".

    Determines whether the cgroup should be treated as an
    indivisible workload by the OOM killer.  If set, all tasks
    belonging to the cgroup or to its descendants (if the memory
    cgroup is not a leaf cgroup) are killed together or not at
    all.  This can be used to avoid partial kills to guarantee
    workload integrity.

    Tasks with OOM protection (oom_score_adj set to -1000) are
    treated as an exception and are never killed.

    If the OOM killer is invoked in a cgroup, it's not going to
    kill any tasks outside of this cgroup, regardless of the
    memory.oom.group values of ancestor cgroups.

  memory.events
    A read-only flat-keyed file which exists on non-root
    cgroups.  The following entries are defined.  Unless
    specified otherwise, a value change in this file generates a
    file modified event.

    Note that all fields in this file are hierarchical and the
    file modified event can be generated due to an event down
    the hierarchy.  For the local events at the cgroup level see
    memory.events.local.

      low
        The number of times the cgroup is reclaimed due to high
        memory pressure even though its usage is under the low
        boundary.  This usually indicates that the low boundary
        is over-committed.

      high
        The number of times processes of the cgroup are
        throttled and routed to perform direct memory reclaim
        because the high memory boundary was exceeded.  For a
        cgroup whose memory usage is capped by the high limit
        rather than global memory pressure, this event's
        occurrences are expected.

      max
        The number of times the cgroup's memory usage was about
        to go over the max boundary.  If direct reclaim fails to
        bring it down, the cgroup goes to OOM state.

      oom
        The number of times the cgroup's memory usage reached
        the limit and allocation was about to fail.

        This event is not raised if the OOM killer is not
        considered as an option, e.g. for failed high-order
        allocations or if the caller asked not to retry
        attempts.

      oom_kill
        The number of processes belonging to this cgroup killed
        by any kind of OOM killer.

  memory.events.local
    Similar to memory.events but the fields in the file are
    local to the cgroup, i.e. not hierarchical.  The file
    modified event generated on this file reflects only the
    local events.
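    A management agent typically watches these event files for
    modification events and then reads the counters to decide on
    an action.  An illustrative read (path and counts are
    hypothetical)::

      # cat /sys/fs/cgroup/workload/memory.events
      low 0
      high 731
      max 0
      oom 0
      oom_kill 0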
  memory.stat
    A read-only flat-keyed file which exists on non-root
    cgroups.

    This breaks down the cgroup's memory footprint into
    different types of memory, type-specific details, and other
    information on the state and past events of the memory
    management system.

    All memory amounts are in bytes.

    The entries are ordered to be human readable, and new
    entries can show up in the middle.  Don't rely on items
    remaining in a fixed position; use the keys to look up
    specific values!

    If an entry has no per-node counter (and hence does not show
    up in memory.numa_stat), it is tagged with 'npn'
    (non-per-node) to indicate that it will not show up in
    memory.numa_stat.

      anon
        Amount of memory used in anonymous mappings such as
        brk(), sbrk(), and mmap(MAP_ANONYMOUS)

      file
        Amount of memory used to cache filesystem data,
        including tmpfs and shared memory.

      kernel_stack
        Amount of memory allocated to kernel stacks.

      percpu(npn)
        Amount of memory used for storing per-cpu kernel data
        structures.

      sock(npn)
        Amount of memory used in network transmission buffers

      shmem
        Amount of cached filesystem data that is swap-backed,
        such as tmpfs, shm segments, shared anonymous mmap()s

      file_mapped
        Amount of cached filesystem data mapped with mmap()

      file_dirty
        Amount of cached filesystem data that was modified but
        not yet written back to disk

      file_writeback
        Amount of cached filesystem data that was modified and
        is currently being written back to disk

      anon_thp
        Amount of memory used in anonymous mappings backed by
        transparent hugepages

      inactive_anon, active_anon, inactive_file, active_file, unevictable
        Amount of memory, swap-backed and filesystem-backed, on
        the internal memory management lists used by the page
        reclaim algorithm.

        As these represent internal list state (e.g. shmem pages
        are on anon memory management lists), inactive_foo +
        active_foo may not be equal to the value for the foo
        counter, since the foo counter is type-based, not
        list-based.

      slab_reclaimable
        Part of "slab" that might be reclaimed, such as dentries
        and inodes.
      slab_unreclaimable
        Part of "slab" that cannot be reclaimed on memory
        pressure.

      slab(npn)
        Amount of memory used for storing in-kernel data
        structures.

      workingset_refault_anon
        Number of refaults of previously evicted anonymous
        pages.

      workingset_refault_file
        Number of refaults of previously evicted file pages.

      workingset_activate_anon
        Number of refaulted anonymous pages that were
        immediately activated.

      workingset_activate_file
        Number of refaulted file pages that were immediately
        activated.

      workingset_restore_anon
        Number of restored anonymous pages which have been
        detected as an active workingset before they got
        reclaimed.

      workingset_restore_file
        Number of restored file pages which have been detected
        as an active workingset before they got reclaimed.

      workingset_nodereclaim
        Number of times a shadow node has been reclaimed.

      pgfault(npn)
        Total number of page faults incurred.

      pgmajfault(npn)
        Number of major page faults incurred.

      pgrefill(npn)
        Amount of scanned pages (in an active LRU list).

      pgscan(npn)
        Amount of scanned pages (in an inactive LRU list).

      pgsteal(npn)
        Amount of reclaimed pages.

      pgactivate(npn)
        Amount of pages moved to the active LRU list.

      pgdeactivate(npn)
        Amount of pages moved to the inactive LRU list.

      pglazyfree(npn)
        Amount of pages postponed to be freed under memory
        pressure.

      pglazyfreed(npn)
        Amount of reclaimed lazyfree pages.

      thp_fault_alloc(npn)
        Number of transparent hugepages which were allocated to
        satisfy a page fault.  This counter is not present when
        CONFIG_TRANSPARENT_HUGEPAGE is not set.

      thp_collapse_alloc(npn)
        Number of transparent hugepages which were allocated to
        allow collapsing an existing range of pages.  This
        counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
        is not set.
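    Since entry positions are not stable, scripts should look up
    stats by key rather than by line number, e.g. (the path is
    hypothetical)::

      # awk '$1 == "anon" { print $2 }' /sys/fs/cgroup/workload/memory.stat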
  memory.numa_stat
    A read-only nested-keyed file which exists on non-root
    cgroups.

    This breaks down the cgroup's memory footprint into
    different types of memory, type-specific details, and other
    information per node on the state of the memory management
    system.

    This is useful for providing visibility into the NUMA
    locality information within a memcg since the pages are
    allowed to be allocated from any physical node.  One use
    case is evaluating application performance by combining this
    information with the application's CPU allocation.

    All memory amounts are in bytes.

    The output format of memory.numa_stat is::

      type N0=<bytes in node 0> N1=<bytes in node 1> ...

    The entries are ordered to be human readable, and new
    entries can show up in the middle.  Don't rely on items
    remaining in a fixed position; use the keys to look up
    specific values!

    The entries correspond to those in memory.stat.

  memory.swap.current
    A read-only single value file which exists on non-root
    cgroups.

    The total amount of swap currently being used by the cgroup
    and its descendants.

  memory.swap.high
    A read-write single value file which exists on non-root
    cgroups.  The default is "max".

    Swap usage throttle limit.  If a cgroup's swap usage exceeds
    this limit, all its further allocations will be throttled to
    allow userspace to implement custom out-of-memory
    procedures.

    This limit marks a point of no return for the cgroup.  It is
    NOT designed to manage the amount of swapping a workload
    does during regular operation.  Compare to memory.swap.max,
    which prohibits swapping past a set amount, but lets the
    cgroup continue unimpeded as long as other memory can be
    reclaimed.

    Healthy workloads are not expected to reach this limit.

  memory.swap.max
    A read-write single value file which exists on non-root
    cgroups.  The default is "max".

    Swap usage hard limit.  If a cgroup's swap usage reaches
    this limit, anonymous memory of the cgroup will not be
    swapped out.
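    For example, swapping can be disabled entirely for a cgroup
    by setting the hard limit to zero (the path is
    hypothetical)::

      # echo 0 > /sys/fs/cgroup/workload/memory.swap.max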
  memory.swap.events
    A read-only flat-keyed file which exists on non-root
    cgroups.  The following entries are defined.  Unless
    specified otherwise, a value change in this file generates a
    file modified event.

      high
        The number of times the cgroup's swap usage was over the
        high threshold.

      max
        The number of times the cgroup's swap usage was about to
        go over the max boundary and swap allocation failed.

      fail
        The number of times swap allocation failed either
        because of running out of swap system-wide or because of
        the max limit.

    When the swap limit is reduced under the current usage, the
    existing swap entries are reclaimed gradually and the swap
    usage may stay higher than the limit for an extended period
    of time.  This reduces the impact on the workload and memory
    management.

  memory.pressure
    A read-only nested-keyed file which exists on non-root
    cgroups.

    Shows pressure stall information for memory.  See
    :ref:`Documentation/accounting/psi.rst <psi>` for details.


Usage Guidelines
~~~~~~~~~~~~~~~~

"memory.high" is the main mechanism to control memory usage.
Over-committing on the high limit (sum of high limits > available
memory) and letting global memory pressure distribute memory
according to usage is a viable strategy.

Because a breach of the high limit doesn't trigger the OOM killer
but throttles the offending cgroup, a management agent has ample
opportunities to monitor and take appropriate actions such as
granting more memory or terminating the workload.

Determining whether a cgroup has enough memory is not trivial as
memory usage doesn't indicate whether the workload can benefit from
more memory.  For example, a workload which writes data received
from the network to a file can use all available memory but can also
perform just as well with a small amount of memory.  A measure of
memory pressure - how much the workload is being impacted due to
lack of memory - is necessary to determine whether a workload needs
more memory; unfortunately, a memory pressure monitoring mechanism
isn't implemented yet.


Memory Ownership
~~~~~~~~~~~~~~~~

A memory area is charged to the cgroup which instantiated it and
stays charged to the cgroup until the area is released.  Migrating a
process to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.
A memory area may be used by processes belonging to different
cgroups.  To which cgroup the area will be charged is
non-deterministic; however, over time, the memory area is likely to
end up in a cgroup which has enough memory allowance to avoid high
reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected
to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


IO
--

The "io" controller regulates the distribution of IO resources.
This controller implements both weight based and absolute bandwidth
or IOPS limit distribution; however, weight based distribution is
available only if cfq-iosched is in use and neither scheme is
available for blk-mq devices.


IO Interface Files
~~~~~~~~~~~~~~~~~~

  io.stat
    A read-only nested-keyed file.

    Lines are keyed by $MAJ:$MIN device numbers and not ordered.
    The following nested keys are defined.

      ======  =====================
      rbytes  Bytes read
      wbytes  Bytes written
      rios    Number of read IOs
      wios    Number of write IOs
      dbytes  Bytes discarded
      dios    Number of discard IOs
      ======  =====================

    An example read output follows::

      8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
      8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

  io.cost.qos
    A read-write nested-keyed file which exists only on the root
    cgroup.

    This file configures the Quality of Service of the IO cost
    model based controller (CONFIG_BLK_CGROUP_IOCOST) which
    currently implements "io.weight" proportional control.
    Lines are keyed by $MAJ:$MIN device numbers and not ordered.
    The line for a given device is populated on the first write
    for the device on "io.cost.qos" or "io.cost.model".  The
    following nested keys are defined.
      ======  =====================================
      enable  Weight-based control enable
      ctrl    "auto" or "user"
      rpct    Read latency percentile [0, 100]
      rlat    Read latency threshold
      wpct    Write latency percentile [0, 100]
      wlat    Write latency threshold
      min     Minimum scaling percentage [1, 10000]
      max     Maximum scaling percentage [1, 10000]
      ======  =====================================

    The controller is disabled by default and can be enabled by
    setting "enable" to 1.  "rpct" and "wpct" parameters default
    to zero and the controller uses internal device saturation
    state to adjust the overall IO rate between "min" and "max".

    When a better control quality is needed, latency QoS
    parameters can be configured.  For example::

      8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

    shows that on sdb, the controller is enabled, will consider
    the device saturated if the 95th percentile of read
    completion latencies is above 75ms or the 95th percentile of
    write completion latencies is above 150ms, and will adjust
    the overall IO issue rate between 50% and 150% accordingly.

    The lower the saturation point, the better the latency QoS
    at the cost of aggregate bandwidth.  The narrower the
    allowed adjustment range between "min" and "max", the more
    conformant to the cost model the IO behavior.  Note that the
    IO issue base rate may be far off from 100% and setting
    "min" and "max" blindly can lead to a significant loss of
    device capacity or control quality.  "min" and "max" are
    useful for regulating devices which show wide temporary
    behavior changes - e.g. an SSD which accepts writes at line
    speed for a while and then completely stalls for multiple
    seconds.

    When "ctrl" is "auto", the parameters are controlled by the
    kernel and may change automatically.  Setting "ctrl" to
    "user" or setting any of the percentile and latency
    parameters puts it into "user" mode and disables the
    automatic changes.  The automatic mode can be restored by
    setting "ctrl" to "auto".

  io.cost.model
    A read-write nested-keyed file which exists only on the root
    cgroup.

    This file configures the cost model of the IO cost model
    based controller (CONFIG_BLK_CGROUP_IOCOST) which currently
    implements "io.weight" proportional control.  Lines are
    keyed by $MAJ:$MIN device numbers and not ordered.
    The line for a given device is populated on the first write
    for the device on "io.cost.qos" or "io.cost.model".  The
    following nested keys are defined.

      =====  ================================
      ctrl   "auto" or "user"
      model  The cost model in use - "linear"
      =====  ================================

    When "ctrl" is "auto", the kernel may change all parameters
    dynamically.  When "ctrl" is set to "user" or any other
    parameters are written to, "ctrl" becomes "user" and the
    automatic changes are disabled.

    When "model" is "linear", the following model parameters are
    defined.

      =============  ========================================
      [r|w]bps       The maximum sequential IO throughput
      [r|w]seqiops   The maximum 4k sequential IOs per second
      [r|w]randiops  The maximum 4k random IOs per second
      =============  ========================================

    From the above, the builtin linear model determines the base
    costs of a sequential and random IO and the cost coefficient
    for the IO size.  While simple, this model can cover most
    common device classes acceptably.

    The IO cost model isn't expected to be accurate in an
    absolute sense and is scaled to the device behavior
    dynamically.

    If needed, tools/cgroup/iocost_coef_gen.py can be used to
    generate device-specific coefficients.

  io.weight
    A read-write flat-keyed file which exists on non-root
    cgroups.  The default is "default 100".

    The first line is the default weight applied to devices
    without specific override.  The rest are overrides keyed by
    $MAJ:$MIN device numbers and not ordered.  The weights are
    in the range [1, 10000] and specify the relative amount of
    IO time the cgroup can use in relation to its siblings.

    The default weight can be updated by writing either "default
    $WEIGHT" or simply "$WEIGHT".  Overrides can be set by
    writing "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN
    default".

    An example read output follows::

      default 100
      8:16 200
      8:0 50
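    For example, the default weight can be raised and a
    per-device override set as follows::

      # echo 200 > io.weight
      # echo "8:16 300" > io.weight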
  io.max
    A read-write nested-keyed file which exists on non-root
    cgroups.

    BPS and IOPS based IO limit.  Lines are keyed by $MAJ:$MIN
    device numbers and not ordered.  The following nested keys
    are defined.

      =====  ==================================
      rbps   Max read bytes per second
      wbps   Max write bytes per second
      riops  Max read IO operations per second
      wiops  Max write IO operations per second
      =====  ==================================

    When writing, any number of nested key-value pairs can be
    specified in any order.  "max" can be specified as the value
    to remove a specific limit.  If the same key is specified
    multiple times, the outcome is undefined.

    BPS and IOPS are measured in each IO direction and IOs are
    delayed if the limit is reached.  Temporary bursts are
    allowed.

    Setting the read limit at 2M BPS and the write limit at 120
    IOPS for 8:16::

      echo "8:16 rbps=2097152 wiops=120" > io.max

    Reading returns the following::

      8:16 rbps=2097152 wbps=max riops=max wiops=120

    The write IOPS limit can be removed by writing the
    following::

      echo "8:16 wiops=max" > io.max

    Reading now returns the following::

      8:16 rbps=2097152 wbps=max riops=max wiops=max

  io.pressure
    A read-only nested-keyed file which exists on non-root
    cgroups.

    Shows pressure stall information for IO.  See
    :ref:`Documentation/accounting/psi.rst <psi>` for details.


Writeback
~~~~~~~~~

Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism.  Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.

The io controller, in conjunction with the memory controller,
implements control of page cache writeback IOs.  The memory
controller defines the memory domain that the dirty memory ratio is
calculated and maintained for, and the io controller defines the io
domain which writes out dirty pages for the memory domain.  Both
system-wide and per-cgroup dirty memory states are examined and the
more restrictive of the two is enforced.

cgroup writeback requires explicit support from the underlying
filesystem.  Currently, cgroup writeback is implemented on ext2,
ext4, btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs
are attributed to the root cgroup.
There are inherent differences in memory and writeback management
which affect how cgroup ownership is tracked.  Memory is tracked per
page while writeback is tracked per inode.  For the purpose of
writeback, an inode is assigned to a cgroup and all IO requests to
write dirty pages from the inode are attributed to that cgroup.

As cgroup ownership for memory is tracked per page, there can be
pages which are associated with different cgroups than the one the
inode is associated with.  These are called foreign pages.  The
writeback code constantly keeps track of foreign pages and, if a
particular foreign cgroup becomes the majority over a certain period
of time, switches the ownership of the inode to that cgroup.

While this model is enough for most use cases where a given inode is
mostly dirtied by a single cgroup even when the main writing cgroup
changes over time, use cases where multiple cgroups write to a
single inode simultaneously are not supported well.  In such
circumstances, a significant portion of IOs are likely to be
attributed incorrectly.  As the memory controller assigns page
ownership on the first use and doesn't update it until the page is
released, even if writeback strictly follows page ownership,
multiple cgroups dirtying overlapping areas wouldn't work as
expected.  It's recommended to avoid such usage patterns.

The sysctl knobs which affect writeback behavior are applied to
cgroup writeback as follows.

  vm.dirty_background_ratio, vm.dirty_ratio
    These ratios apply the same to cgroup writeback with the
    amount of available memory capped by limits imposed by the
    memory controller and system-wide clean memory.

  vm.dirty_background_bytes, vm.dirty_bytes
    For cgroup writeback, this is calculated into a ratio
    against total available memory and applied the same way as
    vm.dirty[_background]_ratio.


IO Latency
~~~~~~~~~~

This is a cgroup v2 controller for IO workload protection.  You
provide a group with a latency target, and if the average latency
exceeds that target the controller will throttle any peers that have
a lower latency target than the protected workload.

The limits are only applied at the peer level in the hierarchy.
This means that in the diagram below, only groups A, B, and C will
influence each other, and groups D and F will influence each other.
Group G will influence nobody::

        [root]
        /  |  \
       A   B   C
      / \      |
     D   F     G


So the ideal way to configure this is to set io.latency in groups A,
B, and C.  Generally you do not want to set a value lower than the
latency your device supports.  Experiment to find the value that
works best for your workload.  Start at higher than the expected
latency for your device and watch the avg_lat value in io.stat for
your workload group to get an idea of the latency you see during
normal operation.  Use the avg_lat value as a basis for your real
setting, setting it 10-15% higher than the value in io.stat.

How IO Latency Throttling Works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

io.latency is work conserving; so as long as everybody is meeting
their latency target the controller doesn't do anything.  Once a
group starts missing its target it begins throttling any peer group
that has a higher target than itself.  This throttling takes two
forms:

- Queue depth throttling.  This is the number of outstanding IOs a
  group is allowed to have.  We will clamp down relatively quickly,
  starting at no limit and going all the way down to 1 IO at a time.

- Artificial delay induction.  There are certain types of IO that
  cannot be throttled without possibly adversely affecting higher
  priority groups.  This includes swapping and metadata IO.  These
  types of IO are allowed to occur normally, however they are
  "charged" to the originating group.  If the originating group is
  being throttled you will see the use_delay and delay fields in
  io.stat increase.  The delay value is how many microseconds are
  being added to any process that runs in this group.  Because this
  number can grow quite large if there is a lot of swapping or
  metadata IO occurring, we limit the individual delay events to 1
  second at a time.

Once the victimized group starts meeting its latency target again it
will start unthrottling any peer groups that were throttled
previously.  If the victimized group simply stops doing IO the
global counter will unthrottle appropriately.

IO Latency Interface Files
~~~~~~~~~~~~~~~~~~~~~~~~~~

  io.latency
    This takes a similar format as the other controllers.

      "MAJOR:MINOR target=<target time in microseconds>"
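    For example, to set a 10ms latency target for the workload
    on device 8:16::

      # echo "8:16 target=10000" > io.latency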
  io.stat
    If the controller is enabled you will see extra stats in
    io.stat in addition to the normal ones.

      depth
        This is the current queue depth for the group.

      avg_lat
        This is an exponential moving average with a decay rate
        of 1/exp bound by the sampling interval.  The decay rate
        interval can be calculated by multiplying the win value
        in io.stat by the corresponding number of samples based
        on the win value.

      win
        The sampling window size in milliseconds.  This is the
        minimum duration of time between evaluation events.
        Windows only elapse with IO activity.  Idle periods
        extend the most recent window.

IO Priority
~~~~~~~~~~~

A single attribute controls the behavior of the I/O priority cgroup
policy, namely the blkio.prio.class attribute.  The following values
are accepted for that attribute:

  no-change
    Do not modify the I/O priority class.

  none-to-rt
    For requests that do not have an I/O priority class (NONE),
    change the I/O priority class into RT.  Do not modify the
    I/O priority class of other requests.

  restrict-to-be
    For requests that do not have an I/O priority class or that
    have I/O priority class RT, change it into BE.  Do not
    modify the I/O priority class of requests that have priority
    class IDLE.

  idle
    Change the I/O priority class of all requests into IDLE, the
    lowest I/O priority class.

The following numerical values are associated with the I/O priority
policies:

+----------------+---+
| no-change      | 0 |
+----------------+---+
| none-to-rt     | 1 |
+----------------+---+
| restrict-to-be | 2 |
+----------------+---+
| idle           | 3 |
+----------------+---+

The numerical value that corresponds to each I/O priority class is
as follows:

+-------------------------------+---+
| IOPRIO_CLASS_NONE             | 0 |
+-------------------------------+---+
| IOPRIO_CLASS_RT (real-time)   | 1 |
+-------------------------------+---+
| IOPRIO_CLASS_BE (best effort) | 2 |
+-------------------------------+---+
| IOPRIO_CLASS_IDLE             | 3 |
+-------------------------------+---+
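For example, a sketch of restricting all requests issued by a cgroup
to the best-effort class or below, assuming the attribute is exposed
as a per-cgroup "blkio.prio.class" file as described above::

  # echo restrict-to-be > blkio.prio.class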
The algorithm to set the I/O priority class for a request is as
follows:

- Translate the I/O priority class policy into a number.
- Change the request I/O priority class into the maximum of the I/O
  priority class policy number and the numerical I/O priority class.

PID
---

The process number controller is used to allow a cgroup to stop any
new tasks from being fork()'d or clone()'d after a specified limit
is reached.

The number of tasks in a cgroup can be exhausted in ways which other
controllers cannot prevent, thus warranting its own controller.  For
example, a fork bomb is likely to exhaust the number of tasks before
hitting memory restrictions.

Note that PIDs used in this controller refer to TIDs, process IDs as
used by the kernel.


PID Interface Files
~~~~~~~~~~~~~~~~~~~

  pids.max
    A read-write single value file which exists on non-root
    cgroups.  The default is "max".

    Hard limit of number of processes.

  pids.current
    A read-only single value file which exists on all cgroups.

    The number of processes currently in the cgroup and its
    descendants.

Organisational operations are not blocked by cgroup policies, so it
is possible to have pids.current > pids.max.  This can be done by
either setting the limit to be smaller than pids.current, or
attaching enough processes to the cgroup such that pids.current is
larger than pids.max.  However, it is not possible to violate a
cgroup PID policy through fork() or clone().  These will return
-EAGAIN if the creation of a new process would cause a cgroup policy
to be violated.
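For example, capping a cgroup at 32 tasks and checking the current
count (path and count are hypothetical)::

  # echo 32 > /sys/fs/cgroup/workload/pids.max
  # cat /sys/fs/cgroup/workload/pids.current
  3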
Cpuset
------

The "cpuset" controller provides a mechanism for constraining the
CPU and memory node placement of tasks to only the resources
specified in the cpuset interface files in a task's current cgroup.
This is especially valuable on large NUMA systems where placing jobs
on properly sized subsets of the systems with careful processor and
memory placement to reduce cross-node memory access and contention
can improve overall system performance.

The "cpuset" controller is hierarchical.  That means the controller
cannot use CPUs or memory nodes not allowed in its parent.


Cpuset Interface Files
~~~~~~~~~~~~~~~~~~~~~~

  cpuset.cpus
    A read-write multiple values file which exists on non-root
    cpuset-enabled cgroups.

    It lists the requested CPUs to be used by tasks within this
    cgroup.  The actual list of CPUs to be granted, however, is
    subject to constraints imposed by its parent and can differ
    from the requested CPUs.

    The CPU numbers are comma-separated numbers or ranges.  For
    example::

      # cat cpuset.cpus
      0-4,6,8-10

    An empty value indicates that the cgroup is using the same
    setting as the nearest cgroup ancestor with a non-empty
    "cpuset.cpus" or all the available CPUs if none is found.

    The value of "cpuset.cpus" stays constant until the next
    update and won't be affected by any CPU hotplug events.

  cpuset.cpus.effective
    A read-only multiple values file which exists on all
    cpuset-enabled cgroups.

    It lists the onlined CPUs that are actually granted to this
    cgroup by its parent.  These CPUs are allowed to be used by
    tasks within the current cgroup.

    If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file
    shows all the CPUs from the parent cgroup that are available
    to be used by this cgroup.  Otherwise, it should be a subset
    of "cpuset.cpus" unless none of the CPUs listed in
    "cpuset.cpus" can be granted.  In this case, it will be
    treated just like an empty "cpuset.cpus".

    Its value will be affected by CPU hotplug events.

  cpuset.mems
    A read-write multiple values file which exists on non-root
    cpuset-enabled cgroups.

    It lists the requested memory nodes to be used by tasks
    within this cgroup.  The actual list of memory nodes
    granted, however, is subject to constraints imposed by its
    parent and can differ from the requested memory nodes.

    The memory node numbers are comma-separated numbers or
    ranges.  For example::

      # cat cpuset.mems
      0-1,3

    An empty value indicates that the cgroup is using the same
    setting as the nearest cgroup ancestor with a non-empty
    "cpuset.mems" or all the available memory nodes if none is
    found.
2022*4882a593Smuzhiyun 2023*4882a593Smuzhiyun The value of "cpuset.mems" stays constant until the next update 2024*4882a593Smuzhiyun and won't be affected by any memory nodes hotplug events. 2025*4882a593Smuzhiyun 2026*4882a593Smuzhiyun cpuset.mems.effective 2027*4882a593Smuzhiyun A read-only multiple values file which exists on all 2028*4882a593Smuzhiyun cpuset-enabled cgroups. 2029*4882a593Smuzhiyun 2030*4882a593Smuzhiyun It lists the onlined memory nodes that are actually granted to 2031*4882a593Smuzhiyun this cgroup by its parent. These memory nodes are allowed to 2032*4882a593Smuzhiyun be used by tasks within the current cgroup. 2033*4882a593Smuzhiyun 2034*4882a593Smuzhiyun If "cpuset.mems" is empty, it shows all the memory nodes from the 2035*4882a593Smuzhiyun parent cgroup that will be available to be used by this cgroup. 2036*4882a593Smuzhiyun Otherwise, it should be a subset of "cpuset.mems" unless none of 2037*4882a593Smuzhiyun the memory nodes listed in "cpuset.mems" can be granted. In this 2038*4882a593Smuzhiyun case, it will be treated just like an empty "cpuset.mems". 2039*4882a593Smuzhiyun 2040*4882a593Smuzhiyun Its value will be affected by memory nodes hotplug events. 2041*4882a593Smuzhiyun 2042*4882a593Smuzhiyun cpuset.cpus.partition 2043*4882a593Smuzhiyun A read-write single value file which exists on non-root 2044*4882a593Smuzhiyun cpuset-enabled cgroups. This flag is owned by the parent cgroup 2045*4882a593Smuzhiyun and is not delegatable. 2046*4882a593Smuzhiyun 2047*4882a593Smuzhiyun It accepts only the following input values when written to. 2048*4882a593Smuzhiyun 2049*4882a593Smuzhiyun "root" - a partition root 2050*4882a593Smuzhiyun "member" - a non-root member of a partition 2051*4882a593Smuzhiyun 2052*4882a593Smuzhiyun When set to be a partition root, the current cgroup is the 2053*4882a593Smuzhiyun root of a new partition or scheduling domain that comprises 2054*4882a593Smuzhiyun itself and all its descendants except those that are separate 2055*4882a593Smuzhiyun partition roots themselves and their descendants. The root 2056*4882a593Smuzhiyun cgroup is always a partition root. 2057*4882a593Smuzhiyun 2058*4882a593Smuzhiyun There are constraints on where a partition root can be set. 2059*4882a593Smuzhiyun It can only be set in a cgroup if all the following conditions 2060*4882a593Smuzhiyun are true. 2061*4882a593Smuzhiyun 2062*4882a593Smuzhiyun 1) The "cpuset.cpus" is not empty and the list of CPUs are 2063*4882a593Smuzhiyun exclusive, i.e. they are not shared by any of its siblings. 2064*4882a593Smuzhiyun 2) The parent cgroup is a partition root. 2065*4882a593Smuzhiyun 3) The "cpuset.cpus" is also a proper subset of the parent's 2066*4882a593Smuzhiyun "cpuset.cpus.effective". 2067*4882a593Smuzhiyun 4) There is no child cgroups with cpuset enabled. This is for 2068*4882a593Smuzhiyun eliminating corner cases that have to be handled if such a 2069*4882a593Smuzhiyun condition is allowed. 2070*4882a593Smuzhiyun 2071*4882a593Smuzhiyun Setting it to partition root will take the CPUs away from the 2072*4882a593Smuzhiyun effective CPUs of the parent cgroup. Once it is set, this 2073*4882a593Smuzhiyun file cannot be reverted back to "member" if there are any child 2074*4882a593Smuzhiyun cgroups with cpuset enabled. 2075*4882a593Smuzhiyun 2076*4882a593Smuzhiyun A parent partition cannot distribute all its CPUs to its 2077*4882a593Smuzhiyun child partitions. There must be at least one cpu left in the 2078*4882a593Smuzhiyun parent partition. 
2079*4882a593Smuzhiyun 2080*4882a593Smuzhiyun Once becoming a partition root, changes to "cpuset.cpus" is 2081*4882a593Smuzhiyun generally allowed as long as the first condition above is true, 2082*4882a593Smuzhiyun the change will not take away all the CPUs from the parent 2083*4882a593Smuzhiyun partition and the new "cpuset.cpus" value is a superset of its 2084*4882a593Smuzhiyun children's "cpuset.cpus" values. 2085*4882a593Smuzhiyun 2086*4882a593Smuzhiyun Sometimes, external factors like changes to ancestors' 2087*4882a593Smuzhiyun "cpuset.cpus" or cpu hotplug can cause the state of the partition 2088*4882a593Smuzhiyun root to change. On read, the "cpuset.sched.partition" file 2089*4882a593Smuzhiyun can show the following values. 2090*4882a593Smuzhiyun 2091*4882a593Smuzhiyun "member" Non-root member of a partition 2092*4882a593Smuzhiyun "root" Partition root 2093*4882a593Smuzhiyun "root invalid" Invalid partition root 2094*4882a593Smuzhiyun 2095*4882a593Smuzhiyun It is a partition root if the first 2 partition root conditions 2096*4882a593Smuzhiyun above are true and at least one CPU from "cpuset.cpus" is 2097*4882a593Smuzhiyun granted by the parent cgroup. 2098*4882a593Smuzhiyun 2099*4882a593Smuzhiyun A partition root can become invalid if none of CPUs requested 2100*4882a593Smuzhiyun in "cpuset.cpus" can be granted by the parent cgroup or the 2101*4882a593Smuzhiyun parent cgroup is no longer a partition root itself. In this 2102*4882a593Smuzhiyun case, it is not a real partition even though the restriction 2103*4882a593Smuzhiyun of the first partition root condition above will still apply. 2104*4882a593Smuzhiyun The cpu affinity of all the tasks in the cgroup will then be 2105*4882a593Smuzhiyun associated with CPUs in the nearest ancestor partition. 2106*4882a593Smuzhiyun 2107*4882a593Smuzhiyun An invalid partition root can be transitioned back to a 2108*4882a593Smuzhiyun real partition root if at least one of the requested CPUs 2109*4882a593Smuzhiyun can now be granted by its parent. In this case, the cpu 2110*4882a593Smuzhiyun affinity of all the tasks in the formerly invalid partition 2111*4882a593Smuzhiyun will be associated to the CPUs of the newly formed partition. 2112*4882a593Smuzhiyun Changing the partition state of an invalid partition root to 2113*4882a593Smuzhiyun "member" is always allowed even if child cpusets are present. 2114*4882a593Smuzhiyun 2115*4882a593Smuzhiyun 2116*4882a593SmuzhiyunDevice controller 2117*4882a593Smuzhiyun----------------- 2118*4882a593Smuzhiyun 2119*4882a593SmuzhiyunDevice controller manages access to device files. It includes both 2120*4882a593Smuzhiyuncreation of new device files (using mknod), and access to the 2121*4882a593Smuzhiyunexisting device files. 2122*4882a593Smuzhiyun 2123*4882a593SmuzhiyunCgroup v2 device controller has no interface files and is implemented 2124*4882a593Smuzhiyunon top of cgroup BPF. To control access to device files, a user may 2125*4882a593Smuzhiyuncreate bpf programs of the BPF_CGROUP_DEVICE type and attach them 2126*4882a593Smuzhiyunto cgroups. On an attempt to access a device file, corresponding 2127*4882a593SmuzhiyunBPF programs will be executed, and depending on the return value 2128*4882a593Smuzhiyunthe attempt will succeed or fail with -EPERM. 
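As an illustrative sketch, carving a two-CPU partition out of the root
cgroup on an assumed four-CPU system might look as follows; the cgroup
name "isolated" is hypothetical and cpuset must already be enabled in
the parent's "cgroup.subtree_control"::

  # cd /sys/fs/cgroup
  # mkdir isolated
  # echo 2-3 > isolated/cpuset.cpus
  # echo root > isolated/cpuset.cpus.partition
  # cat isolated/cpuset.cpus.partition
  root
  # cat cpuset.cpus.effective
  0-1

The CPUs granted to the new partition are removed from the parent's
effective CPUs, as described above.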
Device controller
-----------------

The device controller manages access to device files.  It includes
both creation of new device files (using mknod), and access to the
existing device files.

The cgroup v2 device controller has no interface files and is
implemented on top of cgroup BPF.  To control access to device files,
a user may create BPF programs of the BPF_CGROUP_DEVICE type and
attach them to cgroups.  On an attempt to access a device file,
corresponding BPF programs will be executed, and depending on the
return value the attempt will succeed or fail with -EPERM.

A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
structure, which describes the device access attempt: access type
(mknod/read/write) and device (type, major and minor numbers).
If the program returns 0, the attempt fails with -EPERM, otherwise
it succeeds.

An example of a BPF_CGROUP_DEVICE program may be found in the kernel
source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
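As a sketch of one way such a program might be wired up with
bpftool(8), assuming a program compiled from C source such as the
selftest above (the object and pin paths are hypothetical)::

  # bpftool prog load dev_cgroup.o /sys/fs/bpf/dev_cgroup_prog
  # bpftool cgroup attach /sys/fs/cgroup/mygrp device \
        pinned /sys/fs/bpf/dev_cgroup_prog
  # bpftool cgroup show /sys/fs/cgroup/mygrp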
RDMA
----

The "rdma" controller regulates the distribution and accounting of
RDMA resources.

RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~

  rdma.max
	A read-write nested-keyed file that exists for all the cgroups
	except root that describes the current configured resource
	limit for an RDMA/IB device.

	Lines are keyed by device name and are not ordered.
	Each line contains space separated resource name and its
	configured limit that can be distributed.

	The following nested keys are defined.

	  ==========	=============================
	  hca_handle	Maximum number of HCA Handles
	  hca_object 	Maximum number of HCA Objects
	  ==========	=============================

	An example for mlx4 and ocrdma device follows::

	  mlx4_0 hca_handle=2 hca_object=2000
	  ocrdma1 hca_handle=3 hca_object=max

  rdma.current
	A read-only file that describes current resource usage.
	It exists for all the cgroups except root.

	An example for mlx4 and ocrdma device follows::

	  mlx4_0 hca_handle=1 hca_object=20
	  ocrdma1 hca_handle=1 hca_object=23

HugeTLB
-------

The HugeTLB controller allows limiting the HugeTLB usage per control
group and enforces the controller limit during page fault.

HugeTLB Interface Files
~~~~~~~~~~~~~~~~~~~~~~~

  hugetlb.<hugepagesize>.current
	Show current usage for "hugepagesize" hugetlb.  It exists for
	all the cgroups except root.

  hugetlb.<hugepagesize>.max
	Set/show the hard limit of "hugepagesize" hugetlb usage.
	The default value is "max".  It exists for all the cgroups
	except root.

  hugetlb.<hugepagesize>.events
	A read-only flat-keyed file which exists on non-root cgroups.

	  max
		The number of allocation failures due to the HugeTLB
		limit

  hugetlb.<hugepagesize>.events.local
	Similar to hugetlb.<hugepagesize>.events but the fields in the
	file are local to the cgroup i.e. not hierarchical.  The file
	modified event generated on this file reflects only the local
	events.

Misc
----

perf_event
~~~~~~~~~~

perf_event controller, if not mounted on a legacy hierarchy, is
automatically enabled on the v2 hierarchy so that perf events can
always be filtered by cgroup v2 path.  The controller can still be
moved to a legacy hierarchy after v2 hierarchy is populated.


Non-normative information
-------------------------

This section contains information that isn't considered to be a part of
the stable kernel API and so is subject to change.


CPU controller root cgroup process behaviour
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When distributing CPU cycles in the root cgroup each thread in this
cgroup is treated as if it was hosted in a separate child cgroup of the
root cgroup.  This child cgroup's weight is dependent on its thread's
nice level.

For details of this mapping see the sched_prio_to_weight array in the
kernel/sched/core.c file (values from this array should be scaled
appropriately so the neutral - nice 0 - value is 100 instead of 1024).
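For a rough feel of the scaling, taking a few array values for
illustration (see the source file for the authoritative table), the
per-thread weights work out approximately as::

  nice -10:  9548 * 100 / 1024 ~= 932
  nice   0:  1024 * 100 / 1024  = 100
  nice  10:   110 * 100 / 1024 ~=  11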
IO controller root cgroup process behaviour
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Root cgroup processes are hosted in an implicit leaf child node.
When distributing IO resources this implicit child node is taken into
account as if it was a normal child cgroup of the root cgroup with a
weight value of 200.


Namespace
=========

Basics
------

cgroup namespace provides a mechanism to virtualize the view of the
"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
flag can be used with clone(2) and unshare(2) to create a new cgroup
namespace.  The process running inside the cgroup namespace will have
its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
cgroupns root is the cgroup of the process at the time of creation of
the cgroup namespace.

Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
complete path of the cgroup of a process.  In a container setup where
a set of cgroups and namespaces are intended to isolate processes the
"/proc/$PID/cgroup" file may leak potential system level information
to the isolated processes.  For example::

  # cat /proc/self/cgroup
  0::/batchjobs/container_id1

The path '/batchjobs/container_id1' can be considered as system-data
and undesirable to expose to the isolated processes.  cgroup namespace
can be used to restrict visibility of this path.  For example, before
creating a cgroup namespace, one would see::

  # ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  # cat /proc/self/cgroup
  0::/batchjobs/container_id1

After unsharing a new namespace, the view changes::

  # ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # cat /proc/self/cgroup
  0::/

When some thread from a multi-threaded process unshares its cgroup
namespace, the new cgroupns gets applied to the entire process (all
the threads).  This is natural for the v2 hierarchy; however, for the
legacy hierarchies, this may be unexpected.

A cgroup namespace is alive as long as there are processes inside or
mounts pinning it.  When the last usage goes away, the cgroup
namespace is destroyed.  The cgroupns root and the actual cgroups
remain.


The Root and Views
------------------

The 'cgroupns root' for a cgroup namespace is the cgroup in which the
process calling unshare(2) is running.  For example, if a process in
/batchjobs/container_id1 cgroup calls unshare, cgroup
/batchjobs/container_id1 becomes the cgroupns root.  For the
init_cgroup_ns, this is the real root ('/') cgroup.

The cgroupns root cgroup does not change even if the namespace creator
process later moves to a different cgroup::

  # ~/unshare -c # unshare cgroupns in some cgroup
  # cat /proc/self/cgroup
  0::/
  # mkdir sub_cgrp_1
  # echo 0 > sub_cgrp_1/cgroup.procs
  # cat /proc/self/cgroup
  0::/sub_cgrp_1

Each process gets its namespace-specific view of "/proc/$PID/cgroup".

Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
From within an unshared cgroupns::

  # sleep 100000 &
  [1] 7353
  # echo 7353 > sub_cgrp_1/cgroup.procs
  # cat /proc/7353/cgroup
  0::/sub_cgrp_1

From the initial cgroup namespace, the real cgroup path will be
visible::

  $ cat /proc/7353/cgroup
  0::/batchjobs/container_id1/sub_cgrp_1

From a sibling cgroup namespace (that is, a namespace rooted at a
different cgroup), the cgroup path relative to its own cgroup
namespace root will be shown.  For instance, if PID 7353's cgroup
namespace root is at '/batchjobs/container_id2', then it will see::

  # cat /proc/7353/cgroup
  0::/../container_id2/sub_cgrp_1

Note that the relative path always starts with '/' to indicate that
it is relative to the cgroup namespace root of the caller.


Migration and setns(2)
----------------------

Processes inside a cgroup namespace can move into and out of the
namespace root if they have proper access to external cgroups.  For
example, from inside a namespace with cgroupns root at
/batchjobs/container_id1, and assuming that the global hierarchy is
still accessible inside cgroupns::

  # cat /proc/7353/cgroup
  0::/sub_cgrp_1
  # echo 7353 > batchjobs/container_id2/cgroup.procs
  # cat /proc/7353/cgroup
  0::/../container_id2

Note that this kind of setup is not encouraged.  A task inside cgroup
namespace should only be exposed to its own cgroupns hierarchy.

setns(2) to another cgroup namespace is allowed when:

(a) the process has CAP_SYS_ADMIN against its current user namespace
(b) the process has CAP_SYS_ADMIN against the target cgroup
    namespace's userns

No implicit cgroup changes happen with attaching to another cgroup
namespace.  It is expected that someone moves the attaching process
under the target cgroup namespace root.
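For instance, a privileged process could enter the cgroup namespace of
PID 7353 from the examples above via nsenter(1); a sketch, assuming
the capabilities listed above are in place::

  # nsenter --target 7353 --cgroup sh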
Interaction with Other Namespaces
---------------------------------

Namespace specific cgroup hierarchy can be mounted by a process
running inside a non-init cgroup namespace::

  # mount -t cgroup2 none $MOUNT_POINT

This will mount the unified cgroup hierarchy with cgroupns root as the
filesystem root.  The process needs CAP_SYS_ADMIN against its user and
mount namespaces.

The virtualization of the /proc/self/cgroup file combined with
restricting the view of the cgroup hierarchy by namespace-private
cgroupfs mount provides a properly isolated cgroup view inside the
container.


Information on Kernel Programming
=================================

This section contains kernel programming information in the areas
where interacting with cgroup is necessary.  cgroup core and
controllers are not covered.


Filesystem Support for Writeback
--------------------------------

A filesystem can support cgroup writeback by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.

  wbc_init_bio(@wbc, @bio)
	Should be called for each bio carrying writeback data and
	associates the bio with the inode's owner cgroup and the
	corresponding request queue.  This must be called after
	a queue (device) has been associated with the bio and
	before submission.

  wbc_account_cgroup_owner(@wbc, @page, @bytes)
	Should be called for each data segment being written out.
	While this function doesn't care exactly when it's called
	during the writeback session, it's the easiest and most
	natural to call it as data segments are added to a bio.

With writeback bio's annotated, cgroup support can be enabled per
super_block by setting SB_I_CGROUPWB in ->s_iflags.  This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.

wbc_init_bio() binds the specified bio to its cgroup.  Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion.  There is no one easy solution
for the problem.  Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.


Deprecated v1 Core Features
===========================

- Multiple hierarchies including named ones are not supported.

- None of the v1 mount options is supported.

- The "tasks" file is removed and "cgroup.procs" is not sorted.

- "cgroup.clone_children" is removed.

- /proc/cgroups is meaningless for v2.  Use the "cgroup.controllers"
  file at the root instead, as in the example below.
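For example (the controller list shown here is illustrative and
depends on the kernel configuration)::

  # cat /sys/fs/cgroup/cgroup.controllers
  cpuset cpu io memory hugetlb pids rdma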
Issues with v1 and Rationales for v2
====================================

Multiple Hierarchies
--------------------

cgroup v1 allowed an arbitrary number of hierarchies and each
hierarchy could host any number of controllers.  While this seemed to
provide a high level of flexibility, it wasn't useful in practice.

For example, as there is only one instance of each controller, utility
type controllers such as freezer which can be useful in all
hierarchies could only be used in one.  The issue is exacerbated by
the fact that controllers couldn't be moved to another hierarchy once
hierarchies were populated.  Another issue was that all controllers
bound to a hierarchy were forced to have exactly the same view of the
hierarchy.  It wasn't possible to vary the granularity depending on
the specific controller.

In practice, these issues heavily limited which controllers could be
put on the same hierarchy and most configurations resorted to putting
each controller on its own hierarchy.  Only closely related ones, such
as the cpu and cpuacct controllers, made sense to be put on the same
hierarchy.  This often meant that userland ended up managing multiple
similar hierarchies repeating the same steps on each hierarchy
whenever a hierarchy management operation was necessary.

Furthermore, support for multiple hierarchies came at a steep cost.
It greatly complicated cgroup core implementation but more importantly
the support for multiple hierarchies restricted how cgroup could be
used in general and what controllers were able to do.

There was no limit on how many hierarchies there might be, which meant
that a thread's cgroup membership couldn't be described in finite
length.  The key might contain any number of entries and was unlimited
in length, which made it highly awkward to manipulate and led to the
addition of controllers which existed only to identify membership,
which in turn exacerbated the original problem of the proliferating
number of hierarchies.

Also, as a controller couldn't have any expectation regarding the
topologies of hierarchies other controllers might be on, each
controller had to assume that all other controllers were attached to
completely orthogonal hierarchies.  This made it impossible, or at
least very cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary.  What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller.  In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers.  For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.


Thread Granularity
------------------

cgroup v1 allowed threads of a process to belong to different cgroups.
This didn't make sense for some controllers and those controllers
ended up implementing different ways to ignore such situations but
much more importantly it blurred the line between API exposed to
individual applications and system management interface.

Generally, in-process knowledge is available only to the process
itself; thus, unlike service-level organization of processes,
categorizing threads of a process requires active participation from
the application which owns the target process.

cgroup v1 had an ambiguously defined delegation model which got abused
in combination with thread granularity.  cgroups were delegated to
individual applications so that they can create and manage their own
sub-hierarchies and control resource distributions along them.  This
effectively raised cgroup to the status of a syscall-like API exposed
to lay programs.
First of all, cgroup has a fundamentally inadequate interface to be
exposed this way.  For a process to access its own knobs, it has to
extract the path on the target hierarchy from /proc/self/cgroup,
construct the path by appending the name of the knob to the path, open
and then read and/or write to it.  This is not only extremely clunky
and unusual but also inherently racy.  There is no conventional way to
define transaction across the required steps and nothing can guarantee
that the process would actually be operating on its own sub-hierarchy.

cgroup controllers implemented a number of knobs which would never be
accepted as public APIs because they were just adding control knobs to
system-management pseudo filesystem.  cgroup ended up with interface
knobs which were not properly abstracted or refined and directly
revealed kernel internal details.  These knobs got exposed to
individual applications through the ill-defined delegation mechanism,
effectively abusing cgroup as a shortcut to implementing public APIs
without going through the required scrutiny.

This was painful for both userland and kernel.  Userland ended up
with misbehaving and poorly abstracted interfaces and the kernel
inadvertently exposed and became locked into constructs.


Competition Between Inner Nodes and Threads
-------------------------------------------

cgroup v1 allowed threads to be in any cgroups which created an
interesting problem where threads belonging to a parent cgroup and its
children cgroups competed for resources.  This was nasty as two
different types of entities competed and there was no obvious way to
settle it.  Different controllers did different things.

The cpu controller considered threads and cgroups as equivalents and
mapped nice levels to cgroup weights.  This worked for some cases but
fell flat when children wanted to be allocated specific ratios of CPU
cycles and the number of internal threads fluctuated - the ratios
constantly changed as the number of competing entities fluctuated.
There also were other issues.  The mapping from nice level to weight
wasn't obvious or universal, and there were various other knobs which
simply weren't available for threads.

The io controller implicitly created a hidden leaf node for each
cgroup to host the threads.  The hidden leaf had its own copies of all
the knobs with ``leaf_`` prefixed.  While this allowed equivalent
control over internal threads, it came with serious drawbacks.  It
always added an extra layer of nesting which wouldn't be necessary
otherwise, made the interface messy and significantly complicated the
implementation.

The memory controller didn't have a way to control what happened
between internal tasks and child cgroups and the behavior was not
clearly defined.  There were attempts to add ad-hoc behaviors and
knobs to tailor the behavior to specific workloads which would have
led to problems extremely difficult to resolve in the long term.

Multiple controllers struggled with internal tasks and came up with
different ways to deal with it; unfortunately, all the approaches were
severely flawed and, furthermore, the widely different behaviors
made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core
in a uniform way.


Other Interface Issues
----------------------

cgroup v1 grew without oversight and developed a large number of
idiosyncrasies and inconsistencies.  One issue on the cgroup core side
was how an empty cgroup was notified - a userland helper binary was
forked and executed for each event.  The event delivery wasn't
recursive or delegatable.  The limitations of the mechanism also led
to in-kernel event delivery filtering mechanism further complicating
the interface.

Controller interfaces were problematic too.  An extreme example is
controllers completely ignoring hierarchical organization and treating
all cgroups as if they were all located directly under the root
cgroup.  Some controllers exposed a large amount of inconsistent
implementation details to userland.

There also was no consistency across controllers.  When a new cgroup
was created, some controllers defaulted to not imposing extra
restrictions while others disallowed any resource usage until
explicitly configured.  Configuration knobs for the same type of
control used widely differing naming schemes and formats.  Statistics
and information knobs were named arbitrarily and used different
formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates
controllers so that they expose minimal and consistent interfaces.
Controller Issues and Remedies
------------------------------

Memory
~~~~~~

The original lower boundary, the soft limit, is defined as a limit
that is per default unset.  As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out.  The costs for
optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide the
basic desirable behavior.  First off, the soft limit has no
hierarchical meaning.  All configured groups are organized in a global
rbtree and treated like equal peers, regardless of where they are
located in the hierarchy.  This makes subtree delegation impossible.
Second, the soft limit reclaim pass is so aggressive that it not just
introduces high allocation latencies into the system, but also impacts
system performance due to overreclaim, to the point where the feature
becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated
reserve.  A cgroup enjoys reclaim protection when it's within its
effective low, which makes delegation of subtrees possible.  It also
enjoys having reclaim pressure proportional to its overage when
above its effective low.

The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of the
available memory.  The memory consumption of workloads varies during
runtime, and that requires users to overcommit.  But doing that with a
strict upper limit requires either a fairly accurate prediction of the
working set size or adding slack to the limit.  Since working set size
estimation is hard and error prone, and getting it wrong results in
OOM kills, most users tend to err on the side of a looser limit and
end up wasting precious resources.

The memory.high boundary on the other hand can be set much more
conservatively.  When hit, it throttles allocations by forcing them
into direct reclaim to work off the excess, but it never invokes the
OOM killer.  As a result, a high boundary that is chosen too
aggressively will not terminate the processes, but instead it will
lead to gradual performance degradation.  The user can monitor this
and make corrections until the minimal memory footprint that still
gives acceptable performance is found.

In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary can
be exceeded.  But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of the
system than killing the group.  Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.

Setting the original memory.limit_in_bytes below the current usage was
subject to a race condition, where concurrent charges could cause the
limit setting to fail.  memory.max on the other hand will first set
the limit to prevent new charges, and then reclaim and OOM kill until
the new limit is met - or the task writing to memory.max is killed.

The combined memory+swap accounting and limiting is replaced by real
control over swap space.

The main argument for a combined memory+swap facility in the original
cgroup design was that global or parental pressure would always be
able to swap all anonymous memory of a child group, regardless of the
child's own (possibly untrusted) configuration.  However, untrusted
groups can sabotage swapping by other means - such as referencing
their anonymous memory in a tight loop - and an admin can not assume
full swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an
intuitive userspace interface, and it flies in the face of the idea
that cgroup controllers should account and limit specific physical
resources.  Swap space is a resource like all others in the system,
and that's why unified hierarchy allows distributing it separately.