Documentation/admin-guide/cgroup-v2.rst

9 conventions of cgroup v2.  It describes all userland-visible aspects
10 of cgroup including core and specific controller behaviors.  All
12 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
18      1-2. What is cgroup?
70        5-N-1. CPU controller root cgroup process behaviour
71        5-N-2. IO controller root cgroup process behaviour
95 "cgroup" stands for "control group" and is never capitalized.  The
97 qualifier as in "cgroup controllers".  When explicitly referring to
101 What is cgroup?
104 cgroup is a mechanism to organize processes hierarchically and
108 cgroup is largely composed of two parts - the core and controllers.
109 cgroup core is primarily responsible for hierarchically organizing
110 processes.  A cgroup controller is usually responsible for
116 to one and only one cgroup.  All threads of a process belong to the
117 same cgroup.  On creation, all processes are put in the cgroup that
119 to another cgroup.  Migration of a process doesn't affect already
123 disabled selectively on a cgroup.  All controller behaviors are
124 hierarchical - if a controller is enabled on a cgroup, it affects all
126 sub-hierarchy of the cgroup.  When a controller is enabled on a nested
127 cgroup, it always restricts the resource distribution further.  The
138 Unlike v1, cgroup v2 has only a single hierarchy.  The cgroup v2
151 is no longer referenced in its current hierarchy.  Because per-cgroup
168 automount the v1 cgroup filesystem and so hijack all controllers
173 cgroup v2 currently supports the following mount options.
177 	Consider cgroup namespaces as delegation boundaries.  This
185         Only populate memory.events with data for the current cgroup,
210 Initially, only the root cgroup exists to which all processes belong.
211 A child cgroup can be created by creating a sub-directory::
215 A given cgroup may have multiple child cgroups forming a tree
216 structure.  Each cgroup has a read-writable interface file
217 "cgroup.procs".  When read, it lists the PIDs of all processes which
218 belong to the cgroup one-per-line.  The PIDs are not ordered and the
220 another cgroup and then back or the PID got recycled while reading.
222 A process can be migrated into a cgroup by writing its PID to the
223 target cgroup's "cgroup.procs" file.  Only one process can be migrated
229 cgroup that the forking process belongs to at the time of the
230 operation.  After exit, a process stays associated with the cgroup
232 zombie process does not appear in "cgroup.procs" and thus can't be
233 moved to another cgroup.
235 A cgroup which doesn't have any children or live processes can be
236 destroyed by removing the directory.  Note that a cgroup which doesn't
242 "/proc/$PID/cgroup" lists a process's cgroup membership.  If legacy
243 cgroup is in use in the system, this file may contain multiple lines,
244 one for each hierarchy.  The entry for cgroup v2 is always in the
247   # cat /proc/842/cgroup
249   0::/test-cgroup/test-cgroup-nested
251 If the process becomes a zombie and the cgroup it was associated with
254   # cat /proc/842/cgroup
256   0::/test-cgroup/test-cgroup-nested (deleted)
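The "0::$PATH" entry shown above can be picked out of "/proc/$PID/cgroup" with a few lines of shell. The following is only a sketch: the function names cgroup_v2_path and cgroup_v2_deleted are made up for illustration, and the input is assumed to be the file's contents on stdin.

```shell
# Print the cgroup v2 path from /proc/$PID/cgroup-style input on stdin.
# The v2 entry is the line starting with "0::"; any " (deleted)" suffix
# is kept as part of the printed path.
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Succeed if the v2 entry on stdin carries the " (deleted)" suffix.
cgroup_v2_deleted() {
    case "$(cgroup_v2_path)" in
        *' (deleted)') return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a Linux system:
#   cgroup_v2_path < /proc/self/cgroup
```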
262 cgroup v2 supports thread granularity for a subset of controllers to
265 process belong to the same cgroup, which also serves as the resource
273 Marking a cgroup threaded makes it join the resource domain of its
274 parent as a threaded cgroup.  The parent may be another threaded
275 cgroup whose resource domain is further up in the hierarchy.  The root
285 As the threaded domain cgroup hosts all the domain resource
289 root cgroup is not subject to the "no internal process" constraint, it can
292 The current operation mode or type of the cgroup is shown in the
293 "cgroup.type" file which indicates whether the cgroup is a normal
295 or a threaded cgroup.
297 On creation, a cgroup is always a domain cgroup and can be made
298 threaded by writing "threaded" to the "cgroup.type" file.  The
301   # echo threaded > cgroup.type
303 Once threaded, the cgroup can't be made a domain again.  To enable the
306 - As the cgroup will join the parent's resource domain, the parent
307   must either be a valid (threaded) domain or a threaded cgroup.
313 Topology-wise, a cgroup can be in an invalid state.  Please consider
320 threaded cgroup.  "cgroup.type" file will report "domain (invalid)" in
324 A domain cgroup is turned into a threaded domain when one of its child
325 cgroups becomes threaded or threaded controllers are enabled in the
326 "cgroup.subtree_control" file while there are processes in the cgroup.
330 When read, "cgroup.threads" contains the list of the thread IDs of all
331 threads in the cgroup.  Except that the operations are per-thread
332 instead of per-process, "cgroup.threads" has the same format and
333 behaves the same way as "cgroup.procs".  While "cgroup.threads" can be
334 written to in any cgroup, as it can only move threads inside the same
338 The threaded domain cgroup serves as the resource domain for the whole
340 all the processes are considered to be in the threaded domain cgroup.
341 "cgroup.procs" in a threaded domain cgroup contains the PIDs of all
343 However, "cgroup.procs" can be written to from anywhere in the subtree
344 to migrate all threads of the matching process to the cgroup.
349 threads in the cgroup and its descendants.  All consumptions which
350 aren't tied to a specific thread belong to the threaded domain cgroup.
354 between threads in a non-leaf cgroup and its child cgroups.  Each
361 Each non-root cgroup has a "cgroup.events" file which contains
362 "populated" field indicating whether the cgroup's sub-hierarchy has
364 the cgroup and its descendants; otherwise, 1.  poll and [id]notify
370 in each cgroup::
377 file modified events will be generated on the "cgroup.events" files of
387 Each cgroup has a "cgroup.controllers" file which lists all
388 controllers available for the cgroup to enable::
390   # cat cgroup.controllers
394 disabled by writing to the "cgroup.subtree_control" file::
396   # echo "+cpu +memory -io" > cgroup.subtree_control
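Before writing such a request, a script may want to check it against the contents of "cgroup.controllers", since only listed controllers can be enabled. A minimal sketch follows. The helper name subtree_enable_ok is hypothetical, and it only checks "+" tokens against the available list, treating "-" tokens as a syntax check only.

```shell
# $1: space-separated controllers, as read from cgroup.controllers
# $2: request string for cgroup.subtree_control, e.g. "+cpu +memory -io"
# Succeeds if every token is "+NAME" or "-NAME" and every "+NAME" names
# a controller present in $1.
subtree_enable_ok() {
    available=" $1 "
    for tok in $2; do
        case "$tok" in
            +?*)
                case "$available" in
                    *" ${tok#+} "*) ;;
                    *) return 1 ;;
                esac ;;
            -?*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

# Example use inside a cgroup directory:
#   subtree_enable_ok "$(cat cgroup.controllers)" "+cpu +memory -io" &&
#       echo "+cpu +memory -io" > cgroup.subtree_control
```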
398 Only controllers which are listed in "cgroup.controllers" can be
403 Enabling a controller in a cgroup indicates that the distribution of
417 the cgroup's children, enabling it creates the controller's interface
423 "cgroup." are owned by the parent rather than the cgroup itself.
429 Resources are distributed top-down and a cgroup can further distribute
431 parent.  This means that all non-root "cgroup.subtree_control" files
433 "cgroup.subtree_control" file.  A controller can be enabled only if
444 controllers enabled in their "cgroup.subtree_control" files.
451 The root cgroup is exempt from this restriction.  Root contains
454 controllers.  How resource consumption in the root cgroup is governed
460 enabled controller in the cgroup's "cgroup.subtree_control".  This is
462 populated cgroup.  To control resource distribution of a cgroup, the
463 cgroup must create children and transfer all its processes to the
464 children before enabling controllers in its "cgroup.subtree_control"
474 A cgroup can be delegated in two ways.  First, to a less privileged
475 user by granting write access to the directory and its "cgroup.procs",
476 "cgroup.threads" and "cgroup.subtree_control" files to the user.
478 cgroup namespace on namespace creation.
484 kernel rejects writes to all files other than "cgroup.procs" and
485 "cgroup.subtree_control" on a namespace root from inside the
496 Currently, cgroup doesn't impose any restrictions on the number of
509 to migrate a target process into a cgroup by writing its PID to the
510 "cgroup.procs" file.
512 - The writer must have write access to the "cgroup.procs" file.
514 - The writer must have write access to the "cgroup.procs" file of the
526   ~ cgroup    ~      \ C01
531 currently in C10 into "C00/cgroup.procs".  U0 has write access to the
532 file; however, the common ancestor of the source cgroup C10 and the
533 destination cgroup C00 is above the points of delegation and U0 would
534 not have write access to its "cgroup.procs" files and thus the write
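The common ancestor that the second condition refers to can be computed from the two cgroup paths alone. The sketch below assumes absolute, normalized paths without trailing slashes, in the form shown in "/proc/$PID/cgroup". The function name common_ancestor and the example paths are illustrative only.

```shell
# Print the deepest cgroup path that is an ancestor of (or equal to)
# both $1 and $2.  Returns 2 if either path is not absolute.
common_ancestor() {
    a=$1
    b=$2
    case "$a" in /*) ;; *) return 2 ;; esac
    case "$b" in /*) ;; *) return 2 ;; esac
    # Walk up from $a until the current prefix also contains $b.
    while [ "$a" != "/" ]; do
        case "$b/" in
            "$a"/*) printf '%s\n' "$a"; return 0 ;;
        esac
        a=$(dirname "$a")
    done
    printf '/\n'
}
```

A writer would then need write access to the "cgroup.procs" file in the directory that this function prints, in addition to the destination's own file.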
557 should be assigned to a cgroup according to the system's logical and
566 Interface files for a cgroup and its children cgroups occupy the same
570 All cgroup core interface files are prefixed with "cgroup." and each
578 cgroup doesn't do anything to prevent name collisions and it's the
585 cgroup controllers implement several resource distribution schemes
625 "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
632 A cgroup is protected up to the configured amount of the resource
653 A cgroup is exclusively allocated a certain amount of a finite
717 - The root cgroup should be exempt from resource control and thus
755     # cat cgroup-example-interface-file
761     # echo 125 > cgroup-example-interface-file
765     # echo "default 125" > cgroup-example-interface-file
769     # echo "8:16 170" > cgroup-example-interface-file
773     # echo "8:0 default" > cgroup-example-interface-file
774     # cat cgroup-example-interface-file
787 All cgroup core files are prefixed with "cgroup."
789   cgroup.type
794 	When read, it indicates the current type of the cgroup, which
797 	- "domain" : A normal valid domain cgroup.
799 	- "domain threaded" : A threaded domain cgroup which is
802 	- "domain invalid" : A cgroup which is in an invalid state.
804 	  be allowed to become a threaded cgroup.
806 	- "threaded" : A threaded cgroup which is a member of a
809 	A cgroup can be turned into a threaded cgroup by writing
812   cgroup.procs
817 	the cgroup one-per-line.  The PIDs are not ordered and the
819 	to another cgroup and then back or the PID got recycled while
823 	the PID to the cgroup.  The writer should match all of the
826 	- It must have write access to the "cgroup.procs" file.
828 	- It must have write access to the "cgroup.procs" file of the
834 	In a threaded cgroup, reading this file fails with EOPNOTSUPP
836 	supported and moves every thread of the process to the cgroup.
838   cgroup.threads
843 	the cgroup one-per-line.  The TIDs are not ordered and the
845 	another cgroup and then back or the TID got recycled while
849 	TID to the cgroup.  The writer should match all of the
852 	- It must have write access to the "cgroup.threads" file.
854 	- The cgroup that the thread is currently in must be in the
855           same resource domain as the destination cgroup.
857 	- It must have write access to the "cgroup.procs" file of the
863   cgroup.controllers
868 	the cgroup.  The controllers are not ordered.
870   cgroup.subtree_control
876 	cgroup to its children.
885   cgroup.events
892 		1 if the cgroup or its descendants contain any live
895 		1 if the cgroup is frozen; otherwise, 0.
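"cgroup.events" is a flat-keyed file, so a single field can be read with a short filter. The helper below is a sketch. The name flat_key is made up, and the input is assumed to be one "KEY VALUE" pair per line on stdin.

```shell
# Print the value stored under key $1 in flat-keyed input on stdin.
flat_key() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Example use inside a cgroup directory:
#   flat_key populated < cgroup.events
#   flat_key frozen < cgroup.events
```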
897   cgroup.max.descendants
902 	an attempt to create a new cgroup in the hierarchy will fail.
904   cgroup.max.depth
907 	Maximum allowed descent depth below the current cgroup.
909 	an attempt to create a new child cgroup will fail.
911   cgroup.stat
918 		Total number of dying descendant cgroups. A cgroup becomes
919 		dying after being deleted by a user. The cgroup will remain
923 		A process can't enter a dying cgroup under any circumstances,
924 		and a dying cgroup can't revive.
926 		A dying cgroup can consume system resources not exceeding
927 		limits, which were active at the moment of cgroup deletion.
929   cgroup.freeze
933 	Writing "1" to the file causes freezing of the cgroup and all
935 	be stopped and will not run until the cgroup is explicitly
936 	unfrozen. Freezing of the cgroup may take some time; when this action
937 	is completed, the "frozen" value in the cgroup.events control file
941 	A cgroup can be frozen either by its own settings, or by settings
943 	cgroup will remain frozen.
945 	Processes in the frozen cgroup can be killed by a fatal signal.
946 	They also can enter and leave a frozen cgroup: either by an explicit
947 	move by a user, or if freezing of the cgroup races with fork().
948 	If a process is moved to a frozen cgroup, it stops. If a process is
949 	moved out of a frozen cgroup, it resumes running.
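Because freezing may take some time, a script that needs the cgroup to be fully frozen has to wait for "frozen 1" in "cgroup.events" after writing to "cgroup.freeze". The sketch below polls in a simple loop rather than using file modified notifications. The function name freeze_and_wait, the retry count and the one-second interval are arbitrary choices for illustration.

```shell
# $1: path to a cgroup directory
# $2: maximum number of checks (default 10)
# Writes "1" to cgroup.freeze, then checks cgroup.events until it
# reports "frozen 1" or the checks run out.
freeze_and_wait() {
    dir=$1
    tries=${2:-10}
    echo 1 > "$dir/cgroup.freeze" || return 1
    while [ "$tries" -gt 0 ]; do
        if grep -qx 'frozen 1' "$dir/cgroup.events"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example use (needs write access to the cgroup):
#   freeze_and_wait /sys/fs/cgroup/test-cgroup || echo "not frozen yet"
```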
951 	Frozen status of a cgroup doesn't affect any cgroup tree operations:
952 	it's possible to delete a frozen (and empty) cgroup, as well as
975 the root cgroup.  Be aware that system management software may already
977 process, and these processes may need to be moved to the root cgroup
1076 cgroup are tracked so that the total memory consumption can be
1100 	The total amount of memory currently being used by the cgroup
1107 	Hard memory protection.  If the memory usage of a cgroup
1108 	is within its effective min boundary, the cgroup's memory
1118 	(child cgroup or cgroups are requiring more protected memory
1119 	than parent will allow), then each child cgroup will get
1126 	If a memory cgroup is not populated with processes,
1134 	cgroup is within its effective low boundary, the cgroup's
1144 	(child cgroup or cgroups are requiring more protected memory
1145 	than parent will allow), then each child cgroup will get
1157 	control memory usage of a cgroup.  If a cgroup's usage goes
1158 	over the high boundary, the processes of the cgroup are
1169 	mechanism.  If a cgroup's memory usage reaches this limit and
1170 	can't be reduced, the OOM killer is invoked in the cgroup.
1189 	Determines whether the cgroup should be treated as
1191 	all tasks belonging to the cgroup or to its descendants
1192 	(if the memory cgroup is not a leaf cgroup) are killed
1199 	If the OOM killer is invoked in a cgroup, it's not going
1200 	to kill any tasks outside of this cgroup, regardless
1211 	hierarchy. For the local events at the cgroup level see
1215 		The number of times the cgroup is reclaimed due to
1221 		The number of times processes of the cgroup are
1224 		cgroup whose memory usage is capped by the high limit
1229 		The number of times the cgroup's memory usage was
1231 		fails to bring it down, the cgroup goes to OOM state.
1234 		The number of times the cgroup's memory usage was
1242 		The number of processes belonging to this cgroup
1247 	to the cgroup i.e. not hierarchical. The file modified event
1253 	This breaks down the cgroup's memory footprint into different
1390 	This breaks down the cgroup's memory footprint into different
1416 	The total amount of swap currently being used by the cgroup
1423 	Swap usage throttle limit.  If a cgroup's swap usage exceeds
1427 	This limit marks a point of no return for the cgroup. It is NOT
1430 	prohibits swapping past a set amount, but lets the cgroup
1439 	Swap usage hard limit.  If a cgroup's swap usage reaches this
1440 	limit, anonymous memory of the cgroup will not be swapped out.
1449 		The number of times the cgroup's swap usage was over
1453 		The number of times the cgroup's swap usage was about
1483 throttles the offending cgroup, a management agent has ample
1487 Determining whether a cgroup has enough memory is not trivial as
1501 A memory area is charged to the cgroup which instantiated it and stays
1502 charged to the cgroup until the area is released.  Migrating a process
1503 to a different cgroup doesn't move the memory usages that it
1504 instantiated while in the previous cgroup to the new cgroup.
1507 To which cgroup the area will be charged is indeterminate; however,
1508 over time, the memory area is likely to end up in a cgroup which has
1511 If a cgroup sweeps a considerable amount of memory which is expected
1552 	cgroup.
1607 	cgroup.
1644 	If needed, tools/cgroup/iocost_coef_gen.py can be used to
1655 	the cgroup can use in relation to its siblings.
1727 per-cgroup dirty memory states are examined and the more restrictive
1730 cgroup writeback requires explicit support from the underlying
1731 filesystem.  Currently, cgroup writeback is implemented on ext2, ext4,
1733 attributed to the root cgroup.
1736 which affects how cgroup ownership is tracked.  Memory is tracked per
1738 inode is assigned to a cgroup and all IO requests to write dirty pages
1739 from the inode are attributed to that cgroup.
1741 As cgroup ownership for memory is tracked per page, there can be pages
1745 cgroup becomes the majority over a certain period of time, switches
1746 the ownership of the inode to that cgroup.
1749 mostly dirtied by a single cgroup even when the main writing cgroup
1759 The sysctl knobs which affect writeback behavior are applied to cgroup
1763 	These ratios apply the same to cgroup writeback with the
1768 	For cgroup writeback, this is calculated into ratio against
1776 This is a cgroup v2 controller for IO workload protection.  You provide a group
1855 A single attribute controls the behavior of the I/O priority cgroup policy,
1909 The process number controller is used to allow a cgroup to stop any
1913 The number of tasks in a cgroup can be exhausted in ways which other
1934 	The number of processes currently in the cgroup and its
1937 Organisational operations are not blocked by cgroup policies, so it is
1940 processes to the cgroup such that pids.current is larger than
1941 pids.max.  However, it is not possible to violate a cgroup PID policy
1943 of a new process would cause a cgroup policy to be violated.
1951 specified in the cpuset interface files in a task's current cgroup.
1969 	cgroup.  The actual list of CPUs to be granted, however, is
1979 	An empty value indicates that the cgroup is using the same
1980 	setting as the nearest cgroup ancestor with a non-empty
1991 	cgroup by its parent.  These CPUs are allowed to be used by
1992 	tasks within the current cgroup.
1995 	all the CPUs from the parent cgroup that can be available to
1996 	be used by this cgroup.  Otherwise, it should be a subset of
2008 	this cgroup.  The actual list of memory nodes granted, however,
2018 	An empty value indicates that the cgroup is using the same
2019 	setting as the nearest cgroup ancestor with a non-empty
2031 	this cgroup by its parent. These memory nodes are allowed to
2032 	be used by tasks within the current cgroup.
2035 	parent cgroup that will be available to be used by this cgroup.
2044 	cpuset-enabled cgroups.  This flag is owned by the parent cgroup
2052 	When set to be a partition root, the current cgroup is the
2056 	cgroup is always a partition root.
2059 	It can only be set in a cgroup if all the following conditions
2064 	2) The parent cgroup is a partition root.
2072 	effective CPUs of the parent cgroup.  Once it is set, this
2097 	granted by the parent cgroup.
2100 	in "cpuset.cpus" can be granted by the parent cgroup or the
2101 	parent cgroup is no longer a partition root itself.  In this
2104 	The cpu affinity of all the tasks in the cgroup will then be
2124 on top of cgroup BPF. To control access to device files, a user may
2172 	It exists for all cgroups except root.
2190 	cgroups except root.
2194 	The default value is "max".  It exists for all cgroups except root.
2204 	are local to the cgroup i.e. not hierarchical. The file modified event
2215 always be filtered by cgroup v2 path.  The controller can still be
2226 CPU controller root cgroup process behaviour
2229 When distributing CPU cycles in the root cgroup each thread in this
2230 cgroup is treated as if it was hosted in a separate child cgroup of the
2231 root cgroup. This child cgroup weight is dependent on its thread nice
2239 IO controller root cgroup process behaviour
2242 Root cgroup processes are hosted in an implicit leaf child node.
2244 account as if it was a normal child cgroup of the root cgroup with a
2254 cgroup namespace provides a mechanism to virtualize the view of the
2255 "/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
2256 flag can be used with clone(2) and unshare(2) to create a new cgroup
2257 namespace.  The process running inside the cgroup namespace will have
2258 its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
2259 cgroupns root is the cgroup of the process at the time of creation of
2260 the cgroup namespace.
2262 Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
2263 complete path of the cgroup of a process.  In a container setup where
2265 "/proc/$PID/cgroup" file may leak potential system level information
2268   # cat /proc/self/cgroup
2272 and undesirable to expose to the isolated processes.  cgroup namespace
2274 creating a cgroup namespace, one would see::
2276   # ls -l /proc/self/ns/cgroup
2277   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2278   # cat /proc/self/cgroup
2283   # ls -l /proc/self/ns/cgroup
2284   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2285   # cat /proc/self/cgroup
2288 When some thread from a multi-threaded process unshares its cgroup
2293 A cgroup namespace is alive as long as there are processes inside or
2294 mounts pinning it.  When the last usage goes away, the cgroup
2302 The 'cgroupns root' for a cgroup namespace is the cgroup in which the
2304 /batchjobs/container_id1 cgroup calls unshare, cgroup
2306 init_cgroup_ns, this is the real root ('/') cgroup.
2308 The cgroupns root cgroup does not change even if the namespace creator
2309 process later moves to a different cgroup::
2311   # ~/unshare -c # unshare cgroupns in some cgroup
2312   # cat /proc/self/cgroup
2315   # echo 0 > sub_cgrp_1/cgroup.procs
2316   # cat /proc/self/cgroup
2319 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2321 Processes running inside the cgroup namespace will be able to see
2322 cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
2327   # echo 7353 > sub_cgrp_1/cgroup.procs
2328   # cat /proc/7353/cgroup
2331 From the initial cgroup namespace, the real cgroup path will be
2334   $ cat /proc/7353/cgroup
2337 From a sibling cgroup namespace (that is, a namespace rooted at a
2338 different cgroup), the cgroup path relative to its own cgroup
2339 namespace root will be shown.  For instance, if PID 7353's cgroup
2342   # cat /proc/7353/cgroup
2346 it's relative to the cgroup namespace root of the caller.
2352 Processes inside a cgroup namespace can move into and out of the
2358   # cat /proc/7353/cgroup
2360   # echo 7353 > batchjobs/container_id2/cgroup.procs
2361   # cat /proc/7353/cgroup
2364 Note that this kind of setup is not encouraged.  A task inside cgroup
2367 setns(2) to another cgroup namespace is allowed when:
2370 (b) the process has CAP_SYS_ADMIN against the target cgroup
2373 No implicit cgroup changes happen with attaching to another cgroup
2375 process under the target cgroup namespace root.
2381 Namespace specific cgroup hierarchy can be mounted by a process
2382 running inside a non-init cgroup namespace::
2386 This will mount the unified cgroup hierarchy with cgroupns root as the
2390 The virtualization of /proc/self/cgroup file combined with restricting
2391 the view of cgroup hierarchy by namespace-private cgroupfs mount
2392 provides a properly isolated cgroup view inside the container.
2399 where interacting with cgroup is necessary.  cgroup core and
2406 A filesystem can support cgroup writeback by updating
2412 	associates the bio with the inode's owner cgroup and the
2423 With writeback bios annotated, cgroup support can be enabled per
2425 selective disabling of cgroup writeback support which is helpful when
2429 wbc_init_bio() binds the specified bio to its cgroup.  Depending on
2445 - The "tasks" file is removed and "cgroup.procs" is not sorted.
2447 - "cgroup.clone_children" is removed.
2449 - /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
2459 cgroup v1 allowed an arbitrary number of hierarchies and each
2481 It greatly complicated cgroup core implementation but more importantly
2482 the support for multiple hierarchies restricted how cgroup could be
2486 that a thread's cgroup membership couldn't be described in finite
2512 cgroup v1 allowed threads of a process to belong to different cgroups.
2523 cgroup v1 had an ambiguously defined delegation model which got abused
2527 effectively raised cgroup to the status of a syscall-like API exposed
2530 First of all, cgroup has a fundamentally inadequate interface to be
2532 extract the path on the target hierarchy from /proc/self/cgroup,
2539 cgroup controllers implemented a number of knobs which would never be
2541 system-management pseudo filesystem.  cgroup ended up with interface
2545 effectively abusing cgroup as a shortcut to implementing public APIs
2556 cgroup v1 allowed threads to be in any cgroups which created an
2557 interesting problem where threads belonging to a parent cgroup and its
2563 mapped nice levels to cgroup weights.  This worked for some cases but
2572 cgroup to host the threads.  The hidden leaf had its own copies of all
2588 made cgroup as a whole highly inconsistent.
2590 This clearly is a problem which needs to be addressed from cgroup core
2597 cgroup v1 grew without oversight and developed a large number of
2598 idiosyncrasies and inconsistencies.  One issue on the cgroup core side
2599 was how an empty cgroup was notified - a userland helper binary was
2608 cgroup.  Some controllers exposed a large amount of inconsistent
2611 There also was no consistency across controllers.  When a new cgroup
2619 cgroup v2 establishes common conventions where appropriate and updates
2644 reserve.  A cgroup enjoys reclaim protection when it's within its
2687 cgroup design was that global or parental pressure would always be
2696 that cgroup controllers should account and limit specific physical