==========================
Memory Resource Controller
==========================

NOTE:
      This document is hopelessly outdated and it asks for a complete
      rewrite. It still contains useful information so we are keeping it
      here, but make sure to check the current code if you need a deeper
      understanding.

NOTE:
      The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse the memory controller
      used here with the memory controller that is used in hardware.

(For editors) In this document:
      When we mention a cgroup (cgroupfs's directory) with memory controller,
      we call it "memory cgroup". When you see git-log and source code, you'll
      see that patch titles and function names tend to use "memcg".
      In this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of April 2010)

Features:

 - accounting anonymous pages, file caches, swap caches usage and limiting them.
 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) charges when a task moves between cgroups is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See Section 2.7)

Brief summary of control files.

==================================== ==========================================
 tasks                               attach a task(thread) and show list of
                                     threads
 cgroup.procs                        show list of processes
 cgroup.event_control                an interface for event_fd()
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show the number of times memory usage
                                     hit the limit
 memory.memsw.failcnt                show the number of times memory+Swap
                                     usage hit the limit
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
 memory.oom_control                  set/show oom controls
 memory.numa_stat                    show memory usage per NUMA node
 memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
                                     This knob is deprecated and shouldn't be
                                     used. It is planned to be removed in the
                                     foreseeable future.
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show the number of times kernel memory
                                     usage hit the limit
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show the number of times tcp buf memory
                                     usage hit the limit
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller.
At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

                +--------------------+
                |  mem_cgroup        |
                |  (page_counter)    |
                +--------------------+
                 /            ^      \
                /             |       \
        +---------------+     |     +---------------+
        | mm_struct     |     |.... | mm_struct     |
        |               |     |     |               |
        +---------------+     |     +---------------+
                              |
                              +---------------+
                                              |
        +---------------+           +-------v-------+
        | page          +---------> | page_cgroup   |
        |               |           |               |
        +---------------+           +---------------+

        (Figure 1: Hierarchy of Accounting)


Figure 1 shows the important aspects of the controller

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.
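
The per-cgroup accounting above can be observed from user space. A minimal
sketch, assuming a cgroup-v1 memory hierarchy mounted at
/sys/fs/cgroup/memory (the group name "demo" is arbitrary, not from this
document):

```shell
# Sketch: every page charged on behalf of tasks in "demo" shows up in the
# group's own usage counter; paths are assumptions about the mount point.
CG=/sys/fs/cgroup/memory/demo
USAGE_FILE="memory.usage_in_bytes"
if mkdir -p "$CG" 2>/dev/null && [ -w "$CG/tasks" ]; then
    echo $$ > "$CG/tasks"     # this shell's future charges go to "demo"
    cat "$CG/$USAGE_FILE"     # bytes currently charged to the group
else
    echo "cgroup v1 memory hierarchy not available here"
fi
```

The guard keeps the sketch harmless on systems where the v1 hierarchy is not
mounted or not writable.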

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into inode (radix-tree). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after being added to the swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
Since the page's memcg is recorded into swap regardless of whether memsw is
enabled, the page will be accounted after swapin.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out of control from the
VM's point of view.
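
The accounting classes described above (page cache, mapped anon, swap) are
visible per cgroup in memory.stat. A small sketch; the group path "0" is an
assumption taken from the examples later in this document:

```shell
# Sketch: read the per-class byte counts that the accounting rules above
# produce. The path is an assumption (cgroup v1 mount, group "0").
STAT=/sys/fs/cgroup/memory/0/memory.stat
FIELDS="cache rss swap"
if [ -r "$STAT" ]; then
    for f in $FIELDS; do
        grep "^$f " "$STAT"   # bytes accounted in each class
    done
else
    echo "memory.stat not readable here"
fi
```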

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

2.4 Swap Extension
------------------

Swap usage is always recorded for each cgroup. The Swap Extension allows you
to read and limit it.

When CONFIG_SWAP is enabled, the following files are added.

 - memory.memsw.usage_in_bytes
 - memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under a 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
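
The example above can be sketched as follows; the cgroup path is an
assumption (v1 hierarchy, group "0"):

```shell
# Sketch: with a 2G memory limit, capping memory+swap at 3G keeps a runaway
# 6G allocation from consuming all 4G of swap. Paths are assumptions.
CG=/sys/fs/cgroup/memory/0
MEMSW_LIMIT=$((3 * 1024 * 1024 * 1024))   # 3G in bytes
if [ -w "$CG/memory.memsw.limit_in_bytes" ]; then
    echo 2G > "$CG/memory.limit_in_bytes"                    # memory alone
    echo "$MEMSW_LIMIT" > "$CG/memory.memsw.limit_in_bytes"  # memory+swap
fi
echo "memory+swap capped at $MEMSW_LIMIT bytes"
```

At most 1G of the group's pages can then ever reach swap, no matter how much
the task over-allocates.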

**why 'memory+swap' rather than swap**

The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
moving the account from memory to swap... there is no change in the usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting the global LRU, the memory+swap limit is better than just limiting
swap, from an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and file
caches are dropped. But as mentioned above, the global LRU can still swap out
memory from it for the sanity of the system's memory management state. You
can't forbid it by cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per-cgroup LRU which has the same structure as the
global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.
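
Per-cgroup reclaim can also be exercised from user space by shrinking the
limit below current usage; the kernel then reclaims from the group's own LRU
before the write returns (and the write fails if not enough can be
reclaimed). A sketch, with the path and the 64M value as assumptions:

```shell
# Sketch: lowering the limit forces reclaim from this group's own LRU only.
# Paths and the new limit are assumptions, not from this document.
CG=/sys/fs/cgroup/memory/0
NEW_LIMIT="64M"
if [ -w "$CG/memory.limit_in_bytes" ]; then
    before=$(cat "$CG/memory.usage_in_bytes")
    echo "$NEW_LIMIT" > "$CG/memory.limit_in_bytes" || echo "reclaim fell short"
    after=$(cat "$CG/memory.usage_in_bytes")
    echo "usage: $before -> $after"
else
    echo "limit file not writable here"
fi
```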

NOTE:
  Reclaim does not work for the root cgroup, since we cannot set any
  limits on the root cgroup.

Note2:
  When panic_on_oom is set to "2", the whole system will panic.

When an oom event notifier is registered, the event will be delivered.
(See oom_control section)

2.6 Locking
-----------

   lock_page_cgroup()/unlock_page_cgroup() should not be called under
   the i_pages lock.

   Other lock order is following:

   PG_locked.
     mm->page_table_lock
       pgdat->lru_lock
         lock_page_cgroup.

   In many cases, just lock_page_cgroup() is called.

   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
   pgdat->lru_lock; it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default.
But it can be disabled system-wide by passing cgroup.memory=nokmem to the
kernel at boot time. In this case, kernel memory will not be accounted at all.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting into
  kernel memory, we prevent new processes from being created when the kernel
  memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created the first time the cache is touched from
  inside the memcg. The creation is done lazily, so some objects can still be
  skipped while the cache is being created. All objects in a slab page should
  belong to the same memcg. This only fails to hold when a task is migrated to
  a different memcg during the page allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled individually
  per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since
    the box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of the
    QoS.

WARNING:
    In the current implementation, memory reclaim will NOT be
    triggered for a cgroup when it hits K while staying below U, which makes
    this setup impractical.

U != 0, K >= U:
    Since kmem charges are also fed to the user counter, reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

        # mount -t tmpfs none /sys/fs/cgroup
        # mkdir /sys/fs/cgroup/memory
        # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

        # mkdir /sys/fs/cgroup/memory/0
        # echo $$ > /sys/fs/cgroup/memory/0/tasks

Since we're now in the 0 cgroup, we can alter the memory limit::

        # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
  We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
  mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
  Gibibytes.)

NOTE:
  We can write "-1" to reset the ``*.limit_in_bytes`` (unlimited).

NOTE:
  We cannot set limits on the root cgroup any more.

::

        # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
        4194304

We can check the usage::

        # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
        1216512

A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system.
The user is required to re-read
this file after a write to guarantee the value committed by the kernel::

        # echo 1 > memory.limit_in_bytes
        # cat memory.limit_in_bytes
        4096

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.

The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages is shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the pure memory controller's
overhead, testing on tmpfs will give you good numbers for the small overheads.
Example: do a kernel make on tmpfs.

Page-fault scalability is also important. When measuring parallel
page-fault performance, a multi-process test may be better than a
multi-thread test because the latter adds noise from shared objects/status.

But the above two are testing extreme situations.
Trying a usual test under the memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).

To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
seeing what happens will be helpful.

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move the charges of a task along with task migration.
See 8. "Move charges at task migration"

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)

We move the stats to the root (if use_hierarchy==0) or the parent (if
use_hierarchy==1), and there is no change in the charge except for uncharging
from the child.

Charges recorded in swap information are not updated at removal of a cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.
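
The removal steps above can be sketched as follows, assuming group "0"
exists; the migrate-then-reclaim-then-rmdir order is a suggestion, and
force_empty is described in section 5.1:

```shell
# Sketch: migrate tasks back to the root group, reclaim leftover page-cache
# charges, then remove the directory. Paths are assumptions.
CG=/sys/fs/cgroup/memory/0
STEPS="migrate force_empty rmdir"
if [ -d "$CG" ] && [ -w /sys/fs/cgroup/memory/tasks ]; then
    while read -r pid; do
        echo "$pid" > /sys/fs/cgroup/memory/tasks 2>/dev/null
    done < "$CG/tasks"
    echo 0 > "$CG/memory.force_empty" 2>/dev/null
    rmdir "$CG"
fi
echo "removal order: $STEPS"
```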
505*4882a593Smuzhiyun 506*4882a593SmuzhiyunAbout use_hierarchy, see Section 6. 507*4882a593Smuzhiyun 508*4882a593Smuzhiyun5. Misc. interfaces 509*4882a593Smuzhiyun=================== 510*4882a593Smuzhiyun 511*4882a593Smuzhiyun5.1 force_empty 512*4882a593Smuzhiyun--------------- 513*4882a593Smuzhiyun memory.force_empty interface is provided to make cgroup's memory usage empty. 514*4882a593Smuzhiyun When writing anything to this:: 515*4882a593Smuzhiyun 516*4882a593Smuzhiyun # echo 0 > memory.force_empty 517*4882a593Smuzhiyun 518*4882a593Smuzhiyun the cgroup will be reclaimed and as many pages reclaimed as possible. 519*4882a593Smuzhiyun 520*4882a593Smuzhiyun The typical use case for this interface is before calling rmdir(). 521*4882a593Smuzhiyun Though rmdir() offlines memcg, but the memcg may still stay there due to 522*4882a593Smuzhiyun charged file caches. Some out-of-use page caches may keep charged until 523*4882a593Smuzhiyun memory pressure happens. If you want to avoid that, force_empty will be useful. 524*4882a593Smuzhiyun 525*4882a593Smuzhiyun Also, note that when memory.kmem.limit_in_bytes is set the charges due to 526*4882a593Smuzhiyun kernel pages will still be seen. This is not considered a failure and the 527*4882a593Smuzhiyun write will still return success. In this case, it is expected that 528*4882a593Smuzhiyun memory.kmem.usage_in_bytes == memory.usage_in_bytes. 529*4882a593Smuzhiyun 530*4882a593Smuzhiyun About use_hierarchy, see Section 6. 531*4882a593Smuzhiyun 532*4882a593Smuzhiyun5.2 stat file 533*4882a593Smuzhiyun------------- 534*4882a593Smuzhiyun 535*4882a593Smuzhiyunmemory.stat file includes following statistics 536*4882a593Smuzhiyun 537*4882a593Smuzhiyunper-memory cgroup local status 538*4882a593Smuzhiyun^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 539*4882a593Smuzhiyun 540*4882a593Smuzhiyun=============== =============================================================== 541*4882a593Smuzhiyuncache # of bytes of page cache memory. 
542*4882a593Smuzhiyunrss # of bytes of anonymous and swap cache memory (includes 543*4882a593Smuzhiyun transparent hugepages). 544*4882a593Smuzhiyunrss_huge # of bytes of anonymous transparent hugepages. 545*4882a593Smuzhiyunmapped_file # of bytes of mapped file (includes tmpfs/shmem) 546*4882a593Smuzhiyunpgpgin # of charging events to the memory cgroup. The charging 547*4882a593Smuzhiyun event happens each time a page is accounted as either mapped 548*4882a593Smuzhiyun anon page(RSS) or cache page(Page Cache) to the cgroup. 549*4882a593Smuzhiyunpgpgout # of uncharging events to the memory cgroup. The uncharging 550*4882a593Smuzhiyun event happens each time a page is unaccounted from the cgroup. 551*4882a593Smuzhiyunswap # of bytes of swap usage 552*4882a593Smuzhiyundirty # of bytes that are waiting to get written back to the disk. 553*4882a593Smuzhiyunwriteback # of bytes of file/anon cache that are queued for syncing to 554*4882a593Smuzhiyun disk. 555*4882a593Smuzhiyuninactive_anon # of bytes of anonymous and swap cache memory on inactive 556*4882a593Smuzhiyun LRU list. 557*4882a593Smuzhiyunactive_anon # of bytes of anonymous and swap cache memory on active 558*4882a593Smuzhiyun LRU list. 559*4882a593Smuzhiyuninactive_file # of bytes of file-backed memory on inactive LRU list. 560*4882a593Smuzhiyunactive_file # of bytes of file-backed memory on active LRU list. 561*4882a593Smuzhiyununevictable # of bytes of memory that cannot be reclaimed (mlocked etc). 
=============== ===============================================================

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to the
                          hierarchy under which the memory cgroup is
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to the
                          hierarchy under which the memory cgroup is.

total_<counter>           # hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, e.g. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
  recent_rotated means the recent frequency of LRU rotation.
  recent_scanned means the recent # of scans of the LRU.
  These are shown for easier debugging; please see the code for the exact
  meanings.

Note:
  Only anonymous and swap cache memory is listed as part of the 'rss' stat.
  This should not be confused with the true 'resident set size' or the
  amount of physical memory used by the cgroup.

  'rss + mapped_file' will give you the resident set size of the cgroup.

  (Note: file and shmem may be shared among other cgroups. In that case,
  mapped_file is accounted only when the memory cgroup is the owner of the
  page cache.)

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike during global reclaim, limit reclaim
enforces that a swappiness of 0 really prevents any swapping even if
swap storage is available. This might lead to a memcg OOM kill
if there are no file pages to reclaim.

5.4 failcnt
-----------

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

  # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, like other kernel components, the memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
affected by this method and doesn't show the 'exact' value of memory (and
swap) usage; it's a fuzzy value for efficient access. (Of course, when
necessary, it's synchronized.) If you want to know the more exact memory
usage, you should use the RSS+CACHE(+SWAP) values in memory.stat (see 5.2).

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the NUMA locality information within
a memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts including "hierarchical_<counter>", which sums up all
hierarchical children's values in addition to the memcg's own value.

The output format of memory.numa_stat is::

  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is the sum of file + anon + unevictable.

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider, for example, the following cgroup filesystem
hierarchy::

           root
         /  |   \
        /   |    \
       a    b     c
                  | \
                  |  \
                  d   e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up until the root (i.e., c and root)
that have memory.use_hierarchy enabled. If one of the ancestors goes over its
limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor.

6.1 Enabling hierarchical accounting and reclaim
------------------------------------------------

A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to the memory.use_hierarchy file of the root
cgroup::

  # echo 1 > memory.use_hierarchy

The feature can be disabled by::

  # echo 0 > memory.use_hierarchy

NOTE1:
  Enabling/disabling will fail if either the cgroup already has other
  cgroups created below it, or if the parent cgroup has use_hierarchy
  enabled.

NOTE2:
  When panic_on_oom is set to "2", the whole system will panic in
  case of an OOM event in any cgroup.

7. Soft limits
==============

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but they do their best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup.
Currently, soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface
-------------

Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)::

  # echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use::

  # echo 1G > memory.soft_limit_in_bytes

NOTE1:
  Soft limits take effect over a long period of time, since they involve
  reclaiming memory for balancing between memory cgroups.
NOTE2:
  It is recommended to always set the soft limit below the hard limit,
  otherwise the hard limit will take precedence.

8. Move charges at task migration
=================================

Users can move charges associated with a task along with task migration, that
is, uncharge the task's pages from the old cgroup and charge them to the new
cgroup. This feature is not supported in !CONFIG_MMU environments because of
lack of page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

  # echo (some positive value) > memory.move_charge_at_immigrate

Note:
  Each bit of move_charge_at_immigrate has its own meaning about what type
  of charges should be moved. See 8.2 for details.
Note:
  Charges are moved only when you move mm->owner, in other words,
  the leader of a thread group.
Note:
  If we cannot find enough space for the task in the destination cgroup, we
  try to make space by reclaiming memory. Task migration may fail if we
  cannot make enough space.
Note:
  It can take several seconds if you move many charges.

And if you want to disable it again::

  # echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
|   | will be moved even if the task hasn't done a page fault, i.e. they might |
|   | not be the task's "RSS", but other tasks' "RSS" that maps the same file. |
|   | The mapcount of the page is ignored (the page can be moved even if       |
|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to    |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+

8.3 TODO
--------

- All of the moving charge operations are done under cgroup_mutex. It's not
  good behavior to hold the mutex too long, so we may need some tricks.

9. Memory thresholds
====================

The memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows one to register multiple memory and memsw
thresholds and get notifications when they are crossed.

To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

The application will be notified through the eventfd when memory usage crosses
the threshold in either direction.

This is applicable to both root and non-root cgroups.

10. OOM Control
===============

The memory.oom_control file is for OOM notification and other controls.

The memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). It allows one to register multiple OOM notification
deliveries and to receive a notification when an OOM happens.

To register a notifier, an application must:

- create an eventfd using eventfd(2)
- open the memory.oom_control file
- write a string like "<event_fd> <fd of memory.oom_control>" to
  cgroup.event_control

The application will be notified through the eventfd when an OOM happens.
OOM notification doesn't work for the root cgroup.
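Both registration sequences above (memory/memsw thresholds and the OOM
notifier) follow the same eventfd(2) pattern, so they can be sketched with one
small helper. This is only a minimal illustration of the documented steps, not
kernel code: the function names and the cgroup path are hypothetical, the
script needs root and a mounted cgroup v1 memory controller to actually run,
and ``os.eventfd()`` requires Python 3.10+::

.. code-block:: python

    import os

    def event_control_line(event_fd, target_fd, args=""):
        # Build the "<event_fd> <fd> [args]" string that is written to
        # cgroup.event_control; 'args' is e.g. a threshold in bytes, or
        # empty for memory.oom_control.
        return f"{event_fd} {target_fd} {args}".strip()

    def register_notifier(cgroup, target_file, args=""):
        # Hypothetical helper; 'cgroup' is a path such as
        # /sys/fs/cgroup/memory/foo (an assumed mount point).
        efd = os.eventfd(0)  # available since Python 3.10
        tfd = os.open(os.path.join(cgroup, target_file), os.O_RDONLY)
        with open(os.path.join(cgroup, "cgroup.event_control"), "w") as ctrl:
            ctrl.write(event_control_line(efd, tfd, args))
        # Keep both fds open; each notification is delivered as an 8-byte
        # counter value readable from efd.
        return efd, tfd

For example, ``register_notifier(path, "memory.usage_in_bytes", "7340032")``
would request a notification when usage crosses 7 MiB, and
``register_notifier(path, "memory.oom_control")`` an OOM notification,
matching the two step lists above.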

You can disable the OOM-killer by writing "1" to the memory.oom_control file,
as::

  # echo 1 > memory.oom_control

If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.

For running them, you have to relax the memory cgroup's OOM status by

  * enlarging the limit or reducing usage.

To reduce usage,

  * kill some tasks.
  * move some tasks to another group with account migration.
  * remove some files (on tmpfs?)

Then, stopped tasks will work again.

On reading, the current status of OOM is shown.

  - oom_kill_disable 0 or 1
    (if 1, the oom-killer is disabled)
  - under_oom 0 or 1
    (if 1, the memory cgroup is under OOM, and tasks may be stopped.)

11. Memory Pressure
===================

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file caches,
etc. Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to run out of memory (OOM) or the in-kernel OOM killer is even on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take immediate action.

By default, events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, say you have three cgroups: A->B->C.
Now you set up an event listener on cgroups A, B and C, and suppose group C
experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B will receive
notification only if there are no event listeners for group C.

There are three optional modes that specify different propagation behavior:

 - "default": this is the default behavior specified above. This mode is the
   same as omitting the optional mode parameter, and is preserved for backward
   compatibility.

 - "hierarchy": events always propagate up to the root, similar to the default
   behavior, except that propagation continues regardless of whether there are
   event listeners at each level. In the above example, groups A, B, and C
   will receive notification of memory pressure.

 - "local": events are pass-through, i.e. listeners only receive notifications
   when memory pressure is experienced in the memcg for which the notification
   is registered. In the above example, group C will receive notification if
   registered for "local" notification and the group experiences memory
   pressure. However, group B will never receive notification, regardless of
   whether there is an event listener for group C, if group B is registered
   for local notification.

The level and event notification mode ("hierarchy" or "local", if necessary)
are specified by a comma-delimited string, i.e. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memcgs. Notification
that is the default, non pass-through behavior, does not specify a mode.
"medium,local" specifies pass-through notification for the medium level.
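The propagation rules above can be modeled in a few lines of Python. This is
only an illustrative model of the documented semantics, not kernel code, and
the behavior when different modes are mixed along one chain is a simplifying
assumption here::

.. code-block:: python

    def notified(chain):
        # 'chain' lists cgroups from the group under pressure up to the
        # root, as (name, has_listener, mode) tuples, with mode one of
        # "default", "hierarchy" or "local".  Returns the names of the
        # cgroups whose listeners receive the event.
        fired = []
        handled = False
        for name, has_listener, mode in chain:
            if not has_listener:
                continue
            if mode == "local":
                # pass-through: fires only on the group under pressure
                if name == chain[0][0]:
                    fired.append(name)
            elif mode == "hierarchy":
                # fires at every level that has a listener, up to the root
                fired.append(name)
                handled = True
            else:  # "default"
                # fires only if no listener below has handled the event
                if not handled:
                    fired.append(name)
                    handled = True
        return fired

With listeners on A, B and C in "default" mode and pressure in C,
``notified()`` returns only C; removing C's listener makes it return B,
matching the A->B->C example in the text.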

The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string like "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory pressure is
at the specific level (or higher). Read/write operations on
memory.pressure_level are not implemented.

Test:

   Here is a small script example that makes a new cgroup, sets up a
   memory limit, sets up a notification in the cgroup and then makes the
   child cgroup experience a critical pressure::

     # cd /sys/fs/cgroup/memory/
     # mkdir foo
     # cd foo
     # cgroup_event_listener memory.pressure_level low,hierarchy &
     # echo 8000000 > memory.limit_in_bytes
     # echo 8000000 > memory.memsw.limit_in_bytes
     # echo $$ > tasks
     # dd if=/dev/zero | read x

   (Expect a bunch of notifications, and eventually, the oom-killer will
   trigger.)

12. TODO
========

1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages
3. Start reclamation in the background when the limit is
   not yet hit but the usage is getting closer

Summary
=======

Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.

References
==========

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups,
   http://lkml.org/lkml/2007/3/6/198
4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
   http://lkml.org/lkml/2007/4/9/78
5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
   http://lkml.org/lkml/2007/5/30/244
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
   subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   http://lkml.org/lkml/2007/5/17/232
9. Singh, Balbir. RSS controller v2 AIM9 results,
   http://lkml.org/lkml/2007/5/18/1
10. Singh, Balbir. Memory controller v6 test results,
    http://lkml.org/lkml/2007/8/19/36
11. Singh, Balbir. Memory controller introduction (v6),
    http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan. Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/