.. _cpusets:

=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks. In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset. They form a nested
hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset. The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs. The location of the running jobs' pages may also be moved
when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets. It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel. No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.
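
As a rough illustration of the filtering listed above, the following
Python sketch models a cpuset as a set of CPU numbers and intersects a
requested affinity mask with it. The helper names parse_list and
filter_affinity are hypothetical, and treating an empty result as an
error is an assumption of this sketch rather than a description of the
kernel code.

```python
def parse_list(text):
    """Parse a list such as "0-3,8" into a set of integers."""
    result = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            result.update(range(int(first), int(last) + 1))
        else:
            result.add(int(part))
    return result


def filter_affinity(requested, cpuset_cpus):
    """Keep only the requested CPUs that the cpuset allows."""
    allowed = set(requested) & set(cpuset_cpus)
    if not allowed:
        # Assumption of this sketch: an empty result is treated as an error.
        raise ValueError("no requested CPU is allowed by the cpuset")
    return allowed


cpuset_cpus = parse_list("0-3,8")
print(sorted(filter_affinity({2, 3, 4, 5}, cpuset_cpus)))  # prints [2, 3]
```

The same intersection applies to Memory Nodes requested through mbind
and set_mempolicy.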

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to the cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset. A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.
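
The mkdir-and-write workflow described above can be sketched in Python.
This is a minimal illustration, not a management tool: create_cpuset and
read_attribute are hypothetical helpers, and the demonstration runs
against a scratch directory standing in for the mounted hierarchy. On a
real mount the kernel creates the cpuset files when the directory is
made, and writing them normally requires suitable permissions.

```python
import os
import tempfile


def create_cpuset(parent_dir, name, cpus, mems):
    """Create a child cpuset directory and set its CPUs and Memory Nodes."""
    path = os.path.join(parent_dir, name)
    os.mkdir(path)  # a new cpuset is created with mkdir
    for filename, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(path, filename), "w") as f:
            f.write(value)
    return path


def read_attribute(cpuset_dir, filename):
    """Return the stripped contents of one cpuset file."""
    with open(os.path.join(cpuset_dir, filename)) as f:
        return f.read().strip()


# Demonstration against a scratch directory standing in for the mount point.
with tempfile.TemporaryDirectory() as fake_root:
    job = create_cpuset(fake_root, "job1", "2-3", "1")
    print(read_attribute(job, "cpuset.cpus"))  # prints 2-3
```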

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.

The cpuset.effective_cpus and cpuset.effective_mems files are
normally read-only copies of cpuset.cpus and cpuset.mems files
respectively.
If the cpuset cgroup filesystem is mounted with the
special "cpuset_v2_mode" option, the behavior of these files will become
similar to the corresponding files in cpuset v2. In other words, hotplug
events will not change cpuset.cpus and cpuset.mems. Those events will
only affect cpuset.effective_cpus and cpuset.effective_mems which show
the actual cpus and memory nodes that are currently used by this cpuset.
See Documentation/admin-guide/cgroup-v2.rst for more information about
cpuset v2 behavior.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users. All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space. This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset. To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and
take action.
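
A batch manager's check could look roughly like the Python sketch below.
The helper names and the threshold are hypothetical, the policy of what
to do when the threshold is exceeded is left out, and the demonstration
uses a scratch directory in place of a real cpuset directory.

```python
import os
import tempfile


def read_memory_pressure(cpuset_dir):
    """Return the integer stored in a cpuset's memory_pressure file."""
    with open(os.path.join(cpuset_dir, "cpuset.memory_pressure")) as f:
        return int(f.read().strip())


def over_threshold(cpuset_dir, threshold):
    """True when the reported pressure exceeds a site-chosen threshold."""
    return read_memory_pressure(cpuset_dir) > threshold


# Demonstration with a scratch directory standing in for a cpuset.
with tempfile.TemporaryDirectory() as fake_cpuset:
    with open(os.path.join(fake_cpuset, "cpuset.memory_pressure"), "w") as f:
        f.write("2500\n")
    print(over_threshold(fake_cpuset, 1000))  # prints True
```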

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.
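
A filter of this kind can be pictured as an exponentially decaying event
counter. The Python model below is illustrative only: the class name,
the one-second tick and the floating-point arithmetic are assumptions of
this sketch, and it does not reproduce the kernel's own implementation.
The default half-life of 10 seconds and the scaling by 1000 match the
figures given in the paragraph that follows.

```python
class PressureMeter:
    """Toy model of a decaying rate meter (hypothetical, not kernel code)."""

    def __init__(self, half_life=10.0):
        # Per-second decay factor chosen so the value halves every half_life.
        self.decay = 0.5 ** (1.0 / half_life)
        self.value = 0.0    # filtered events per second
        self.pending = 0    # events recorded since the last update
        self.last_tick = 0  # time of the last update, in whole seconds

    def mark_event(self):
        """Record one direct reclaim attempt."""
        self.pending += 1

    def update(self, now):
        """Decay the value for the elapsed seconds, then fold in new events."""
        elapsed = now - self.last_tick
        if elapsed > 0:
            self.value *= self.decay ** elapsed
            self.last_tick = now
        self.value += (1.0 - self.decay) * self.pending
        self.pending = 0

    def read(self, now):
        """Return the filtered rate multiplied by 1000, as an integer."""
        self.update(now)
        return int(self.value * 1000)


meter = PressureMeter()
for second in range(1, 201):
    for _ in range(5):          # five reclaim attempts every second
        meter.mark_event()
    meter.update(second)
print(round(meter.value, 2))    # prints 5.0
```

With a steady event rate the filtered value settles at that rate, and
once events stop it halves every half_life seconds, which is why a
single read is enough to judge recent pressure.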

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures. They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.
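
The even spreading that these flags request for page cache and slab
pages can be pictured as a round-robin choice among the allowed nodes.
The Python sketch below is a hypothetical model, not the kernel routine;
NodeRotor and next_node are names invented for illustration.

```python
class NodeRotor:
    """Round-robin selector over a task's allowed Memory Nodes."""

    def __init__(self, mems_allowed):
        self.nodes = sorted(mems_allowed)
        self.position = 0  # index of the next node to hand out

    def next_node(self):
        """Return the next allowed node, wrapping around at the end."""
        node = self.nodes[self.position % len(self.nodes)]
        self.position += 1
        return node


rotor = NodeRotor({0, 2, 3})
print([rotor.next_node() for _ in range(5)])  # prints [0, 2, 3, 0, 2]
```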

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings. If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files. By default they contain "0", meaning that the feature is off
for that cpuset. If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.
The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple. It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit. Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks.
If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced. So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument. However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets. Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest. Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding. So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both. We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration. If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above. In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)
When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can. It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.
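
The idea of a pairwise disjoint partition can be illustrated with a small
shell sketch. This is not the kernel's code; it only shows, under simplifying
assumptions, how overlapping sets of CPUs collapse into disjoint groups.
Each argument is assumed to be an integer bitmask (bit N set means CPU N is a
member) for one cpuset that has load balancing enabled, and 'merge_partition'
is a hypothetical helper name chosen for this example:

```shell
#!/bin/sh
# Illustrative sketch only, not the kernel implementation.
# merge_partition MASK...: fold overlapping CPU bitmasks together until the
# remaining masks are pairwise disjoint, then print them space-separated.
merge_partition() {
    parts=""
    for mask in "$@"; do
        merged=$(($mask))
        kept=""
        for p in $parts; do
            if [ $((p & merged)) -ne 0 ]; then
                # Shares at least one CPU: both end up in one group.
                merged=$((p | merged))
            else
                # No CPU in common: this group stays separate.
                kept="$kept $p"
            fi
        done
        parts="$kept $merged"
    done
    # Re-split on whitespace to drop the leading space before printing.
    set -- $parts
    echo "$*"
}
```

For instance, masks 3 (CPUs 0-1) and 6 (CPUs 1-2) share CPU 1 and would be
merged into a single group 7 (CPUs 0-2), while mask 48 (CPUs 4-5) would remain
a group of its own.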

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
   and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and at the time of certain scheduling events.

When a task is woken up, the scheduler tries to move the task to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle,
then the scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes
idle.

Of course, finding movable tasks and/or idle CPUs has a search cost, so
the scheduler might not search all CPUs in the domain
every time. In fact, on some architectures, the search range on
events is limited to the socket or node where the CPU is located,
while the load balance on tick searches all.

For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
woken task B from X to Z since Z is out of its search range.
As a result, task B on CPU X needs to wait for task A or for load balancing
on the next tick. For some applications in special situations, waiting
1 tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request a change to
this search range. This file takes an int value which
ideally indicates the size of the search range in levels, as follows;
the initial value -1 indicates that the cpuset has no request.

======  ===========================================================
  -1    no request. use system default or follow request of others.
   0    no search.
   1    search siblings (hyperthreads in a core).
   2    search cores in a package.
   3    search cpus in a node [= system wide on non-NUMA system]
   4    search nodes in a chunk of node [on NUMA system]
   5    search system wide [on NUMA system]
======  ===========================================================

The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the cpuset
belongs. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets overlap and hence form a single sched
domain, the largest value among them is used. Be careful: if one
requests 0 and the others are -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.
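
The rule for overlapping cpusets (the largest requested value wins, and -1
means "no request") can be sketched in shell as follows. The function name
'effective_relax_level' is an assumption made for this example; the kernel
computes the value internally and exposes no such command:

```shell
#!/bin/sh
# Illustrative sketch only.
# effective_relax_level LEVEL...: print the level that would apply to a
# sched domain shared by cpusets requesting the given levels.
effective_relax_level() {
    best=-1                     # -1: nobody has made a request
    for level in "$@"; do
        if [ "$level" -gt "$best" ]; then
            best=$level
        fi
    done
    echo "$best"
}
```

In practice the inputs would be the contents of the
'cpuset.sched_relax_domain_level' file of each overlapping cpuset.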

If your situation is:

 - The migration costs between each cpu can be assumed considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU cache etc.
 - The searching cost doesn't have impact (for you) or you can make
   the searching cost small enough by managing cpusets to be compact etc.
 - Low latency is required even if it sacrifices cache hit rate etc.

then increasing 'sched_relax_domain_level' would benefit you.


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.
If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately. Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately. If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.
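
One way to observe the placement a task currently has is to read the
Cpus_allowed_list and Mems_allowed_list fields of its /proc/<pid>/status
file. The helper below is a minimal sketch; 'task_placement' is a name
invented for this example, and it takes the status file path as an argument
so it can be pointed at any task:

```shell
#!/bin/sh
# Illustrative sketch only.
# task_placement [STATUS_FILE]: print the allowed CPUs and Memory Nodes
# recorded in a /proc/<pid>/status file (default: the calling process).
task_placement() {
    status_file=${1:-/proc/self/status}
    # Lines look like "Cpus_allowed_list:<TAB>0-7"; split on the colon
    # and any blanks or tabs that follow it.
    awk -F':[ \t]*' '
        $1 == "Cpus_allowed_list" { print "cpus=" $2 }
        $1 == "Mems_allowed_list" { print "mems=" $2 }
    ' "$status_file"
}
```

Running 'task_placement /proc/PID/status' before and after writing PID to
another cpuset's 'tasks' file should show the CPU list change right away.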

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated on, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.

There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.
However, moving some (or all) of the tasks might fail if the
cpuset is bound with another cgroup subsystem which has some restrictions
on task attaching. In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs. When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well. In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it. It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are ways to query or modify cpusets:

 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (http://sourceforge.net/projects/libcg/)
 - via the python application cset.
   (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
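
The creation and attach steps can also be wrapped in small shell functions.
The following is a sketch under stated assumptions, not an official tool:
the variable 'CPUSET_ROOT' and the functions 'make_cpuset' and 'attach_task'
are names made up for this example, and the cpuset hierarchy is assumed to be
already mounted at the default path used throughout this document:

```shell
#!/bin/sh
# Illustrative sketch only. CPUSET_ROOT can be overridden, for example to
# try the helpers against a scratch directory instead of the real hierarchy.
CPUSET_ROOT=${CPUSET_ROOT:-/sys/fs/cgroup/cpuset}

# make_cpuset NAME CPUS MEMS: create the cpuset and assign its resources.
make_cpuset() {
    dir="$CPUSET_ROOT/$1"
    mkdir -p "$dir" || return 1
    # /bin/echo is used so that a failed write is reported as an error.
    /bin/echo "$2" > "$dir/cpuset.cpus" || return 1
    /bin/echo "$3" > "$dir/cpuset.mems" || return 1
}

# attach_task NAME PID: move one task into the cpuset (one pid per write).
attach_task() {
    /bin/echo "$2" > "$CPUSET_ROOT/$1/tasks"
}
```

With these definitions, 'make_cpuset Charlie 2-3 1' followed by
'attach_task Charlie $$' would correspond to the "Charlie" sequence above,
minus the mount and the final subshell.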

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, using the cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate
the cpuset.

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus         -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus     -> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus       -> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.
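
Because the whole list has to be rewritten each time, a small helper that
computes the new list can be convenient. The sketch below is only an
illustration; 'expand_cpulist' and 'cpulist_without' are hypothetical names,
and the input is assumed to be a well-formed list of numbers and ranges such
as "1-4,6":

```shell
#!/bin/sh
# Illustrative sketch only.
# expand_cpulist LIST: print every CPU number in LIST, space-separated.
expand_cpulist() {
    out=""
    for item in $(echo "$1" | tr ',' ' '); do
        case $item in
            *-*) first=${item%-*}; last=${item#*-} ;;   # a range "a-b"
            *)   first=$item;      last=$item ;;        # a single CPU
        esac
        n=$first
        while [ "$n" -le "$last" ]; do
            out="$out $n"
            n=$((n + 1))
        done
    done
    set -- $out
    echo "$*"
}

# cpulist_without LIST CPU: print LIST, comma-separated, with CPU removed.
cpulist_without() {
    result=""
    for cpu in $(expand_cpulist "$1"); do
        [ "$cpu" -eq "$2" ] && continue
        result="${result:+$result,}$cpu"
    done
    echo "$result"
}
```

For example, the output of 'cpulist_without 1-4,6 3' could be written to
'cpuset.cpus' with /bin/echo to drop CPU 3 from the cpuset.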

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus          -> clear cpus list

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive  -> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive  -> unset flag 'cpuset.cpu_exclusive'

2.4 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
        ...
  # /bin/echo PIDn > tasks


3. Questions
============

Q:
   what's up with this '/bin/echo' ?

A:
   bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first of the line gets really attached !

A:
   We can only return one error code per call to write().
   So you should also
   put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset