.. _cpusets:

=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

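The filtering described above can be pictured as a set intersection.
The following is a minimal user-space sketch in Python, not kernel code.
The function name filter_affinity is hypothetical, and rejecting an empty
result is an assumption modelled on the EINVAL case documented in
sched_setaffinity(2).

```python
def filter_affinity(requested, cpuset_cpus):
    """Return the CPUs from `requested` that the task's cpuset allows."""
    effective = set(requested) & set(cpuset_cpus)
    if not effective:
        # Assumption: a request that leaves no usable CPU is rejected
        # instead of being applied as an empty mask.
        raise ValueError("no requested CPU is allowed by the cpuset")
    return effective


# A task in a cpuset holding CPUs 0-3 asks for CPUs 0, 1, 8 and 9.
print(sorted(filter_affinity({0, 1, 8, 9}, range(0, 4))))
```

The same intersection applies to Memory Nodes requested through mbind(2)
and set_mempolicy(2).
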
User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

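To illustrate how the two formats relate, here is a small Python sketch
that decodes both into the same set of ids.  The helper names parse_mask
and parse_list are made up for this example.  The sketch assumes the mask
is printed as comma-separated 32-bit hexadecimal words, most significant
word first, as in the sample above.

```python
def parse_mask(text):
    """Decode the hex mask format (Cpus_allowed, Mems_allowed)."""
    # Dropping the commas yields one large hexadecimal number whose
    # bit N is set when CPU (or node) N is allowed.
    value = int(text.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text):
    """Decode the range-list format (Cpus_allowed_list, Mems_allowed_list)."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A bare number such as "7" is a range of length one.
        ids.update(range(int(first), int(last or first) + 1))
    return ids


cpus = parse_mask("ffffffff,ffffffff,ffffffff,ffffffff")
print(cpus == parse_list("0-127"))
```

Both helpers operate on the value after the colon, so a caller reading
/proc/<pid>/status would first split each line on ":".
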
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

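The three rules can be expressed as a short validation routine.  The
Python sketch below only models CPUs (Memory Nodes would follow the same
pattern) and uses plain dictionaries.  The function name check_child and
the dictionary layout are assumptions made for illustration and do not
mirror the kernel's data structures.

```python
def check_child(parent, siblings, child):
    """Return the list of rules a proposed child cpuset would break.

    Every cpuset is a dict with the keys "cpus" (a set of CPU numbers)
    and "cpu_exclusive" (a bool).
    """
    problems = []
    # Rule 1: resources must be a subset of the parent's.
    if not child["cpus"] <= parent["cpus"]:
        problems.append("CPUs are not a subset of the parent's")
    # Rule 2: exclusivity requires an exclusive parent.
    if child["cpu_exclusive"] and not parent["cpu_exclusive"]:
        problems.append("cannot be exclusive unless the parent is")
    # Rule 3: an exclusive cpuset may not share CPUs with a sibling.
    for sibling in siblings:
        exclusive = child["cpu_exclusive"] or sibling["cpu_exclusive"]
        if exclusive and child["cpus"] & sibling["cpus"]:
            problems.append("overlaps a sibling while one of them is exclusive")
            break
    return problems


parent = {"cpus": set(range(8)), "cpu_exclusive": True}
sibling = {"cpus": {0, 1, 2, 3}, "cpu_exclusive": True}
print(check_child(parent, [sibling], {"cpus": {3, 4}, "cpu_exclusive": False}))
```

Because only the parent and the siblings are inspected, the check never
has to walk the whole hierarchy.
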
These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.

The cpuset.effective_cpus and cpuset.effective_mems files are
normally read-only copies of cpuset.cpus and cpuset.mems files
respectively.  If the cpuset cgroup filesystem is mounted with the
special "cpuset_v2_mode" option, the behavior of these files will become
similar to the corresponding files in cpuset v2.  In other words, hotplug
events will not change cpuset.cpus and cpuset.mems.  Those events will
only affect cpuset.effective_cpus and cpuset.effective_mems which show
the actual cpus and memory nodes that are currently used by this cpuset.
See Documentation/admin-guide/cgroup-v2.rst for more information about
cpuset v2 behavior.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------

The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


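The behaviour of such a running average can be sketched in a few lines of
Python.  This is a floating point illustration of an exponentially
decaying rate with a 10 second half-life, not the kernel's filter, which
this document only describes as a simple digital filter.  The class name
PressureMeter and its methods are hypothetical.

```python
import math


class PressureMeter:
    """Decaying estimate of direct reclaim attempts per second."""

    HALF_LIFE = 10.0  # seconds, the half-life quoted in the text

    def __init__(self, now=0.0):
        self.rate = 0.0  # current estimate, in reclaims per second
        self.time = now  # time of the last update

    def _decay(self, now):
        elapsed = now - self.time
        if elapsed > 0:
            # The estimate halves for every HALF_LIFE seconds that pass.
            self.rate *= 0.5 ** (elapsed / self.HALF_LIFE)
            self.time = now

    def record_reclaim(self, now):
        """Account for one direct reclaim attempt at time `now`."""
        self._decay(now)
        # This weight makes a steady stream of events converge on its rate.
        self.rate += math.log(2) / self.HALF_LIFE

    def read(self, now):
        """Return the estimate times 1000, as an integer."""
        self._decay(now)
        return int(self.rate * 1000)


meter = PressureMeter()
for i in range(1000):  # five reclaims per second for 200 seconds
    meter.record_reclaim(i * 0.2)
print(meter.read(999 * 0.2))
```

A monitor needs only the single read() call, which is the property the
text attributes to a running average as opposed to an accumulating
counter.
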
1.6 What is memory spread ?
---------------------------

There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures.  They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

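The rotor selection can be sketched as follows.  This Python fragment
only illustrates the round-robin idea and is not the kernel routine.  The
function name next_spread_node is hypothetical, and the starting value of
-1 for the rotor is an assumption chosen so that the first pick is the
lowest allowed node.

```python
def next_spread_node(rotor, mems_allowed):
    """Return the first allowed node after `rotor`, wrapping around."""
    nodes = sorted(mems_allowed)
    if not nodes:
        raise ValueError("mems_allowed is empty")
    for node in nodes:
        if node > rotor:
            return node
    # Past the highest allowed node: start again from the lowest one.
    return nodes[0]


rotor = -1
picks = []
for _ in range(5):
    rotor = next_spread_node(rotor, {0, 2, 3})
    picks.append(rotor)
print(picks)
```

Since each task carries its own rotor, successive allocations by one task
cycle through its allowed nodes regardless of what other tasks do.
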
This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument. However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.  We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

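The way overlapping cpusets collapse into flat sched domains can be
sketched as a merge of overlapping CPU sets.  The Python code below is an
illustration of that idea only.  The function name sched_domains and the
input format are assumptions, and the sketch ignores isolated CPUs and
every other detail of the kernel's partitioning code.

```python
def sched_domains(cpusets):
    """Merge the CPUs of load-balanced cpusets into disjoint domains.

    `cpusets` is a list of (cpus, load_balance) pairs, where `cpus` is a
    set of CPU numbers and `load_balance` is the flag of that cpuset.
    """
    domains = []
    for cpus, load_balance in cpusets:
        if not load_balance or not cpus:
            continue  # contributes no balancing request of its own
        merged = set(cpus)
        untouched = []
        for domain in domains:
            if domain & merged:
                merged |= domain  # overlap: fold into one larger domain
            else:
                untouched.append(domain)
        domains = untouched + [merged]
    return sorted(domains, key=min)


layout = [
    (set(range(10)), False),  # top cpuset, flag disabled
    ({0, 1, 2, 3}, True),
    ({2, 3, 4, 5}, True),     # partially overlaps the previous cpuset
    ({6, 7}, True),
    ({8, 9}, False),          # never balanced: no enabled cpuset covers it
]
print(sched_domains(layout))
```

With the flag enabled on the top cpuset instead, the same function
returns a single domain holding every CPU, which matches the top cpuset
example given earlier.
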
477*4882a593SmuzhiyunIf two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
478*4882a593Smuzhiyunone of them has this flag enabled, then the other may find its
479*4882a593Smuzhiyuntasks only partially load balanced, just on the overlapping CPUs.
480*4882a593SmuzhiyunThis is just the general case of the top_cpuset example given a few
481*4882a593Smuzhiyunparagraphs above.  In the general case, as in the top cpuset case,
482*4882a593Smuzhiyundon't leave tasks that might use non-trivial amounts of CPU in
483*4882a593Smuzhiyunsuch partially load balanced cpusets, as they may be artificially
484*4882a593Smuzhiyunconstrained to some subset of the CPUs allowed to them, for lack of
485*4882a593Smuzhiyunload balancing to the other CPUs.
486*4882a593Smuzhiyun
487*4882a593SmuzhiyunCPUs in "cpuset.isolcpus" were excluded from load balancing by the
488*4882a593Smuzhiyunisolcpus= kernel boot option, and will never be load balanced regardless
489*4882a593Smuzhiyunof the value of "cpuset.sched_load_balance" in any cpuset.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.  It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
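
As a rough sketch of how user space might ask for finer-grained sched
domains, the helper below turns off load balancing in the top cpuset and
creates two child cpusets that keep it enabled.  The function name
'partition_domains', the child names 'a' and 'b', and the CPU lists in the
usage comment are illustrative assumptions, not part of the kernel
interface.  Whether two separate sched domains actually result depends on
the two CPU lists not overlapping each other or any other load balanced
cpuset.

```shell
# partition_domains ROOT CPUS_A CPUS_B
# ROOT is the mount point of the cpuset hierarchy (this document uses
# /sys/fs/cgroup/cpuset).  Child names "a" and "b" are hypothetical.
partition_domains() {
    pd_root=$1
    # Stop balancing across all CPUs at the top of the hierarchy.
    /bin/echo 0 > "$pd_root/cpuset.sched_load_balance" || return 1
    for pd_spec in "a:$2" "b:$3"; do
        pd_name=${pd_spec%%:*}
        pd_cpus=${pd_spec#*:}
        mkdir -p "$pd_root/$pd_name" || return 1
        /bin/echo "$pd_cpus" > "$pd_root/$pd_name/cpuset.cpus" || return 1
        # Child cpusets default to enabled; written only to be explicit.
        /bin/echo 1 > "$pd_root/$pd_name/cpuset.sched_load_balance" || return 1
    done
}

# Example (as root, with the hierarchy mounted):
#   partition_domains /sys/fs/cgroup/cpuset 0-3 4-7
```

Tasks left in the top cpuset after this are no longer load balanced, so
CPU-hungry tasks should be moved into one of the children.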

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.
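
The idea of a pairwise disjoint partition can be illustrated with a toy
model in shell.  This is not the kernel's code; it only mimics the rule
that overlapping load balanced cpusets end up in one sched domain.  Each
cpuset is given as a decimal CPU bitmask (bit N set means CPU N), and the
function name 'merge_domains' is an assumption made for this sketch.

```shell
# merge_domains MASK...
# Each MASK is a decimal CPU bitmask (bit N set = CPU N).
# Prints the merged, pairwise disjoint masks in decimal on one line.
merge_domains() {
    md_result=""
    for md_mask in "$@"; do
        md_merged=$md_mask
        md_rest=""
        for md_dom in $md_result; do
            if [ $(( md_dom & md_merged )) -ne 0 ]; then
                # Overlap: fold the existing domain into the new one.
                md_merged=$(( md_dom | md_merged ))
            else
                md_rest="$md_rest $md_dom"
            fi
        done
        md_result="$md_rest $md_merged"
    done
    # Unquoted on purpose, to collapse the leading space.
    echo $md_result
}

# CPUs 0-1 (mask 3) and CPUs 1-2 (mask 6) overlap; CPUs 4-5 (mask 48)
# stand alone.
merge_domains 3 6 48    # prints: 7 48
```

Because the domains collected so far are always disjoint from each other,
a single pass over them is enough for each new mask.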

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or the 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty
   CPUs and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a CPU is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
set up: one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current one, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

In a sched domain, the scheduler migrates tasks in two ways: periodic load
balancing on tick, and at the time of certain scheduling events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle,
then the scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course, finding movable tasks and/or idle CPUs has a search cost, so
the scheduler might not search all CPUs in the domain every time.  In
fact, on some architectures, the search range on events is limited to
the socket or node where the CPU is located, while the load balance on
tick searches all.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
woken task B from X to Z since Z is outside its search range.
As a result, task B on CPU X needs to wait for task A or for the load
balance on the next tick.  For some applications in special situations,
waiting one tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request a change
to this search range.  The file takes an int value that ideally indicates
the size of the search range in levels, as follows; the initial value -1
indicates that the cpuset has no request.

====== ===========================================================
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
   2   search cores in a package.
   3   search cpus in a node [= system wide on non-NUMA system]
   4   search nodes in a chunk of node [on NUMA system]
   5   search system wide [on NUMA system]
====== ===========================================================

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the cpuset
belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets overlap and hence form a single sched domain, the
largest value among them is used.  Be careful: if one requests 0 and
the others are -1, then 0 is used.
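
The "largest value wins" rule can be modelled with a few lines of shell.
This is only a sketch of the rule as stated above, not kernel code, and
the function name 'effective_relax_level' is made up for illustration.

```shell
# effective_relax_level LEVEL...
# Prints the largest of the given levels; -1 means "no request".
effective_relax_level() {
    erl_max=-1
    for erl_level in "$@"; do
        if [ "$erl_level" -gt "$erl_max" ]; then
            erl_max=$erl_level
        fi
    done
    printf '%s\n' "$erl_max"
}

effective_relax_level -1 0 -1    # prints: 0
effective_relax_level 2 -1 4     # prints: 4
```

The first call shows the case warned about above: a single cpuset
requesting 0 overrides the others' "no request".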

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.

If your situation is:

 - The migration costs between CPUs can be assumed to be considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU cache etc.
 - The search cost has no impact (for you), or you can make the
   search cost small enough by keeping cpusets compact etc.
 - Low latency is required even if it sacrifices cache hit rate etc.

then increasing 'sched_relax_domain_level' would benefit you.


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately.  Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately.  If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.
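
One way to watch these updates from user space is to read the
Cpus_allowed_list and Mems_allowed_list fields of a task's
/proc/<pid>/status file before and after changing its cpuset.  The helper
name 'show_placement' below is an assumption for this sketch; it takes an
optional status file path so that it can be pointed at any task.

```shell
# show_placement [STATUS_FILE]
# Prints the Cpus_allowed_list and Mems_allowed_list lines of a task.
# STATUS_FILE defaults to the status file of the current shell.
show_placement() {
    grep -E '^(Cpus|Mems)_allowed_list:' "${1:-/proc/$$/status}"
}

# Example: inspect another task by pid (1234 is a placeholder):
#   show_placement /proc/1234/status
```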

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated on, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
a task is attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.
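
A minimal sketch of the second case, assuming a hypothetical helper named
'move_mems': it enables 'cpuset.memory_migrate' first and only then
rewrites 'cpuset.mems', so that already allocated pages are candidates
for migration to the new nodes.

```shell
# move_mems CPUSET_DIR NEW_MEMS
# CPUSET_DIR is the directory of an existing cpuset; NEW_MEMS is a
# Memory Node list such as "1" or "0-1".
move_mems() {
    # Enable page migration before changing the node list.
    /bin/echo 1 > "$1/cpuset.memory_migrate" || return 1
    /bin/echo "$2" > "$1/cpuset.mems"
}

# Example (as root; "Charlie" is the cpuset used later in this document):
#   move_mems /sys/fs/cgroup/cpuset/Charlie 0
```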

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail if
cpuset is bound with another cgroup subsystem which has some restrictions
on task attaching.  In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs.  When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well.  In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some requests, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are ways to query or modify cpusets:

 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (http://sourceforge.net/projects/libcg/)
 - via the python application cset.
   (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties.  By writing to these files you can manipulate
the cpuset.
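
To read all of the 'cpuset.*' files of one cpuset in a single step, a
small loop is enough.  The helper name 'dump_cpuset' is an assumption for
this sketch; it prints each file name followed by its current contents.

```shell
# dump_cpuset CPUSET_DIR
# Prints "name: value" for every cpuset.* file in the directory.
dump_cpuset() {
    for dc_file in "$1"/cpuset.*; do
        [ -f "$dc_file" ] || continue
        printf '%s: %s\n' "${dc_file##*/}" "$(cat "$dc_file" 2>/dev/null)"
    done
}

# Example:
#   dump_cpuset /sys/fs/cgroup/cpuset/my_cpuset
```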

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).
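
Before calling rmdir it can help to check both conditions explicitly.
The sketch below uses a hypothetical helper name, 'cpuset_in_use'; it
reads the 'tasks' file rather than testing its size, on the assumption
that files in the cgroup file system do not report a meaningful size.

```shell
# cpuset_in_use CPUSET_DIR
# Succeeds (exit status 0) if the cpuset has child cpusets or attached
# tasks, and fails (non-zero) if it looks safe to remove.
cpuset_in_use() {
    for ciu_child in "$1"/*/; do
        [ -d "$ciu_child" ] && return 0
    done
    [ -n "$(cat "$1/tasks" 2>/dev/null)" ]
}

# Example:
#   cpuset_in_use my_sub_cs || rmdir my_sub_cs
```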

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus        -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus    -> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus      -> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus         -> clear cpus list
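
Since the whole list is rewritten on every change, adding one CPU means
reading the current list and writing it back with the new CPU appended.
A minimal sketch, assuming a hypothetical helper named 'add_cpu':

```shell
# add_cpu CPUSET_DIR CPU
# Appends CPU to the existing cpuset.cpus list of the given cpuset.
add_cpu() {
    ac_cur=$(cat "$1/cpuset.cpus") || return 1
    if [ -n "$ac_cur" ]; then
        /bin/echo "$ac_cur,$2" > "$1/cpuset.cpus"
    else
        # The list was empty, so the new CPU is the whole list.
        /bin/echo "$2" > "$1/cpuset.cpus"
    fi
}

# Example: with cpuset.cpus containing 1-4, this writes 1-4,6
#   add_cpu /sys/fs/cgroup/cpuset/my_cpuset 6
```

Removing a CPU has no equally simple form, because the list has to be
rebuilt without that CPU.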

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive    -> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive    -> unset flag 'cpuset.cpu_exclusive'

2.4 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
  ...
  # /bin/echo PIDn > tasks
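
The one-at-a-time rule is easy to wrap in a loop.  The sketch below
assumes a hypothetical helper named 'attach_all'; it writes each PID in
its own write and reports which attachments failed, since every write
returns its own error code.

```shell
# attach_all CPUSET_DIR PID...
# Writes one PID per write to the cpuset's tasks file.
# Returns non-zero if any attachment failed.
attach_all() {
    aa_dir=$1
    shift
    aa_failed=0
    for aa_pid in "$@"; do
        if /bin/echo "$aa_pid" > "$aa_dir/tasks"; then
            echo "attached $aa_pid"
        else
            echo "failed $aa_pid" >&2
            aa_failed=1
        fi
    done
    return "$aa_failed"
}

# Example (PIDs are placeholders):
#   attach_all /sys/fs/cgroup/cpuset/my_cpuset 1234 1235 1236
```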


3. Questions
============

Q:
   What's up with this '/bin/echo'?

A:
   bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first one on the line really gets
   attached!

A:
   We can only return one error code per call to write(). So you should also
   put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset