.. _cpusets:

=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

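The filtering described above amounts to intersecting the requested set
with the set held by the cpuset.  The following is a minimal user-space
sketch of that idea, not kernel code.  The function name filter_affinity
is hypothetical, and raising ValueError for an empty result is an
assumption made here to model a request that leaves no usable CPU.

```python
def filter_affinity(requested_cpus, cpuset_cpus):
    """Return the requested CPUs that the cpuset also allows.

    Hypothetical helper: models the filtering of an affinity request
    as a plain set intersection.
    """
    allowed = set(requested_cpus) & set(cpuset_cpus)
    if not allowed:
        # Assumption: a request with no usable CPU is rejected.
        raise ValueError("no requested CPU is allowed by the cpuset")
    return allowed


# A task in a cpuset holding CPUs 0-3 asks for CPUs 2, 3, 8 and 9.
print(sorted(filter_affinity({2, 3, 8, 9}, range(0, 4))))  # prints [2, 3]
```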
User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running job's pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

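The two formats carry the same information.  The sketch below decodes
both and checks that they agree for the example values above.  It
assumes the mask is a comma-separated sequence of 32-bit hexadecimal
words with the most significant word first, and that the list is a
comma-separated sequence of numbers and inclusive ranges.  The helper
names parse_mask and parse_list are hypothetical.

```python
def parse_mask(mask):
    """Decode a hex mask such as 'ffffffff,ffffffff' into a set of bit numbers.

    Assumes 32-bit words separated by commas, most significant word first,
    so removing the commas yields one large hexadecimal number.
    """
    value = int(mask.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text):
    """Decode a list such as '0-3,8' into a set of integers."""
    result = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        low, _, high = part.partition("-")
        result.update(range(int(low), int(high or low) + 1))
    return result


# The example values shown above describe the same 128 CPUs.
print(parse_mask("ffffffff,ffffffff,ffffffff,ffffffff") == parse_list("0-127"))  # prints True
```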
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag:  is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

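As an illustration of the mkdir-then-write pattern, here is a minimal
sketch.  The function create_cpuset is hypothetical, and the mount point
in the closing comment is taken from the /dev/cpuset convention used in
this document.  On a mounted cpuset hierarchy the kernel provides the
files, so only the writes matter there; running this needs suitable
permissions.  Pointed at an ordinary directory it merely creates regular
files, which is enough to see the sequence of operations.

```python
import os


def create_cpuset(parent_dir, name, cpus, mems):
    """Create a child cpuset directory and set its CPUs and Memory Nodes.

    Hypothetical helper: 'parent_dir' is the directory of the parent
    cpuset, 'cpus' and 'mems' are list-format strings such as "2-3".
    """
    path = os.path.join(parent_dir, name)
    os.mkdir(path)  # creating the directory creates the cpuset
    for filename, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(path, filename), "w") as f:
            f.write(value + "\n")
    return path


# Assumed usage on a system with the hierarchy mounted at /dev/cpuset:
#   create_cpuset("/dev/cpuset", "job1", "2-3", "1")
```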
The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its CPUs or Memory Nodes are exclusive, they may not overlap
   those of any sibling.

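The three rules can be expressed as a small checker.  This is an
illustrative model only: the Cpuset class and the violations function
are hypothetical names, and each cpuset is checked against its own
exclusive flags exactly as the rules above are worded.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    """Hypothetical in-memory model of one cpuset."""
    cpus: set
    mems: set
    cpu_exclusive: bool = False
    mem_exclusive: bool = False
    parent: object = None
    children: list = field(default_factory=list)


def violations(cs):
    """Return a list naming each rule that the cpuset 'cs' breaks."""
    problems = []
    parent = cs.parent
    if parent is None:  # the root cpuset has no parent to check against
        return problems
    # Rule 1: CPUs and Memory Nodes must be a subset of the parent's.
    if not (cs.cpus <= parent.cpus and cs.mems <= parent.mems):
        problems.append("not a subset of parent")
    # Rule 2: exclusive only if the parent is exclusive too.
    if (cs.cpu_exclusive and not parent.cpu_exclusive) or \
       (cs.mem_exclusive and not parent.mem_exclusive):
        problems.append("exclusive but parent is not")
    # Rule 3: exclusive resources may not overlap any sibling.
    for sibling in parent.children:
        if sibling is cs:
            continue
        if cs.cpu_exclusive and cs.cpus & sibling.cpus:
            problems.append("cpus overlap a sibling")
        if cs.mem_exclusive and cs.mems & sibling.mems:
            problems.append("mems overlap a sibling")
    return problems


root = Cpuset({0, 1, 2, 3}, {0, 1}, cpu_exclusive=True, mem_exclusive=True)
a = Cpuset({0, 1}, {0}, cpu_exclusive=True, parent=root)
b = Cpuset({1, 2}, {1}, parent=root)
root.children = [a, b]
print(violations(a))  # prints ['cpus overlap a sibling']
```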
These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.

The cpuset.effective_cpus and cpuset.effective_mems files are
normally read-only copies of cpuset.cpus and cpuset.mems files
respectively.  If the cpuset cgroup filesystem is mounted with the
special "cpuset_v2_mode" option, the behavior of these files will become
similar to the corresponding files in cpuset v2.  In other words, hotplug
events will not change cpuset.cpus and cpuset.mems.  Those events will
only affect cpuset.effective_cpus and cpuset.effective_mems which show
the actual cpus and memory nodes that are currently used by this cpuset.
See Documentation/admin-guide/cgroup-v2.rst for more information about
cpuset v2 behavior.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


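To make the "half-life of 10 seconds" and "times 1000" scaling concrete,
the following is a floating-point sketch of an exponentially decaying
rate meter.  It is not the kernel's filter, and the class PressureMeter
and its methods are hypothetical.  It only assumes that the value decays
by half every 10 seconds and that a steady rate of R reclaims per second
should read as R * 1000.

```python
class PressureMeter:
    """Hypothetical running-average meter with a 10 second half-life."""

    HALF_LIFE = 10.0   # seconds for the value to decay by half
    SCALE = 1000       # reported value is events per second times 1000

    def __init__(self):
        self.value = 0.0
        self.pending = 0   # events seen since the last one-second tick

    def mark(self):
        """Record one direct reclaim attempt."""
        self.pending += 1

    def tick(self, seconds=1):
        """Advance time by whole seconds, folding pending events in."""
        decay = 0.5 ** (1.0 / self.HALF_LIFE)
        for _ in range(seconds):
            self.value = self.value * decay + (1.0 - decay) * self.pending * self.SCALE
            self.pending = 0

    def read(self):
        """Return the meter as an integer, as a single read would."""
        return int(round(self.value))


meter = PressureMeter()
for _ in range(200):        # 5 reclaims per second for 200 seconds
    for _ in range(5):
        meter.mark()
    meter.tick()
print(meter.read())         # prints 5000
meter.tick(10)              # 10 quiet seconds halve the value
print(meter.read())         # prints 2500
```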
1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures.  They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

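The rotor behaviour can be modelled in a few lines.  This is a
user-space sketch of round-robin selection over a task's allowed nodes,
not the kernel routine.  The Task class, its rotor field and the
function mem_spread_node are hypothetical stand-ins for the per-task
state and for cpuset_mem_spread_node() described above.

```python
class Task:
    """Hypothetical model of the per-task state used for spreading."""

    def __init__(self, mems_allowed):
        self.mems_allowed = sorted(mems_allowed)
        self.rotor = -1   # stand-in for cpuset_mem_spread_rotor


def mem_spread_node(task):
    """Return the next allowed node after the rotor, wrapping around."""
    later = [node for node in task.mems_allowed if node > task.rotor]
    task.rotor = later[0] if later else task.mems_allowed[0]
    return task.rotor


task = Task({1, 3, 4})
print([mem_spread_node(task) for _ in range(5)])  # prints [1, 3, 4, 1, 3]
```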
This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument. However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.  We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.

*4882a593Smuzhiyun1.7.1 sched_load_balance implementation details.
*4882a593Smuzhiyun------------------------------------------------
*4882a593Smuzhiyun
The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (unlike
most cpuset flags).  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(that is, it makes sure that all the CPUs in the cpus_allowed of that
cpuset are in the same sched domain).
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
*4882a593Smuzhiyunthen they will be (must be) both in the same sched domain.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
*4882a593Smuzhiyunthen by the above that means there is a single sched domain covering
*4882a593Smuzhiyunthe whole system, regardless of any other cpuset settings.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe kernel commits to user space that it will avoid load balancing
*4882a593Smuzhiyunwhere it can.  It will pick as fine a granularity partition of sched
*4882a593Smuzhiyundomains as it can while still providing load balancing for any set
*4882a593Smuzhiyunof CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe internal kernel cpuset to scheduler interface passes from the
*4882a593Smuzhiyuncpuset code to the scheduler code a partition of the load balanced
*4882a593SmuzhiyunCPUs in the system. This partition is a set of subsets (represented
*4882a593Smuzhiyunas an array of struct cpumask) of CPUs, pairwise disjoint, that cover
*4882a593Smuzhiyunall the CPUs that must be load balanced.
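
The idea of a pairwise disjoint partition can be illustrated with a
small shell sketch.  This is not the kernel's algorithm; it only shows
that cpusets whose CPUs overlap end up in the same element of the
partition.  The function name partition_masks, and the representation of
each load balanced cpuset as a decimal bitmask (bit N set means CPU N),
are assumptions made for this example.

```shell
# partition_masks MASK...
# Hypothetical helper: merge overlapping CPU bitmasks until the
# remaining masks are pairwise disjoint, then print them.
partition_masks() {
    result=""
    for m in "$@"; do
        merged=$((m + 0))
        rest=""
        for r in $result; do
            if [ $((r & merged)) -ne 0 ]; then
                # r shares at least one CPU with the new mask: fold it in
                merged=$((r | merged))
            else
                rest="$rest $r"
            fi
        done
        result="$rest $merged"
    done
    echo $result
}
```

For instance, the masks 3 (CPUs 0-1), 6 (CPUs 1-2) and 48 (CPUs 4-5)
yield the two elements 7 (CPUs 0-2) and 48 (CPUs 4-5).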
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe cpuset code builds a new such partition and passes it to the
*4882a593Smuzhiyunscheduler sched domain setup code, to have the sched domains rebuilt
*4882a593Smuzhiyunas necessary, whenever:
*4882a593Smuzhiyun
*4882a593Smuzhiyun - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
*4882a593Smuzhiyun - or CPUs come or go from a cpuset with this flag enabled,
*4882a593Smuzhiyun - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
*4882a593Smuzhiyun   and with this flag enabled changes,
*4882a593Smuzhiyun - or a cpuset with non-empty CPUs and with this flag enabled is removed,
*4882a593Smuzhiyun - or a cpu is offlined/onlined.
*4882a593Smuzhiyun
This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe scheduler remembers the currently active sched domain partitions.
*4882a593SmuzhiyunWhen the scheduler routine partition_sched_domains() is invoked from
*4882a593Smuzhiyunthe cpuset code to update these sched domains, it compares the new
*4882a593Smuzhiyunpartition requested with the current, and updates its sched domains,
*4882a593Smuzhiyunremoving the old and adding the new, for each change.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun1.8 What is sched_relax_domain_level ?
*4882a593Smuzhiyun--------------------------------------
*4882a593Smuzhiyun
Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on tick, and at the time of certain scheduling events.
*4882a593Smuzhiyun
When a task is woken up, the scheduler tries to move the task to an idle
CPU.  For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle, then the
scheduler migrates task B to CPU Y so that task B can start on CPU Y
without waiting for task A on CPU X.
*4882a593Smuzhiyun
And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes idle.
*4882a593Smuzhiyun
Of course it takes some searching cost to find movable tasks and/or
idle CPUs, so the scheduler might not search all CPUs in the domain
every time.  In fact, on some architectures, the search range on
events is limited to the same socket or node where the CPU is located,
while the load balance on tick searches all.
*4882a593Smuzhiyun
For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
woken task B from X to Z since Z is outside its search range.
As a result, task B on CPU X needs to wait for task A or for load balancing
on the next tick.  For some applications in special situations, waiting
1 tick may be too long.
*4882a593Smuzhiyun
The 'cpuset.sched_relax_domain_level' file allows you to request a change
to this search range.  This file takes an int value which ideally
indicates the size of the search range in levels, as follows.  The
initial value -1 indicates that the cpuset has no request.
*4882a593Smuzhiyun
*4882a593Smuzhiyun====== ===========================================================
*4882a593Smuzhiyun  -1   no request. use system default or follow request of others.
*4882a593Smuzhiyun   0   no search.
*4882a593Smuzhiyun   1   search siblings (hyperthreads in a core).
*4882a593Smuzhiyun   2   search cores in a package.
*4882a593Smuzhiyun   3   search cpus in a node [= system wide on non-NUMA system]
*4882a593Smuzhiyun   4   search nodes in a chunk of node [on NUMA system]
*4882a593Smuzhiyun   5   search system wide [on NUMA system]
*4882a593Smuzhiyun====== ===========================================================
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe system default is architecture dependent.  The system default
*4882a593Smuzhiyuncan be changed using the relax_domain_level= boot parameter.
*4882a593Smuzhiyun
This file is per-cpuset and affects the sched domain that the cpuset
belongs to.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf multiple cpusets are overlapping and hence they form a single sched
domain, the largest value among those is used.  Be careful: if one
requests 0 and the others are -1, then 0 is used.
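
The "largest value wins" rule can be sketched in shell as follows.  The
function name effective_relax_level is hypothetical; it simply takes the
values requested by the overlapping cpusets and prints the largest one,
treating -1 as "no request".

```shell
# effective_relax_level LEVEL...
# Hypothetical helper: print the largest requested level among the
# overlapping cpusets, or -1 if none of them made a request.
effective_relax_level() {
    max=-1
    for level in "$@"; do
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    echo "$max"
}
```

Running "effective_relax_level -1 0 -1" prints 0, which is the case the
caution above describes: a single request of 0 overrides the cpusets
that made no request.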
*4882a593Smuzhiyun
*4882a593SmuzhiyunNote that modifying this file will have both good and bad effects,
*4882a593Smuzhiyunand whether it is acceptable or not depends on your situation.
*4882a593SmuzhiyunDon't modify this file if you are not sure.
*4882a593Smuzhiyun
If your situation is:

 - The migration costs between CPUs can be assumed to be considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU cache etc.
 - The searching cost has no impact (for you), or you can make
   the searching cost small enough by keeping cpusets compact etc.
 - Low latency is required even if it sacrifices cache hit rate etc.

then increasing 'cpuset.sched_relax_domain_level' would benefit you.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun1.9 How do I use cpusets ?
*4882a593Smuzhiyun--------------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunIn order to minimize the impact of cpusets on critical kernel
*4882a593Smuzhiyuncode, such as the scheduler, and due to the fact that the kernel
*4882a593Smuzhiyundoes not support one task updating the memory placement of another
*4882a593Smuzhiyuntask directly, the impact on a task of changing its cpuset CPU
*4882a593Smuzhiyunor Memory Node placement, or of changing to which cpuset a task
*4882a593Smuzhiyunis attached, is subtle.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf a cpuset has its Memory Nodes modified, then for each task attached
*4882a593Smuzhiyunto that cpuset, the next time that the kernel attempts to allocate
*4882a593Smuzhiyuna page of memory for that task, the kernel will notice the change
*4882a593Smuzhiyunin the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
*4882a593Smuzhiyunmempolicy MPOL_BIND, and the nodes to which it was bound overlap with
*4882a593Smuzhiyunits new cpuset, then the task will continue to use whatever subset
*4882a593Smuzhiyunof MPOL_BIND nodes are still allowed in the new cpuset.  If the task
*4882a593Smuzhiyunwas using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
*4882a593Smuzhiyunin the new cpuset, then the task will be essentially treated as if it
*4882a593Smuzhiyunwas MPOL_BIND bound to the new cpuset (even though its NUMA placement,
*4882a593Smuzhiyunas queried by get_mempolicy(), doesn't change).  If a task is moved
*4882a593Smuzhiyunfrom one cpuset to another, then the kernel will adjust the task's
*4882a593Smuzhiyunmemory placement, as above, the next time that the kernel attempts
*4882a593Smuzhiyunto allocate a page of memory for that task.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
*4882a593Smuzhiyunwill have its allowed CPU placement changed immediately.  Similarly,
*4882a593Smuzhiyunif a task's pid is written to another cpuset's 'tasks' file, then its
*4882a593Smuzhiyunallowed CPU placement is changed immediately.  If such a task had been
*4882a593Smuzhiyunbound to some subset of its cpuset using the sched_setaffinity() call,
*4882a593Smuzhiyunthe task will be allowed to run on any CPU allowed in its new cpuset,
*4882a593Smuzhiyunnegating the effect of the prior sched_setaffinity() call.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIn summary, the memory placement of a task whose cpuset is changed is
*4882a593Smuzhiyunupdated by the kernel, on the next allocation of a page for that task,
*4882a593Smuzhiyunand the processor placement is updated immediately.
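
One way to observe the resulting placement is to read the task's status
file.  The sketch below assumes that the kernel exposes the
Cpus_allowed_list and Mems_allowed_list fields in /proc/<pid>/status;
the function name task_placement is hypothetical, and the optional
argument exists only so that another task's status file can be given.

```shell
# task_placement [STATUS_FILE]
# Hypothetical helper: print the allowed CPU and Memory Node lists
# recorded in a task's status file (default: the calling process).
task_placement() {
    status_file=${1:-/proc/self/status}
    grep -E '^(Cpus|Mems)_allowed_list:' "$status_file"
}
```

For example, "task_placement /proc/1234/status" would show the lists for
the task with pid 1234, if such a task exists.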
*4882a593Smuzhiyun
Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated on, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
*4882a593SmuzhiyunIf the cpuset flag file 'cpuset.memory_migrate' is set true, then when
*4882a593Smuzhiyuntasks are attached to that cpuset, any pages that task had
*4882a593Smuzhiyunallocated to it on nodes in its previous cpuset are migrated
*4882a593Smuzhiyunto the task's new cpuset. The relative placement of the page within
*4882a593Smuzhiyunthe cpuset is preserved during these migration operations if possible.
*4882a593SmuzhiyunFor example if the page was on the second valid node of the prior cpuset
*4882a593Smuzhiyunthen the page will be placed on the second valid node of the new cpuset.
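
The relative placement rule can be sketched as a position lookup.  The
function name remap_node is hypothetical, and the wrap-around used when
the new cpuset has fewer nodes than the old one is an assumption of this
sketch, not a statement about the kernel's exact behavior.

```shell
# remap_node NODE "OLD_NODES" "NEW_NODES"
# Hypothetical helper: find the position of NODE among the old cpuset's
# nodes and print the node at the same position in the new cpuset.
remap_node() {
    node=$1
    idx=0
    pos=-1
    for n in $2; do
        if [ "$n" -eq "$node" ]; then
            pos=$idx
        fi
        idx=$((idx + 1))
    done
    if [ "$pos" -lt 0 ]; then
        # The page was not on a node of the old cpuset: leave it alone.
        echo "$node"
        return 0
    fi
    set -- $3
    count=$#
    [ "$count" -gt 0 ] || return 1
    # Assumption: wrap around when the new cpuset has fewer nodes.
    shift $((pos % count))
    echo "$1"
}
```

With old nodes "4 5 6" and new nodes "0 1 2", a page on node 5 (the
second node) maps to node 1 (the second node of the new cpuset).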
*4882a593Smuzhiyun
*4882a593SmuzhiyunAlso if 'cpuset.memory_migrate' is set true, then if that cpuset's
*4882a593Smuzhiyun'cpuset.mems' file is modified, pages allocated to tasks in that
*4882a593Smuzhiyuncpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
*4882a593SmuzhiyunPages that were not in the task's prior cpuset, or in the cpuset's
*4882a593Smuzhiyunprior 'cpuset.mems' setting, will not be moved.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThere is an exception to the above.  If hotplug functionality is used
*4882a593Smuzhiyunto remove all the CPUs that are currently assigned to a cpuset,
*4882a593Smuzhiyunthen all the tasks in that cpuset will be moved to the nearest ancestor
*4882a593Smuzhiyunwith non-empty cpus.  But the moving of some (or all) tasks might fail if
*4882a593Smuzhiyuncpuset is bound with another cgroup subsystem which has some restrictions
*4882a593Smuzhiyunon task attaching.  In this failing case, those tasks will stay
*4882a593Smuzhiyunin the original cpuset, and the kernel will automatically update
*4882a593Smuzhiyuntheir cpus_allowed to allow all online CPUs.  When memory hotplug
*4882a593Smuzhiyunfunctionality for removing Memory Nodes is available, a similar exception
*4882a593Smuzhiyunis expected to apply there as well.  In general, the kernel prefers to
*4882a593Smuzhiyunviolate cpuset placement, over starving a task that has had all
*4882a593Smuzhiyunits allowed CPUs or Memory Nodes taken offline.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThere is a second exception to the above.  GFP_ATOMIC requests are
*4882a593Smuzhiyunkernel internal allocations that must be satisfied, immediately.
*4882a593SmuzhiyunThe kernel may drop some request, in rare cases even panic, if a
*4882a593SmuzhiyunGFP_ATOMIC alloc fails.  If the request cannot be satisfied within
*4882a593Smuzhiyunthe current task's cpuset, then we relax the cpuset, and look for
*4882a593Smuzhiyunmemory anywhere we can find it.  It's better to violate the cpuset
*4882a593Smuzhiyunthan stress the kernel.
*4882a593Smuzhiyun
*4882a593SmuzhiyunTo start a new job that is to be contained within a cpuset, the steps are:
*4882a593Smuzhiyun
*4882a593Smuzhiyun 1) mkdir /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
*4882a593Smuzhiyun    the /sys/fs/cgroup/cpuset virtual file system.
*4882a593Smuzhiyun 4) Start a task that will be the "founding father" of the new job.
*4882a593Smuzhiyun 5) Attach that task to the new cpuset by writing its pid to the
*4882a593Smuzhiyun    /sys/fs/cgroup/cpuset tasks file for that cpuset.
*4882a593Smuzhiyun 6) fork, exec or clone the job tasks from this founding father task.
*4882a593Smuzhiyun
For example, the following sequence of commands will set up a cpuset
*4882a593Smuzhiyunnamed "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
*4882a593Smuzhiyunand then start a subshell 'sh' in that cpuset::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun  cd /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun  mkdir Charlie
*4882a593Smuzhiyun  cd Charlie
*4882a593Smuzhiyun  /bin/echo 2-3 > cpuset.cpus
*4882a593Smuzhiyun  /bin/echo 1 > cpuset.mems
*4882a593Smuzhiyun  /bin/echo $$ > tasks
*4882a593Smuzhiyun  sh
*4882a593Smuzhiyun  # The subshell 'sh' is now running in cpuset Charlie
*4882a593Smuzhiyun  # The next line should display '/Charlie'
*4882a593Smuzhiyun  cat /proc/self/cpuset
*4882a593Smuzhiyun
*4882a593SmuzhiyunThere are ways to query or modify cpusets:
*4882a593Smuzhiyun
*4882a593Smuzhiyun - via the cpuset file system directly, using the various cd, mkdir, echo,
*4882a593Smuzhiyun   cat, rmdir commands from the shell, or their equivalent from C.
*4882a593Smuzhiyun - via the C library libcpuset.
*4882a593Smuzhiyun - via the C library libcgroup.
*4882a593Smuzhiyun   (http://sourceforge.net/projects/libcg/)
*4882a593Smuzhiyun - via the python application cset.
*4882a593Smuzhiyun   (http://code.google.com/p/cpuset/)
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe sched_setaffinity calls can also be done at the shell prompt using
*4882a593SmuzhiyunSGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
*4882a593Smuzhiyuncalls can be done at the shell prompt using the numactl command
*4882a593Smuzhiyun(part of Andi Kleen's numa package).
*4882a593Smuzhiyun
*4882a593Smuzhiyun2. Usage Examples and Syntax
*4882a593Smuzhiyun============================
*4882a593Smuzhiyun
*4882a593Smuzhiyun2.1 Basic Usage
*4882a593Smuzhiyun---------------
*4882a593Smuzhiyun
Creating, modifying, and using cpusets can be done through the cpuset
virtual filesystem.
*4882a593Smuzhiyun
To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun
*4882a593SmuzhiyunThen under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
*4882a593Smuzhiyuntree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
*4882a593Smuzhiyunis the cpuset that holds the whole system.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf you want to create a new cpuset under /sys/fs/cgroup/cpuset::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # cd /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun  # mkdir my_cpuset
*4882a593Smuzhiyun
*4882a593SmuzhiyunNow you want to do something with this cpuset::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # cd my_cpuset
*4882a593Smuzhiyun
*4882a593SmuzhiyunIn this directory you can find several files::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # ls
*4882a593Smuzhiyun  cgroup.clone_children  cpuset.memory_pressure
*4882a593Smuzhiyun  cgroup.event_control   cpuset.memory_spread_page
*4882a593Smuzhiyun  cgroup.procs           cpuset.memory_spread_slab
*4882a593Smuzhiyun  cpuset.cpu_exclusive   cpuset.mems
*4882a593Smuzhiyun  cpuset.cpus            cpuset.sched_load_balance
*4882a593Smuzhiyun  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
*4882a593Smuzhiyun  cpuset.mem_hardwall    notify_on_release
*4882a593Smuzhiyun  cpuset.memory_migrate  tasks
*4882a593Smuzhiyun
*4882a593SmuzhiyunReading them will give you information about the state of this cpuset:
*4882a593Smuzhiyunthe CPUs and Memory Nodes it can use, the processes that are using
*4882a593Smuzhiyunit, its properties.  By writing to these files you can manipulate
*4882a593Smuzhiyunthe cpuset.
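
As a minimal sketch of reading that state in one go, the hypothetical
helper show_cpuset below prints every readable 'cpuset.*' file and the
'tasks' file of a cpuset directory, one "name: value" line per file.
The function name and output format are choices made for this example.

```shell
# show_cpuset DIR
# Hypothetical helper: print "file: contents" for each cpuset.* file
# and for the tasks file in DIR, joining multi-line contents with spaces.
show_cpuset() {
    dir=$1
    for f in "$dir"/cpuset.* "$dir"/tasks; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "${f##*/}" "$(paste -s -d ' ' "$f")"
    done
}
```

For example, "show_cpuset /sys/fs/cgroup/cpuset/my_cpuset" lists the
settings of the cpuset created above.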
*4882a593Smuzhiyun
*4882a593SmuzhiyunSet some flags::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo 1 > cpuset.cpu_exclusive
*4882a593Smuzhiyun
*4882a593SmuzhiyunAdd some cpus::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo 0-7 > cpuset.cpus
*4882a593Smuzhiyun
*4882a593SmuzhiyunAdd some mems::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo 0-7 > cpuset.mems
*4882a593Smuzhiyun
*4882a593SmuzhiyunNow attach your shell to this cpuset::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo $$ > tasks
*4882a593Smuzhiyun
*4882a593SmuzhiyunYou can also create cpusets inside your cpuset by using mkdir in this
*4882a593Smuzhiyundirectory::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # mkdir my_sub_cs
*4882a593Smuzhiyun
*4882a593SmuzhiyunTo remove a cpuset, just use rmdir::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # rmdir my_sub_cs
*4882a593Smuzhiyun
*4882a593SmuzhiyunThis will fail if the cpuset is in use (has cpusets inside, or has
*4882a593Smuzhiyunprocesses attached).
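
Whether rmdir is likely to succeed can be checked beforehand.  The
sketch below uses a hypothetical helper, cpuset_in_use, that applies the
two conditions just named: child cpusets (subdirectories) or a non-empty
'tasks' file.  It reads the file rather than testing its size, on the
assumption that the reported size of such virtual files is not
meaningful.

```shell
# cpuset_in_use DIR
# Hypothetical helper: succeed (status 0) if DIR has child cpusets or
# attached tasks, fail (non-zero status) otherwise.
cpuset_in_use() {
    dir=$1
    for child in "$dir"/*/; do
        [ -d "$child" ] && return 0
    done
    # Any non-empty line in tasks means a task is attached.
    grep -q . "$dir/tasks" 2>/dev/null
}
```

A possible use is "cpuset_in_use my_sub_cs || rmdir my_sub_cs".  The
check is only advisory, since a task may be attached between the check
and the rmdir.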
*4882a593Smuzhiyun
*4882a593SmuzhiyunNote that for legacy reasons, the "cpuset" filesystem exists as a
*4882a593Smuzhiyunwrapper around the cgroup filesystem.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe command::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  mount -t cpuset X /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun
*4882a593Smuzhiyunis equivalent to::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
*4882a593Smuzhiyun  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
*4882a593Smuzhiyun
*4882a593Smuzhiyun2.2 Adding/removing cpus
*4882a593Smuzhiyun------------------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunThis is the syntax to use when writing in the cpus or mems files
*4882a593Smuzhiyunin cpuset directories::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
*4882a593Smuzhiyun  # /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4
*4882a593Smuzhiyun
*4882a593SmuzhiyunTo add a CPU to a cpuset, write the new list of CPUs including the
*4882a593SmuzhiyunCPU to be added. To add 6 to the above cpuset::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6
*4882a593Smuzhiyun
*4882a593SmuzhiyunSimilarly to remove a CPU from a cpuset, write the new list of CPUs
*4882a593Smuzhiyunwithout the CPU to be removed.
*4882a593Smuzhiyun
*4882a593SmuzhiyunTo remove all the CPUs::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo "" > cpuset.cpus		-> clear cpus list
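
Because the whole list is rewritten on every change, it can help to
expand a list into individual CPU numbers before editing it.  The
hypothetical helper expand_cpu_list below handles only the plain syntax
shown above (comma separated numbers and N-M ranges); any other list
syntax is outside the scope of this sketch.

```shell
# expand_cpu_list LIST
# Hypothetical helper: expand a list such as "1-4,6" into the
# individual CPU numbers, printed on one line separated by spaces.
expand_cpu_list() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        cpu=$first
        # A single number has no "last" part: treat it as a range of one.
        while [ "$cpu" -le "${last:-$first}" ]; do
            printf '%s\n' "$cpu"
            cpu=$((cpu + 1))
        done
    done | paste -s -d ' ' -
}
```

For example, "expand_cpu_list 1-4,6" prints "1 2 3 4 6".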
*4882a593Smuzhiyun
*4882a593Smuzhiyun2.3 Setting flags
*4882a593Smuzhiyun-----------------
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe syntax is very simple::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo 1 > cpuset.cpu_exclusive 	-> set flag 'cpuset.cpu_exclusive'
*4882a593Smuzhiyun  # /bin/echo 0 > cpuset.cpu_exclusive 	-> unset flag 'cpuset.cpu_exclusive'
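
A small wrapper can reject values other than 0 and 1 before they reach
the flag file.  The helper name set_cpuset_flag and its argument order
are assumptions made for this sketch; it uses /bin/echo as in the
examples above.

```shell
# set_cpuset_flag DIR FLAG VALUE
# Hypothetical helper: write 0 or 1 to DIR/cpuset.FLAG, refusing any
# other value, and return the exit status of the write.
set_cpuset_flag() {
    dir=$1
    flag=$2
    value=$3
    case $value in
        0|1) ;;
        *) echo "flag value must be 0 or 1" >&2; return 2 ;;
    esac
    /bin/echo "$value" > "$dir/cpuset.$flag"
}
```

For example, "set_cpuset_flag /sys/fs/cgroup/cpuset/my_cpuset
cpu_exclusive 1" sets the flag 'cpuset.cpu_exclusive' on that cpuset.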
*4882a593Smuzhiyun
*4882a593Smuzhiyun2.4 Attaching processes
*4882a593Smuzhiyun-----------------------
*4882a593Smuzhiyun
*4882a593Smuzhiyun::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo PID > tasks
*4882a593Smuzhiyun
*4882a593SmuzhiyunNote that it is PID, not PIDs. You can only attach ONE task at a time.
*4882a593SmuzhiyunIf you have several tasks to attach, you have to do it one after another::
*4882a593Smuzhiyun
*4882a593Smuzhiyun  # /bin/echo PID1 > tasks
*4882a593Smuzhiyun  # /bin/echo PID2 > tasks
*4882a593Smuzhiyun	...
*4882a593Smuzhiyun  # /bin/echo PIDn > tasks
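
The one-pid-per-write rule lends itself to a loop.  The hypothetical
helper attach_pids below issues a separate write for each pid and
reports any pid that could not be attached; its name and error message
are choices made for this sketch.

```shell
# attach_pids TASKS_FILE PID...
# Hypothetical helper: write each PID to TASKS_FILE with its own
# write, and fail if any of the writes failed.
attach_pids() {
    tasks_file=$1
    shift
    failed=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$tasks_file"; then
            echo "failed to attach $pid" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

For example, "attach_pids tasks PID1 PID2 PID3", run from inside the
cpuset directory, performs the same sequence as the commands above.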
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun3. Questions
*4882a593Smuzhiyun============
*4882a593Smuzhiyun
*4882a593SmuzhiyunQ:
*4882a593Smuzhiyun   what's up with this '/bin/echo' ?
*4882a593Smuzhiyun
*4882a593SmuzhiyunA:
*4882a593Smuzhiyun   bash's builtin 'echo' command does not check calls to write() against
*4882a593Smuzhiyun   errors. If you use it in the cpuset file system, you won't be
*4882a593Smuzhiyun   able to tell whether a command succeeded or failed.
*4882a593Smuzhiyun
*4882a593SmuzhiyunQ:
   When I attach processes, only the first one on the line really gets attached!
*4882a593Smuzhiyun
*4882a593SmuzhiyunA:
   We can only return one error code per call to write(). So you should
   put only ONE pid in each write.
*4882a593Smuzhiyun
*4882a593Smuzhiyun4. Contact
*4882a593Smuzhiyun==========
*4882a593Smuzhiyun
*4882a593SmuzhiyunWeb: http://www.bullopensource.org/cpuset