.. _numa_memory_policy:

==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
===========================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``),
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
    this policy is "hard coded" into the kernel. It is the policy
    that governs all page allocations that aren't controlled by
    one of the more specific policy scopes discussed below. When
    the system is "up and running", the system default policy will
    use "local allocation" described below. However, during boot
    up, the system default policy will be set to interleave
    allocations across all nodes with "sufficient" memory, so as
    not to overload the initial boot node with boot-time
    allocations.

Task/Process Policy
    this is an optional, per-task policy. When defined for a
    specific task, this policy controls all page allocations made
    by or on behalf of the task that aren't controlled by a more
    specific scope. If a task does not define a task policy, then
    all page allocations that would have been controlled by the
    task policy "fall back" to the System Default Policy.

    The task policy applies to the entire address space of a task. Thus,
    it is inheritable, and indeed is inherited, across both fork()
    [clone() w/o the CLONE_VM flag] and exec*().
    This allows a parent task
    to establish the task policy for a child task exec()'d from an
    executable image that has no awareness of memory policy. See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the system call
    that a task may use to set/change its task/process policy.

    In a multi-threaded task, task policies apply only to the thread
    [Linux kernel task] that installs the policy and any threads
    subsequently created by that thread. Any sibling threads existing
    at the time a new task policy is installed retain their current
    policy.

    A task policy applies only to pages allocated after the policy is
    installed. Any pages already faulted in by the task when the task
    changes its task policy remain where they were allocated based on
    the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
    A "VMA" or "Virtual Memory Area" refers to a range of a task's
    virtual address space. A task may define a specific policy for a range
    of its virtual address space. See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the mbind() system call used to set a VMA
    policy.

    A VMA policy will govern the allocation of pages that back
    this region of the address space.
    Any regions of the task's
    address space that don't have an explicit VMA policy will fall
    back to the task policy, which may itself fall back to the
    System Default Policy.

    VMA policies have a few complicating details:

    * VMA policy applies ONLY to anonymous pages. These include
      pages allocated for anonymous segments, such as the task
      stack and heap, and any regions of the address space
      mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
      applied to a file mapping, it will be ignored if the mapping
      used the MAP_SHARED flag. If the file mapping used the
      MAP_PRIVATE flag, the VMA policy will only be applied when
      an anonymous page is allocated on an attempt to write to the
      mapping--i.e., at Copy-On-Write.

    * VMA policies are shared between all tasks that share a
      virtual address space--a.k.a. threads--independent of when
      the policy is installed; and they are inherited across
      fork(). However, because VMA policies refer to a specific
      region of a task's address space, and because the address
      space is discarded and recreated on exec*(), VMA policies
      are NOT inheritable across exec(). Thus, only NUMA-aware
      applications may use VMA policies.

    * A task may install a new VMA policy on a sub-range of a
      previously mmap()ed region. When this happens, Linux splits
      the existing virtual memory area into 2 or 3 VMAs, each with
      its own policy.

    * By default, VMA policy applies only to pages allocated after
      the policy is installed. Any pages already faulted into the
      VMA range remain where they were allocated based on the
      policy at the time they were allocated. However, since
      2.6.16, Linux supports page migration via the mbind() system
      call, so that page contents can be moved to match a newly
      installed policy.

Shared Policy
    Conceptually, shared policies apply to "memory objects" mapped
    shared into one or more tasks' distinct address spaces. An
    application installs shared policies the same way as VMA
    policies--using the mbind() system call specifying a range of
    virtual addresses that map the shared object. However, unlike
    VMA policies, which can be considered to be an attribute of a
    range of a task's address space, shared policies apply
    directly to the shared object. Thus, all tasks that attach to
    the object share the policy, and all pages allocated for the
    shared object, by any task, will obey the shared policy.

    As of 2.6.22, only shared memory segments, created by shmget() or
    mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
    policy support was added to Linux, the associated data structures were
    added to hugetlbfs shmem segments. At the time, hugetlbfs did not
    support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
    shmem segments were never "hooked up" to the shared policy support.
    Although hugetlbfs segments now support lazy allocation, their support
    for shared policy has not been completed.

    As mentioned above in the :ref:`VMA policies <vma_policy>` section,
    allocations of page cache pages for regular files mmap()ed
    with MAP_SHARED ignore any VMA policy installed on the virtual
    address range backed by the shared file mapping. Rather,
    shared page cache pages, including pages backing private
    mappings that have not yet been written by the task, follow
    task policy, if any, else System Default Policy.

    The shared policy infrastructure supports different policies on subset
    ranges of the shared object. However, Linux still splits the VMA of
    the task that installs the policy for each range of distinct policy.
    Thus, different tasks that attach to a shared memory segment can have
    different VMA configurations mapping that one shared object. This
    can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
    a shared memory region, when one task has installed shared policy on
    one or more ranges of the region.

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes. The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
    This mode is only used in the memory policy APIs. Internally,
    MPOL_DEFAULT is converted to the NULL memory policy in all
    policy scopes. Any existing non-default policy will simply be
    removed when MPOL_DEFAULT is specified. As a result,
    MPOL_DEFAULT means "fall back to the next most specific policy
    scope."

    For example, a NULL or default task policy will fall back to the
    system default policy. A NULL or default vma policy will fall
    back to the task policy.

    When specified in one of the memory policy APIs, the Default mode
    does not use the optional set of nodes.

    It is an error for the set of nodes specified for this policy to
    be non-empty.

MPOL_BIND
    This mode specifies that memory must come from the set of
    nodes specified by the policy. Memory will be allocated from
    the node in the set with sufficient free memory that is
    closest to the node where the allocation takes place.

MPOL_PREFERRED
    This mode specifies that the allocation should be attempted
    from the single node specified in the policy. If that
    allocation fails, the kernel will search other nodes, in order
    of increasing distance from the preferred node based on
    information provided by the platform firmware.

    Internally, the Preferred policy uses a single node--the
    preferred_node member of struct mempolicy. When the internal
    mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
    and the policy is interpreted as local allocation. "Local"
    allocation policy can be viewed as a Preferred policy that
    starts at the node containing the cpu where the allocation
    takes place.

    It is possible for the user to specify that local allocation
    is always preferred by passing an empty nodemask with this
    mode. If an empty nodemask is passed, the policy cannot use
    the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
    described below.

MPOL_INTERLEAVED
    This mode specifies that page allocations be interleaved, on a
    page granularity, across the nodes specified in the policy.
    This mode also behaves slightly differently, based on the
    context where it is used:

    For allocation of anonymous pages and shared memory pages,
    Interleave mode indexes the set of nodes specified by the
    policy using the page offset of the faulting address into the
    segment [VMA] containing the address modulo the number of
    nodes specified by the policy. It then attempts to allocate a
    page, starting at the selected node, as if the node had been
    specified by a Preferred policy or had been selected by a
    local allocation. That is, allocation will follow the per
    node zonelist.

    For allocation of page cache pages, Interleave mode indexes
    the set of nodes specified by the policy using a node counter
    maintained per task. This counter wraps around to the lowest
    specified node after it reaches the highest specified node.
    This will tend to spread the pages out over the nodes
    specified by the policy based on the order in which they are
    allocated, rather than based on any page offset into an
    address range or file. During system boot up, the temporary
    interleaved system default policy works in this mode.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
    This flag specifies that the nodemask passed by
    the user should not be remapped if the task or VMA's set of allowed
    nodes changes after the memory policy has been defined.

    Without this flag, any time a mempolicy is rebound because of a
    change in the set of allowed nodes, the node (Preferred) or
    nodemask (Bind, Interleave) is remapped to the new set of
    allowed nodes. This may result in nodes being used that were
    previously undesired.

    With this flag, if the user-specified nodes overlap with the
    nodes allowed by the task's cpuset, then the memory policy is
    applied to their intersection. If the two sets of nodes do not
    overlap, the Default policy is used.

    For example, consider a task that is attached to a cpuset with
    mems 1-3 that sets an Interleave policy over the same set. If
    the cpuset's mems change to 3-5, the Interleave will now occur
    over nodes 3, 4, and 5. With this flag, however, since only node
    3 is allowed from the user's nodemask, the "interleave" only
    occurs over that node. If no nodes from the user's nodemask are
    now allowed, the Default behavior is used.

    MPOL_F_STATIC_NODES cannot be combined with the
    MPOL_F_RELATIVE_NODES flag. It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

MPOL_F_RELATIVE_NODES
    This flag specifies that the nodemask passed
    by the user will be mapped relative to the task or VMA's
    set of allowed nodes.
    The kernel stores the user-passed nodemask,
    and if the set of allowed nodes changes, then that original nodemask
    will be remapped relative to the new set of allowed nodes.

    Without this flag (and without MPOL_F_STATIC_NODES), any time a
    mempolicy is rebound because of a change in the set of allowed
    nodes, the node (Preferred) or nodemask (Bind, Interleave) is
    remapped to the new set of allowed nodes. That remap may not
    preserve the relative nature of the user's passed nodemask to its
    set of allowed nodes upon successive rebinds: a nodemask of
    1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
    allowed nodes is restored to its original state.

    With this flag, the remap is done so that the node numbers from
    the user's passed nodemask are relative to the set of allowed
    nodes. In other words, if nodes 0, 2, and 4 are set in the user's
    nodemask, the policy will be effected over the first (and in the
    Bind or Interleave case, the third and fifth) nodes in the set of
    allowed nodes. The nodemask passed by the user represents nodes
    relative to the task or VMA's set of allowed nodes.

    If the user's nodemask includes nodes that are outside the range
    of the new set of allowed nodes (for example, node 5 is set in
    the user's nodemask when the set of allowed nodes is only 0-3),
    then the remap wraps around to the beginning of the nodemask and,
    if not already set, sets the node in the mempolicy nodemask.

    For example, consider a task that is attached to a cpuset with
    mems 2-5 that sets an Interleave policy over the same set with
    MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
    interleave now occurs over nodes 3,5-7. If the cpuset's mems
    then change to 0,2-3,5, then the interleave occurs over nodes
    0,2-3,5.

    Thanks to the consistent remapping, applications preparing
    nodemasks to specify memory policies using this flag should
    disregard their current, actual cpuset imposed memory placement
    and prepare the nodemask as if they were always located on
    memory nodes 0 to N-1, where N is the number of memory nodes the
    policy is intended to manage. Let the kernel then remap to the
    set of memory nodes allowed by the task's cpuset, as that may
    change over time.

    MPOL_F_RELATIVE_NODES cannot be combined with the
    MPOL_F_STATIC_NODES flag. It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put() increment and
decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read. Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy. To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.
   For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path. This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes. This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down. However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel. The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

        long set_mempolicy(int mode, const unsigned long *nmask,
                           unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'. 'nmask' points to a bit mask of node ids containing at least
'maxnode' ids. Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.


Get [Task] Memory Policy or Related Information::

        long get_mempolicy(int *mode,
                           const unsigned long *nmask, unsigned long maxnode,
                           void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.


Install VMA/Shared Policy for a Range of Task's Address Space::

        long mbind(void *start, unsigned long len, int mode,
                   const unsigned long *nmask, unsigned long maxnode,
                   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnodes) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments. Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used. If the
result is the empty set, the policy is considered invalid and cannot be
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
any of the tasks install shared policy on the region; in that case, only nodes
whose memories are allowed in both cpusets may be used in the policies.
Obtaining this information requires "stepping outside" the memory policy APIs
to use the cpuset information and requires that one know in what cpusets other
tasks might be attaching to the shared region. Furthermore, if the cpusets'
allowed memory sets are disjoint, "local" allocation is the only valid policy.