.. _numa_memory_policy:

==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since the 2.4
kernel series.  The current memory policy support was added to Linux 2.6 around
May 2004.  This document attempts to describe the concepts and APIs of the 2.6
memory policy support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``)
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes.  Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
	this policy is "hard coded" into the kernel.  It is the policy
	that governs all page allocations that aren't controlled by
	one of the more specific policy scopes discussed below.  When
	the system is "up and running", the system default policy will
	use "local allocation" described below.  However, during boot
	up, the system default policy will be set to interleave
	allocations across all nodes with "sufficient" memory, so as
	not to overload the initial boot node with boot-time
	allocations.

Task/Process Policy
	this is an optional, per-task policy.  When defined for a
	specific task, this policy controls all page allocations made
	by or on behalf of the task that aren't controlled by a more
	specific scope. If a task does not define a task policy, then
	all page allocations that would have been controlled by the
	task policy "fall back" to the System Default Policy.

	The task policy applies to the entire address space of a task. Thus,
	it is inheritable, and indeed is inherited, across both fork()
	[clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
	to establish the task policy for a child task exec()'d from an
	executable image that has no awareness of memory policy.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the system call
	that a task may use to set/change its task/process policy.

	In a multi-threaded task, task policies apply only to the thread
	[Linux kernel task] that installs the policy and any threads
	subsequently created by that thread.  Any sibling threads existing
	at the time a new task policy is installed retain their current
	policy.

	A task policy applies only to pages allocated after the policy is
	installed.  Any pages already faulted in by the task when the task
	changes its task policy remain where they were allocated based on
	the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
	A "VMA" or "Virtual Memory Area" refers to a range of a task's
	virtual address space.  A task may define a specific policy for a range
	of its virtual address space.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the mbind() system call used to set a VMA
	policy.

	A VMA policy will govern the allocation of pages that back
	this region of the address space.  Any regions of the task's
	address space that don't have an explicit VMA policy will fall
	back to the task policy, which may itself fall back to the
	System Default Policy.

	VMA policies have a few complicating details:

	* VMA policy applies ONLY to anonymous pages.  These include
	  pages allocated for anonymous segments, such as the task
	  stack and heap, and any regions of the address space
	  mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
	  applied to a file mapping, it will be ignored if the mapping
	  used the MAP_SHARED flag.  If the file mapping used the
	  MAP_PRIVATE flag, the VMA policy will only be applied when
	  an anonymous page is allocated on an attempt to write to the
	  mapping--i.e., at Copy-On-Write.

	* VMA policies are shared between all tasks that share a
	  virtual address space--a.k.a. threads--independent of when
	  the policy is installed; and they are inherited across
	  fork().  However, because VMA policies refer to a specific
	  region of a task's address space, and because the address
	  space is discarded and recreated on exec*(), VMA policies
	  are NOT inheritable across exec().  Thus, only NUMA-aware
	  applications may use VMA policies.

	* A task may install a new VMA policy on a sub-range of a
	  previously mmap()ed region.  When this happens, Linux splits
	  the existing virtual memory area into 2 or 3 VMAs, each with
	  its own policy.

	* By default, VMA policy applies only to pages allocated after
	  the policy is installed.  Any pages already faulted into the
	  VMA range remain where they were allocated based on the
	  policy at the time they were allocated.  However, since
	  2.6.16, Linux supports page migration via the mbind() system
	  call, so that page contents can be moved to match a newly
	  installed policy.

Shared Policy
	Conceptually, shared policies apply to "memory objects" mapped
	shared into one or more tasks' distinct address spaces.  An
	application installs shared policies the same way as VMA
	policies--using the mbind() system call specifying a range of
	virtual addresses that map the shared object.  However, unlike
	VMA policies, which can be considered to be an attribute of a
	range of a task's address space, shared policies apply
	directly to the shared object.  Thus, all tasks that attach to
	the object share the policy, and all pages allocated for the
	shared object, by any task, will obey the shared policy.  A
	minimal sketch of such a call follows this list of scopes.

	As of 2.6.22, only shared memory segments, created by shmget() or
	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
	policy support was added to Linux, the associated data structures were
	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
	support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
	shmem segments were never "hooked up" to the shared policy support.
	Although hugetlbfs segments now support lazy allocation, their support
	for shared policy has not been completed.

	As mentioned above in the :ref:`VMA policies <vma_policy>` section,
	allocations of page cache pages for regular files mmap()ed
	with MAP_SHARED ignore any VMA policy installed on the virtual
	address range backed by the shared file mapping.  Rather,
	shared page cache pages, including pages backing private
	mappings that have not yet been written by the task, follow
	task policy, if any, else System Default Policy.

	The shared policy infrastructure supports different policies on subset
	ranges of the shared object.  However, Linux still splits the VMA of
	the task that installs the policy for each range of distinct policy.
	Thus, different tasks that attach to a shared memory segment can have
	different VMA configurations mapping that one shared object.  This
	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
	a shared memory region, when one task has installed shared policy on
	one or more ranges of the region.

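As a minimal, hedged sketch of installing a shared policy: the example below
creates a MAP_SHARED|MAP_ANONYMOUS segment and interleaves it over nodes 0
and 1 with mbind().  It assumes the mbind() wrapper and MPOL_* definitions
from the numaif.h header shipped with the numactl/libnuma package (not part
of the kernel), and keeps error handling minimal::

	#include <numaif.h>	/* mbind(), MPOL_* -- from the numactl package */
	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 64 * 4096;
		unsigned long nodemask = 0x3;	/* nodes 0 and 1 */

		/* Shared anonymous segment: eligible for shared policy. */
		void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (seg == MAP_FAILED)
			return 1;

		/* The policy attaches to the shared object itself, so any
		 * task that maps this segment allocates pages under it. */
		if (mbind(seg, len, MPOL_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8, 0) != 0)
			perror("mbind");
		return 0;
	}
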
Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
	This mode is only used in the memory policy APIs.  Internally,
	MPOL_DEFAULT is converted to the NULL memory policy in all
	policy scopes.  Any existing non-default policy will simply be
	removed when MPOL_DEFAULT is specified.  As a result,
	MPOL_DEFAULT means "fall back to the next most specific policy
	scope."

	For example, a NULL or default task policy will fall back to the
	system default policy.  A NULL or default vma policy will fall
	back to the task policy.

	When specified in one of the memory policy APIs, the Default mode
	does not use the optional set of nodes.

	It is an error for the set of nodes specified for this policy to
	be non-empty.

MPOL_BIND
	This mode specifies that memory must come from the set of
	nodes specified by the policy.  Memory will be allocated from
	the node in the set with sufficient free memory that is
	closest to the node where the allocation takes place.

MPOL_PREFERRED
	This mode specifies that the allocation should be attempted
	from the single node specified in the policy.  If that
	allocation fails, the kernel will search other nodes, in order
	of increasing distance from the preferred node based on
	information provided by the platform firmware.

	Internally, the Preferred policy uses a single node--the
	preferred_node member of struct mempolicy.  When the internal
	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
	and the policy is interpreted as local allocation.  "Local"
	allocation policy can be viewed as a Preferred policy that
	starts at the node containing the cpu where the allocation
	takes place.

	It is possible for the user to specify that local allocation
	is always preferred by passing an empty nodemask with this
	mode.  If an empty nodemask is passed, the policy cannot use
	the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
	described below.

MPOL_INTERLEAVED
	This mode specifies that page allocations be interleaved, on a
	page granularity, across the nodes specified in the policy.
	This mode also behaves slightly differently, based on the
	context where it is used:

	For allocation of anonymous pages and shared memory pages,
	Interleave mode indexes the set of nodes specified by the
	policy using the page offset of the faulting address into the
	segment [VMA] containing the address modulo the number of
	nodes specified by the policy.  It then attempts to allocate a
	page, starting at the selected node, as if the node had been
	specified by a Preferred policy or had been selected by a
	local allocation.  That is, allocation will follow the per
	node zonelist.  A conceptual sketch of this index calculation
	appears after this list of modes.

	For allocation of page cache pages, Interleave mode indexes
	the set of nodes specified by the policy using a node counter
	maintained per task.  This counter wraps around to the lowest
	specified node after it reaches the highest specified node.
	This will tend to spread the pages out over the nodes
	specified by the policy based on the order in which they are
	allocated, rather than based on any page offset into an
	address range or file.  During system boot up, the temporary
	interleaved system default policy works in this mode.

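The index calculation for anonymous and shared memory pages can be pictured
with the following sketch.  It is conceptual only--the function and variable
names are illustrative and do not correspond to the kernel's actual
implementation::

	/*
	 * Conceptual sketch, not kernel code: pick the interleave node
	 * for a page faulted at 'addr' in a VMA starting at 'vma_start',
	 * given the nodes named by the policy (here a simple array).
	 */
	static int interleave_node(unsigned long addr, unsigned long vma_start,
				   const int *policy_nodes, int nr_policy_nodes)
	{
		/* page offset into the VMA, assuming 4 KiB pages */
		unsigned long page_off = (addr - vma_start) >> 12;

		return policy_nodes[page_off % nr_policy_nodes];
	}
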
NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
	This flag specifies that the nodemask passed by
	the user should not be remapped if the task or VMA's set of allowed
	nodes changes after the memory policy has been defined.

	Without this flag, any time a mempolicy is rebound because of a
	change in the set of allowed nodes, the node (Preferred) or
	nodemask (Bind, Interleave) is remapped to the new set of
	allowed nodes.  This may result in nodes being used that were
	previously undesired.

	With this flag, if the user-specified nodes overlap with the
	nodes allowed by the task's cpuset, then the memory policy is
	applied to their intersection.  If the two sets of nodes do not
	overlap, the Default policy is used.

	For example, consider a task that is attached to a cpuset with
	mems 1-3 that sets an Interleave policy over the same set.  If
	the cpuset's mems change to 3-5, the Interleave will now occur
	over nodes 3, 4, and 5.  With this flag, however, since only node
	3 is allowed from the user's nodemask, the "interleave" only
	occurs over that node.  If no nodes from the user's nodemask are
	now allowed, the Default behavior is used.

	MPOL_F_STATIC_NODES cannot be combined with the
	MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

MPOL_F_RELATIVE_NODES
	This flag specifies that the nodemask passed
	by the user will be mapped relative to the task's or VMA's
	set of allowed nodes.  The kernel stores the user-passed nodemask,
	and if the set of allowed nodes changes, then that original nodemask
	will be remapped relative to the new set of allowed nodes.

	Without this flag (and without MPOL_F_STATIC_NODES), anytime a
	mempolicy is rebound because of a change in the set of allowed
	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes.  That remap may not
	preserve the relative nature of the user's passed nodemask to its
	set of allowed nodes upon successive rebinds: a nodemask of
	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
	allowed nodes is restored to its original state.

	With this flag, the remap is done so that the node numbers from
	the user's passed nodemask are relative to the set of allowed
	nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
	nodemask, the policy will be effected over the first (and in the
	Bind or Interleave case, the third and fifth) nodes in the set of
	allowed nodes.  The nodemask passed by the user represents nodes
	relative to the task's or VMA's set of allowed nodes.

	If the user's nodemask includes nodes that are outside the range
	of the new set of allowed nodes (for example, node 5 is set in
	the user's nodemask when the set of allowed nodes is only 0-3),
	then the remap wraps around to the beginning of the nodemask and,
	if not already set, sets the node in the mempolicy nodemask.

	For example, consider a task that is attached to a cpuset with
	mems 2-5 that sets an Interleave policy over the same set with
	MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
	interleave now occurs over nodes 3,5-7.  If the cpuset's mems
	then change to 0,2-3,5, then the interleave occurs over nodes
	0,2-3,5.

	Thanks to the consistent remapping, applications preparing
	nodemasks to specify memory policies using this flag should
	disregard their current, actual cpuset-imposed memory placement
	and prepare the nodemask as if they were always located on
	memory nodes 0 to N-1, where N is the number of memory nodes the
	policy is intended to manage.  Let the kernel then remap to the
	set of memory nodes allowed by the task's cpuset, as that may
	change over time.  A sketch of this usage follows these flag
	descriptions.

	MPOL_F_RELATIVE_NODES cannot be combined with the
	MPOL_F_STATIC_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

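As a hedged illustration of that guidance, the sketch below interleaves over
the first four nodes the task's cpuset allows, whatever their actual numbers
are now or after a cpuset rebind.  It assumes the set_mempolicy() wrapper and
the MPOL_INTERLEAVE/MPOL_F_RELATIVE_NODES definitions are available from the
numaif.h header of the numactl/libnuma package::

	#include <numaif.h>
	#include <stdio.h>

	int main(void)
	{
		/* Relative nodes 0-3: "the first four nodes my cpuset
		 * allows"; the kernel remaps this now and on every
		 * cpuset change. */
		unsigned long relmask = 0xfUL;

		if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
				  &relmask, sizeof(relmask) * 8) != 0)
			perror("set_mempolicy");
		return 0;
	}
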
Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively.  mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy.  When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes.  "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation.  This is considered a "hot
   path".  Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query.  The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies.  Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read.  Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration.  One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.  To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure.  This requires that we drop this extra
   reference when we're finished "using" the policy.  We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.  For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path.  This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes.  This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down.  However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

	long set_mempolicy(int mode, const unsigned long *nmask,
					unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.

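For example, the sketch below restricts all subsequent allocations of the
calling task to nodes 0 and 1; a child created by fork()/exec() from this
point would inherit the same task policy.  It assumes the set_mempolicy()
wrapper from the numaif.h header of the numactl/libnuma package and keeps
error handling minimal::

	#include <numaif.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned long nodemask = 0x3;	/* nodes 0 and 1 */

		if (set_mempolicy(MPOL_BIND, &nodemask,
				  sizeof(nodemask) * 8) != 0) {
			perror("set_mempolicy");
			return 1;
		}
		/* Allocations by this thread, and by threads or children
		 * it creates from now on, are restricted to nodes 0-1
		 * unless a more specific VMA or shared policy applies. */
		return 0;
	}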

Get [Task] Memory Policy or Related Information::

	long get_mempolicy(int *mode,
			   unsigned long *nmask, unsigned long maxnode,
			   void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.

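For example, the sketch below asks which policy governs a particular address
by passing the MPOL_F_ADDR flag.  It assumes the get_mempolicy() wrapper from
the numaif.h header of the numactl/libnuma package::

	#include <numaif.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		int mode;
		unsigned long nodemask = 0;
		void *buf = malloc(4096);

		if (!buf)
			return 1;
		/* Report the policy in effect for the VMA containing 'buf'. */
		if (get_mempolicy(&mode, &nodemask, sizeof(nodemask) * 8,
				  buf, MPOL_F_ADDR) != 0) {
			perror("get_mempolicy");
			return 1;
		}
		printf("mode=%d nodemask=0x%lx\n", mode, nodemask);
		free(buf);
		return 0;
	}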

Install VMA/Shared Policy for a Range of Task's Address Space::

	long mbind(void *start, unsigned long len, int mode,
		   const unsigned long *nmask, unsigned long maxnode,
		   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

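For example, the sketch below gives one private anonymous mapping a
preferred-node policy and asks the kernel to migrate any pages already
faulted into the range (MPOL_MF_MOVE), as discussed in the VMA policy
section above.  It assumes the mbind() wrapper and flag definitions from
the numaif.h header of the numactl/libnuma package::

	#include <numaif.h>
	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 16 * 4096;
		unsigned long nodemask = 0x1;	/* prefer node 0 */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* Install a VMA policy on [p, p+len) and migrate any pages
		 * already allocated in that range. */
		if (mbind(p, len, MPOL_PREFERRED, &nodemask,
			  sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0) {
			perror("mbind");
			return 1;
		}
		return 0;
	}
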
Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
any of the tasks install shared policy on the region.  In that case, only
nodes whose memories are allowed in both cpusets may be used in the policies.
Obtaining this information requires "stepping outside" the memory policy APIs
to use the cpuset information and requires that one know in what cpusets other
tasks might be attaching to the shared region.  Furthermore, if the cpusets'
allowed memory sets are disjoint, "local" allocation is the only valid policy.