.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses.  For brevity, and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell.  The cells of the NUMA system are
connected together with some sort of system interconnect--crossbars and
point-to-point links are common types of NUMA system interconnect.  Both of
these interconnect types can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems.  With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory.  For example, accesses to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells.  NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting.  Rather, this architecture is a means to provide scalable
memory bandwidth.  However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes".  Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures.  As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses.  And, again, accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory.  Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, x86 again being an example, Linux supports
the emulation of additional nodes.  For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes.  Each emulated node will manage a fraction of the underlying cells'
physical memory.  NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access.  In addition, Linux constructs for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist".  A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request.  This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node.  This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
a default node-ordered zonelist.  This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.

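As a rough illustration of node ordering, the toy userspace sketch below (not
kernel code; the two-node layout is made up) models a system where node 0 has
NORMAL and DMA32 zones and node 1 has only a NORMAL zone.  A node-ordered
zonelist for allocations originating on node 0 visits the remaining local
zones before any zone on the remote node::

  /* Toy model of a node-ordered zonelist; not kernel code.  The
   * node/zone layout is hypothetical. */
  #include <stdio.h>

  struct zoneref { int node; const char *zone; };

  int main(void)
  {
          /* Fallback order for a NORMAL allocation originating on node 0:
           * exhaust the local node's zones first, then move on to the
           * next node by NUMA distance. */
          struct zoneref zonelist_node0[] = {
                  { 0, "NORMAL" },
                  { 0, "DMA32"  },
                  { 1, "NORMAL" },
          };
          unsigned int i;
          unsigned int n = sizeof(zonelist_node0) / sizeof(zonelist_node0[0]);

          for (i = 0; i < n; i++)
                  printf("try node %d, zone %s\n",
                         zonelist_node0[i].node, zonelist_node0[i].zone);
          return 0;
  }
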
By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned.  Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates.  This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.

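A minimal sketch of what this means for code running in the kernel (the
wrapper function and the sizes are illustrative only; kmalloc(),
kmalloc_node() and numa_node_id() are the usual interfaces)::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  static void *local_alloc_example(void)
  {
          /* No node is named, so the allocator starts with the zonelist
           * of the node to which the executing CPU is attached and falls
           * back along that zonelist only if the local node cannot
           * satisfy the request. */
          void *buf = kmalloc(4096, GFP_KERNEL);

          /* Naming the current CPU's node explicitly is equivalent to
           * the default local allocation above. */
          void *buf2 = kmalloc_node(4096, GFP_KERNEL, numa_node_id());

          kfree(buf2);
          return buf;
  }
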
Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory.  The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains.  However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2).  Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].

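For example, a minimal userspace sketch of both kinds of control (the CPU and
node numbers here are arbitrary examples, and set_mempolicy(2) is assumed to
be available through libnuma's <numaif.h>, so link with -lnuma)::

  #define _GNU_SOURCE
  #include <sched.h>
  #include <numaif.h>                 /* set_mempolicy(); link with -lnuma */
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t cpus;
          unsigned long nodemask = 1UL << 0;      /* node 0 only */

          /* Restrict this task to CPU 2 (an arbitrary example) so the
           * scheduler cannot move it to a CPU on another node. */
          CPU_ZERO(&cpus);
          CPU_SET(2, &cpus);
          if (sched_setaffinity(0, sizeof(cpus), &cpus) != 0)
                  perror("sched_setaffinity");

          /* Replace the default local-allocation policy with a hard
           * bind to node 0 for this task's future allocations. */
          if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) != 0)
                  perror("set_mempolicy");

          return 0;
  }
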
System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets.  [see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists.  This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself.  Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, by default, local allocations will succeed, with the kernel supplying the
closest available memory.  This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback
behavior.  Rather, they want to be sure they get memory from the specified node
or get notified that the node has no free memory.  This is usually the case when
a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned.  When such an allocation fails, the requesting subsystem
may revert to its own fallback path.  The slab kernel memory allocator is an
example of this.  Or, the subsystem may choose to disable or not to enable
itself on allocation failure.  The kernel profiling subsystem is an example of
this.

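A minimal sketch of that model (the wrapper function and its simple "any node"
fallback are illustrative only; real subsystems such as the slab allocator
have their own, more involved fallback paths)::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  static void *alloc_on_this_node(size_t size)
  {
          int nid = numa_node_id();       /* node of the current CPU */
          void *p;

          /* __GFP_THISNODE forbids falling back to other nodes, so a
           * NULL return means this node could not satisfy the request. */
          p = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
          if (p)
                  return p;

          /* Subsystem-private fallback: accept memory from any node
           * (or, alternatively, disable the feature entirely). */
          return kmalloc(size, GFP_KERNEL);
  }
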
If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead,
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory.  To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU.  Again, this is the same node from which default, local page
allocations will be attempted.
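
Building on the previous sketch, such a subsystem can instead ask for the
nearest node that actually has memory (again, the wrapper function is
illustrative only)::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  static void *alloc_near_this_cpu(size_t size)
  {
          /* On a memoryless node, numa_node_id() names a node with no
           * memory, whereas numa_mem_id() names the nearest node that
           * does have memory, i.e. the same node that default, local
           * allocations would use. */
          return kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
                              numa_mem_id());
  }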