.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., a crossbar or
a point-to-point link, which are common types of NUMA system interconnects.
Both of these types of interconnects can be aggregated to create NUMA
platforms with cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures. As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.
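
As a purely illustrative aside, the node topology and inter-node distances
that Linux exports are visible to user space. The sketch below is not part of
the kernel itself; it assumes the libnuma development files from the numactl
package (link with -lnuma) and simply prints the distance between every pair
of nodes::

  /* Sketch: dump the NUMA distance table with libnuma. */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
          int from, to, max;

          if (numa_available() < 0) {
                  fprintf(stderr, "NUMA is not available on this system\n");
                  return 1;
          }

          max = numa_max_node();
          for (from = 0; from <= max; from++)
                  for (to = 0; to <= max; to++)
                          printf("node %d -> node %d: distance %d\n",
                                 from, to, numa_distance(from, to));
          return 0;
  }

Distances follow the ACPI SLIT convention: a node's distance to itself is
reported as 10, and larger values indicate proportionally more remote nodes.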

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs, for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
a default node-ordered zonelist. This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.
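
The effect of this default ordering can be observed from within the kernel
using the zonelist iteration helpers. The following kernel-internal sketch is
illustrative only--the function is invented for this document, though
node_zonelist() and for_each_zone_zonelist() are real helpers--and prints the
zones of the local node's zonelist in the order the page allocator would
consider them on overflow::

  #include <linux/gfp.h>
  #include <linux/mmzone.h>
  #include <linux/printk.h>
  #include <linux/topology.h>

  static void dump_local_fallback_order(void)
  {
          struct zonelist *zonelist;
          struct zoneref *z;
          struct zone *zone;

          /* The zonelist the page allocator would use for a GFP_KERNEL
           * allocation made on the current CPU's node. */
          zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

          /* Zones appear in fallback order: remaining zones of the local
           * node first, then other nodes by increasing NUMA distance. */
          for_each_zone_zonelist(zone, z, zonelist, gfp_zone(GFP_KERNEL))
                  pr_info("node %d, zone %s\n",
                          zone_to_nid(zone), zone->name);
  }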

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates. This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].

System administrators can restrict the CPUs and nodes' memories that a
non-privileged user can specify in the scheduling or NUMA commands and
functions using control groups and CPUsets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
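
As an illustration, a program can apply both kinds of control to itself
through libnuma, which wraps the underlying system calls. The sketch below is
purely illustrative; it assumes libnuma (link with -lnuma) and picks node 0
arbitrarily::

  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
          if (numa_available() < 0) {
                  fprintf(stderr, "NUMA is not available on this system\n");
                  return 1;
          }

          /* Restrict scheduling of this task to the CPUs of node 0,
           * comparable to taskset(1) or numactl --cpunodebind. */
          if (numa_run_on_node(0) != 0)
                  perror("numa_run_on_node");

          /* Set a "preferred" memory policy for this task: allocate from
           * node 0 when possible, but still allow fallback elsewhere. */
          numa_set_preferred(0);

          /* Memory allocated and first touched from here on will normally
           * come from node 0. */
          return 0;
  }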

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default local allocations will succeed, with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback
behavior. Rather they want to be sure they get memory from the specified node
or get notified that the node has no free memory. This is usually the case when
a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
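
A rough sketch of both idioms follows. The helper functions are invented for
this document; kmalloc_node(), numa_node_id(), numa_mem_id() and the
__GFP_THISNODE flag--which suppresses the zonelist fallback described
earlier--are the real kernel interfaces involved::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  /* "Memory from this CPU's node or tell me it failed": __GFP_THISNODE
   * prevents fallback, so NULL means the node could not satisfy the
   * request and the caller must handle it (its own fallback path, or
   * disabling the feature). */
  static void *alloc_on_this_node(size_t size)
  {
          return kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
                              numa_node_id());
  }

  /* "Memory from the nearest node that has any": numa_mem_id() is the
   * local memory node, so this works transparently even when the calling
   * CPU sits on a memoryless node. */
  static void *alloc_near_this_node(size_t size)
  {
          return kmalloc_node(size, GFP_KERNEL, numa_mem_id());
  }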