.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., a crossbar or
a point-to-point link, which are common types of NUMA system interconnects.
Both of these types of interconnects can be aggregated to create NUMA
platforms with cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures. As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidth than accesses to more
remote cells.
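
As a purely illustrative aside, the node topology and inter-node distances
that Linux exports are visible to user space. The sketch below is not part of
the kernel itself; it assumes the libnuma development files from the numactl
package (link with -lnuma) and simply prints the distance between every pair
of nodes::

  /* Sketch: dump the NUMA distance table with libnuma. */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
          int from, to, max;

          if (numa_available() < 0) {
                  fprintf(stderr, "NUMA is not available on this system\n");
                  return 1;
          }

          max = numa_max_node();
          for (from = 0; from <= max; from++)
                  for (to = 0; to <= max; to++)
                          printf("node %d -> node %d: distance %d\n",
                                 from, to, numa_distance(from, to));
          return 0;
  }

Distances follow the ACPI SLIT convention: a node's distance to itself is
reported as 10, and larger values indicate proportionally more remote nodes.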

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs, for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
a default node-ordered zonelist. This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.
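
The effect of this default ordering can be observed from within the kernel
using the zonelist iteration helpers. The following kernel-internal sketch is
illustrative only--the function is invented for this document, though
node_zonelist() and for_each_zone_zonelist() are real helpers--and prints the
zones of the local node's zonelist in the order the page allocator would
consider them on overflow::

  #include <linux/gfp.h>
  #include <linux/mmzone.h>
  #include <linux/printk.h>
  #include <linux/topology.h>

  static void dump_local_fallback_order(void)
  {
          struct zonelist *zonelist;
          struct zoneref *z;
          struct zone *zone;

          /* The zonelist the page allocator would use for a GFP_KERNEL
           * allocation made on the current CPU's node. */
          zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

          /* Zones appear in fallback order: remaining zones of the local
           * node first, then other nodes by increasing NUMA distance. */
          for_each_zone_zonelist(zone, z, zonelist, gfp_zone(GFP_KERNEL))
                  pr_info("node %d, zone %s\n",
                          zone_to_nid(zone), zone->name);
  }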

By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates. This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.

Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].

System administrators can restrict the CPUs and nodes' memories that a
non-privileged user can specify in the scheduling or NUMA commands and
functions using control groups and CPUsets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
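
As an illustration, a program can apply both kinds of control to itself
through libnuma, which wraps the underlying system calls. The sketch below is
purely illustrative; it assumes libnuma (link with -lnuma) and picks node 0
arbitrarily::

  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
          if (numa_available() < 0) {
                  fprintf(stderr, "NUMA is not available on this system\n");
                  return 1;
          }

          /* Restrict scheduling of this task to the CPUs of node 0,
           * comparable to taskset(1) or numactl --cpunodebind. */
          if (numa_run_on_node(0) != 0)
                  perror("numa_run_on_node");

          /* Set a "preferred" memory policy for this task: allocate from
           * node 0 when possible, but still allow fallback elsewhere. */
          numa_set_preferred(0);

          /* Memory allocated and first touched from here on will normally
           * come from node 0. */
          return 0;
  }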

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default local allocations will succeed, with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation fallback
behavior. Rather they want to be sure they get memory from the specified node
or get notified that the node has no free memory. This is usually the case when
a subsystem allocates per-CPU memory resources, for example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
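
A rough sketch of both idioms follows. The helper functions are invented for
this document; kmalloc_node(), numa_node_id(), numa_mem_id() and the
__GFP_THISNODE flag--which suppresses the zonelist fallback described
earlier--are the real kernel interfaces involved::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  /* "Memory from this CPU's node or tell me it failed": __GFP_THISNODE
   * prevents fallback, so NULL means the node could not satisfy the
   * request and the caller must handle it (its own fallback path, or
   * disabling the feature). */
  static void *alloc_on_this_node(size_t size)
  {
          return kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
                              numa_node_id());
  }

  /* "Memory from the nearest node that has any": numa_mem_id() is the
   * local memory node, so this works transparently even when the calling
   * CPU sits on a memoryless node. */
  static void *alloc_near_this_node(size_t size)
  {
          return kmalloc_node(size, GFP_KERNEL, numa_mem_id());
  }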