NUMA-Aware Per-CPU Framework
============================

.. contents::
    :local:
    :depth: 2

Introduction
------------

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated compute and memory units. Each node typically has its own local
memory, and CPUs within a node can access this memory with lower latency than
CPUs on remote nodes. In TF-A's current implementation, per-CPU data (for
example, PSCI or SPM context) is stored in a global array or contiguous region,
usually located in the memory of a single node. This approach introduces two
key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-CPU data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per-cpu-numa-disabled.png
     :alt: Diagram showing the BL31 binary section layout in TF-A within local
           memory. From bottom to top: \`.text\`, \`.rodata\`, \`.data\`,
           \`.stack\`, \`.bss\`, and \`xlat\` sections. The \`.text\`,
           \`.rodata\`, and \`.data\` segments are \`PROGBITS\` sections, while
           \`.stack\`, \`.bss\`, and \`xlat\` form the \`NOBITS\` sections at
           the top. The memory extends from the local memory start address at
           the bottom to the end address at the top.

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency because of interconnect
  traversal. When per-CPU data is centralized on a single node, CPUs on remote
  nodes must access that data via the interconnect, leading to increased
  latency for frequent operations such as context switching, exception
  handling, and crash reporting. This violates NUMA design principles, where
  data locality is critical to achieving performance and scalability.

To address these challenges, TF-A provides the NUMA-Aware Per-CPU Framework.
The framework optimizes the allocation and access of per-CPU objects by letting
platforms place them in the nodes with the lowest access latency.

Design
------

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-CPU data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

``.per_cpu`` Section
~~~~~~~~~~~~~~~~~~~~

The framework dedicates a zero-initialized, cache-aligned ``.per_cpu`` section
to **allocate** per-CPU global variables and ensure that these objects reside
in the local memory of each NUMA node. The figure below illustrates how per-CPU
objects are allocated in the local memory of their respective nodes.

.. figure:: ../resources/diagrams/per-cpu-numa-enabled.png
   :align: center
   :alt: Diagram comparing the TF-A BL31 memory layout with NUMA disabled
         versus NUMA enabled. When NUMA is disabled, Node 0 contains a local
         memory layout with the \`.text\`, \`.rodata\`, \`.data\`, \`.stack\`,
         \`.bss\`, and \`xlat\` sections stacked vertically. When NUMA is
         enabled, Node 0 includes an additional \`.per_cpu\` section between
         \`.bss\` and \`xlat\` to represent per-CPU data allocation, while
         remote nodes (Node 1 through Node N) each contain their own local
         per-CPU memory regions.

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-CPU NUMA framework is enabled*

At link time, TF-A linker scripts carve out this section and publish its bounds
and per-object stride via internal symbols, so that the section can be
replicated and initialized in the local memory of non-primary nodes.
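
For illustration, the fragment below sketches how a remote node's copy of the
section could be prepared from those published bounds. It is a minimal sketch
under stated assumptions: the ``__PER_CPU_START__`` and ``__PER_CPU_END__``
symbol names and the initialization routine are illustrative, not the literal
TF-A implementation, and ``plat_per_cpu_node_base`` is the platform hook
described later in this document.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>

    /* Section bounds published by the linker script; the exact symbol names
     * used here are an assumption made for this sketch. */
    extern char __PER_CPU_START__[];
    extern char __PER_CPU_END__[];

    /* Platform hook described under "Platform Responsibilities" below. */
    uintptr_t plat_per_cpu_node_base(uint64_t node);

    /* Prepare the .per_cpu replica held in a remote node's local memory by
     * zero-initialising it, matching the zero-initialised primary copy. */
    static void per_cpu_init_node_sketch(uint64_t node)
    {
            size_t size = (size_t)(__PER_CPU_END__ - __PER_CPU_START__);
            uintptr_t base = plat_per_cpu_node_base(node);

            if (base != UINT64_MAX) { /* UINT64_MAX: node not populated */
                    memset((void *)base, 0, size);
            }
    }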

This linker section also addresses a common performance issue in modern
multi-CPU systems known as **false sharing**. This issue arises when multiple
CPUs access different addresses that lie on the same cache line. Although the
accessed variables may be logically independent, their proximity in memory can
result in repeated cache invalidations and reloads. Cache-coherency mechanisms
operate at the granularity of cache lines (typically 64 bytes). If two CPUs
write to different addresses within the same cache line, the line bounces
between cores and incurs unnecessary overhead.

.. figure:: ../resources/diagrams/per-cpu-false-sharing.png
   :align: center
   :alt: Diagram showing three CPUs (CPU 1, CPU 2, and CPU 3) each with their
         own cache, connected through a shared interconnect to main memory. At
         address 0x1000, CPU 2's cache holds data values D1, D2, D3, and D4
         representing per-CPU data objects, while CPU 1 and CPU 3 have that
         cache line marked as invalid. CPU 3 is attempting to read from its
         own per-CPU data object, triggering a coherence transaction over the
         interconnect.

*Figure: Two processors modifying different variables placed too closely in
memory, leading to false sharing*

To eliminate false sharing, this framework employs **linker-script-based
alignment**, which:

- Places all per-CPU variables into a **dedicated, aligned** section
  (``.per_cpu``).
- Aligns that section to the cache writeback granule
  (``CACHE_WRITEBACK_GRANULE``).

Definer Interfaces
~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework provides a set of macros to define and declare
per-CPU objects efficiently in multi-node systems.

- ``PER_CPU_DECLARE(TYPE, NAME)``

  Declares an external per-CPU object so that other translation units can refer
  to it without allocating storage.

- ``PER_CPU_DEFINE(TYPE, NAME)``

  Defines a per-CPU object and assigns it to ``PER_CPU_SECTION_NAME`` so the
  linker emits it into the ``.per_cpu`` section that the framework manages.

Accessor Interfaces
~~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework also provides macros to access per-CPU objects
efficiently in multi-node systems.

- ``PER_CPU_BY_INDEX(NAME, CPU)``

  Returns a pointer to the per-CPU object ``NAME`` for the specified CPU by
  combining the per-node base with the object's offset within ``.per_cpu``.

- ``PER_CPU_CUR(NAME)``

  Returns a pointer to the per-CPU object ``NAME`` for the current CPU.

In assembly routines, the ``per_cpu_cur`` helper macro performs the same
calculation. It accepts the label of the per-CPU object and optional register
arguments (destination and clobber) to materialize the per-CPU pointer without
duplicating addressing logic in assembly files.
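
The snippet below is a minimal usage sketch of these macros. The
``wakeup_count`` object and the surrounding functions are hypothetical examples
introduced for illustration, and the header that provides the macros is not
shown; only the macro names themselves come from the framework description
above.

.. code-block:: c

    #include <stdint.h>

    /* In one translation unit: define the object; PER_CPU_DEFINE places it
     * in the .per_cpu section managed by the framework. */
    PER_CPU_DEFINE(uint64_t, wakeup_count);

    /* In any other translation unit: refer to the same object without
     * allocating storage for it. */
    PER_CPU_DECLARE(uint64_t, wakeup_count);

    void record_wakeup(void)
    {
            /* Pointer to the calling CPU's own instance. */
            uint64_t *cnt = PER_CPU_CUR(wakeup_count);

            (*cnt)++;
    }

    uint64_t read_wakeup_count(uint64_t cpu)
    {
            /* Pointer to another CPU's instance, selected by CPU index. */
            return *PER_CPU_BY_INDEX(wakeup_count, cpu);
    }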

Platform Responsibilities (NUMA-only)
-------------------------------------

When NUMA is enabled, the platform must meet some additional requirements so
that the runtime can correctly set up per-CPU sections on remote nodes:

1. Enable the Framework
~~~~~~~~~~~~~~~~~~~~~~~

Set ``PLATFORM_NODE_COUNT`` to a value greater than 1 in the platform makefile
to enable NUMA-aware per-CPU support:

.. code-block:: make

    PLATFORM_NODE_COUNT := 2 # >= 2 enables NUMA-aware per-CPU support

Platforms that are not multi-node do not need to modify this value because the
default ``PLATFORM_NODE_COUNT`` is 1. The NUMA framework is not supported in
32-bit images such as BL32 SP_MIN.

2. Provide Per-CPU Section Base Address Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure that the platform can supply the base address of the ``.per_cpu``
section for each node and CPU when implementing ``plat_per_cpu_node_base`` and
``plat_per_cpu_base``. The framework does not mandate how this information is
obtained, only that each hook returns a valid base address. Platforms may:

- derive the base addresses from platform descriptors or firmware configuration
  data;
- read them from device tree nodes or other runtime discovery mechanisms; or
- encode them in platform-specific tables compiled into the image.

If a node described in platform data is not populated at runtime, the hooks may
return ``UINT64_MAX`` to signal that no per-CPU section exists for that node.

The platform is free to maintain this mapping however it prefers, either at
compile time or through runtime discovery. The only requirement is that the
``plat_per_cpu_node_base`` and ``plat_per_cpu_base`` hooks translate a node or
CPU identifier into the base address of the corresponding ``.per_cpu`` section
(see the sketch at the end of this step).

Platform-defined regions that hold remote per-CPU sections must have
page-aligned bases and sizes, because they are mapped through the xlat library,
which requires page alignment for mapped entries. The per-CPU section itself
requires only cache writeback granule alignment for its base.
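
As an illustration of the compile-time table approach, the sketch below shows
one possible shape for the two address hooks (their signatures match the hook
list in the next step). The table contents, the ``plat_cpu_to_node`` helper,
and the assumption that ``PLATFORM_NODE_COUNT`` is visible to C code are all
hypothetical details of this example, not requirements of the framework.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical helper mapping a CPU index to its node; not part of the
     * framework. */
    uint64_t plat_cpu_to_node(uint64_t cpu);

    /* Hypothetical compile-time table of per-node .per_cpu base addresses;
     * the values are placeholders for this sketch. */
    static const uintptr_t per_cpu_node_bases[PLATFORM_NODE_COUNT] = {
            [0] = 0xff000000ULL,            /* node 0: primary local memory */
            [1] = 0x1ff000000ULL,           /* node 1: remote local memory  */
    };

    uintptr_t plat_per_cpu_node_base(uint64_t node)
    {
            if (node >= PLATFORM_NODE_COUNT) {
                    return UINT64_MAX;      /* no per-CPU section for node */
            }

            return per_cpu_node_bases[node];
    }

    uintptr_t plat_per_cpu_base(uint64_t cpu)
    {
            /* This sketch simply returns the base of the CPU's node; how a
             * CPU-specific address within that region is derived is left to
             * the platform's contract with the framework. */
            return plat_per_cpu_node_base(plat_cpu_to_node(cpu));
    }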

3. Implement Required Platform Hooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide the following platform-specific functions:

- ``uintptr_t plat_per_cpu_base(uint64_t cpu)``

  Returns the base address of the ``.per_cpu`` section for the specified CPU.

- ``uintptr_t plat_per_cpu_node_base(uint64_t node)``

  Returns the base address of the ``.per_cpu`` section for the specified node.

- ``uintptr_t plat_per_cpu_dcache_clean(void)``

  Cleans the entire per-CPU section from the data cache. This ensures that any
  modifications made to per-CPU data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. This
  step is especially important on platforms that do not support
  hardware-managed coherency early in the boot process.

References
----------

- Original presentation:
  https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*