NUMA-Aware Per-CPU Framework
============================

.. contents::
    :local:
    :depth: 2

Introduction
------------

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated compute and memory units. Each node typically has its own local
memory, and CPUs within a node can access this memory with lower latency than
CPUs on remote nodes. In TF-A's current implementation, per-CPU data (for
example, PSCI or SPM context) is stored in a global array or contiguous region,
usually located in the memory of a single node. This approach introduces two key
issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-CPU data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per-cpu-numa-disabled.png
     :alt: Diagram showing the BL31 binary section layout in TF-A within local
           memory. From bottom to top: .text, .rodata, .data, .stack, .bss, and
           xlat sections. The .text, .rodata, and .data segments are PROGBITS
           sections, while .stack, .bss, and xlat form the NOBITS sections at
           the top. The memory extends from the local memory start address at
           the bottom to the end address at the top.

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency because of interconnect
  traversal. When per-CPU data is centralized on a single node, CPUs on remote
  nodes must access that data via the interconnect, leading to increased latency
  for frequent operations such as context switching, exception handling, and
  crash reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, TF-A provides the NUMA-Aware Per-CPU Framework. The
framework optimizes the allocation and access of per-CPU objects by letting
platforms place each CPU's data in the node that accesses it with the lowest
latency.

Design
------

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-CPU data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

``.per_cpu`` Section
~~~~~~~~~~~~~~~~~~~~

The framework dedicates a zero-initialized, cache-aligned ``.per_cpu`` section
to **allocate** per-CPU global variables and ensure that these objects reside in
the local memory of each NUMA node. The figure below illustrates how per-CPU
objects are allocated in the local memory of their respective nodes.

.. figure:: ../resources/diagrams/per-cpu-numa-enabled.png
   :align: center
   :alt: Diagram comparing the TF-A BL31 memory layout with NUMA disabled versus
         NUMA enabled. When NUMA is disabled, Node 0 contains a local memory
         layout with the .text, .rodata, .data, .stack, .bss, and xlat sections
         stacked vertically. When NUMA is enabled, Node 0 includes an additional
         .per_cpu section between .bss and xlat to represent per-CPU data
         allocation, while remote nodes (Node 1 through Node N) each contain
         their own local per-CPU memory regions.

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-CPU NUMA framework is enabled*

At link time, TF-A linker scripts carve out this section and publish the section
bounds and per-object stride via internal symbols so that the section can be
replicated and initialized across the non-primary nodes.

This linker section also addresses a common performance issue in modern
multi-CPU systems known as **false sharing**. This issue arises when multiple
CPUs access different addresses that lie on the same cache line. Although the
accessed variables may be logically independent, their proximity in memory can
result in repeated cache invalidations and reloads. Cache-coherency mechanisms
operate at the granularity of cache lines (typically 64 bytes). If two CPUs
write to different addresses within the same cache line, the line bounces
between cores and incurs unnecessary overhead.

.. figure:: ../resources/diagrams/per-cpu-false-sharing.png
   :align: center
   :alt: Diagram showing three CPUs (CPU 1, CPU 2, and CPU 3) each with their
         own cache, connected through a shared interconnect to main memory. At
         address 0x1000, CPU 2's cache holds data values D1, D2, D3, and D4
         representing per-CPU data objects, while CPU 1 and CPU 3 have that
         cache line marked as invalid. CPU 3 is attempting to read from its own
         per-CPU data object, triggering a coherence transaction over the
         interconnect.

*Figure: Two processors modifying different variables placed too closely in
memory, leading to false sharing*

To eliminate false sharing, this framework employs **linker-script-based
alignment**, which:

- Places all per-CPU variables into a **dedicated, aligned** section
  (``.per_cpu``).
- Aligns that section using the cache granularity size
  (``CACHE_WRITEBACK_GRANULE``).

Definer Interfaces
~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework provides a set of macros to define and declare
per-CPU objects efficiently in multi-node systems.

- ``PER_CPU_DECLARE(TYPE, NAME)``

  Declares an external per-CPU object so that other translation units can refer
  to it without allocating storage.

- ``PER_CPU_DEFINE(TYPE, NAME)``

  Defines a per-CPU object and assigns it to ``PER_CPU_SECTION_NAME`` so the
  linker emits it into the ``.per_cpu`` section that the framework manages.

Accessor Interfaces
~~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework also provides macros to access per-CPU objects
efficiently in multi-node systems.

- ``PER_CPU_BY_INDEX(NAME, CPU)``

  Returns a pointer to the per-CPU object ``NAME`` for the specified CPU by
  combining the per-node base with the object's offset within ``.per_cpu``.

- ``PER_CPU_CUR(NAME)``

  Returns a pointer to the per-CPU object ``NAME`` for the current CPU.

For use in assembly routines, the corresponding ``per_cpu_cur`` helper macro
performs the same calculation. It accepts the label of the per-CPU object and
optional register arguments (destination and clobber) to materialize the per-CPU
pointer without duplicating addressing logic in assembly files.

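
As an illustration, the sketch below combines the definer and accessor macros
described above. The ``cpu_stats_t`` type, the two functions, and the
``lib/per_cpu.h`` header path are hypothetical placeholders chosen for this
example; only the macro behaviour follows the descriptions in this section.

.. code-block:: c

    #include <stdint.h>

    #include <lib/per_cpu.h>  /* Assumed location of the framework macros. */

    /* Hypothetical per-CPU bookkeeping structure. */
    typedef struct cpu_stats {
            uint64_t irq_count;
            uint64_t last_wakeup;
    } cpu_stats_t;

    /* Emit one instance of the object into the managed .per_cpu section. */
    PER_CPU_DEFINE(cpu_stats_t, cpu_stats);

    /*
     * Other translation units would reference the same object with:
     * PER_CPU_DECLARE(cpu_stats_t, cpu_stats);
     */

    void record_irq(void)
    {
            /* Pointer to the calling CPU's copy, resident in its local node. */
            cpu_stats_t *stats = PER_CPU_CUR(cpu_stats);

            stats->irq_count++;
    }

    uint64_t read_irq_count(uint64_t cpu)
    {
            /* Pointer to another CPU's copy, e.g. for crash reporting. */
            return PER_CPU_BY_INDEX(cpu_stats, cpu)->irq_count;
    }

Because each copy lives in the owning CPU's local ``.per_cpu`` section, the hot
path through ``PER_CPU_CUR`` stays within node-local memory, while
``PER_CPU_BY_INDEX`` remains available for cross-CPU inspection such as crash
reporting.
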
Platform Responsibilities (NUMA-only)
--------------------------------------

When NUMA is enabled, the platform must satisfy a few additional requirements so
that the runtime can correctly set up per-CPU sections on remote nodes:

1. Enable the Framework
~~~~~~~~~~~~~~~~~~~~~~~~

Set ``PLATFORM_NODE_COUNT`` to a value of 2 or greater in the platform makefile
to enable NUMA-aware per-CPU support:

.. code-block:: make

    PLATFORM_NODE_COUNT := 2 # >= 2 enables NUMA-aware per-CPU support

Platforms that are not multi-node do not need to modify this value because the
default ``PLATFORM_NODE_COUNT`` is 1. The NUMA framework is not supported in
32-bit images such as BL32 SP_MIN.

2. Provide Per-CPU Section Base Address Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure that the platform can supply the base address of the ``.per_cpu`` section
for each node and CPU when implementing ``plat_per_cpu_node_base`` and
``plat_per_cpu_base``. The framework does not mandate how this information is
obtained, only that each hook returns a valid base address. Platforms may:

- derive the base addresses from platform descriptors or firmware configuration
  data;
- read them from device tree nodes or other runtime discovery mechanisms; or
- encode them in platform-specific tables compiled into the image.

If a node described in platform data is not populated at runtime, the hooks may
return ``UINT64_MAX`` to signal that no per-CPU section exists for that node.

The platform is free to maintain this mapping however it prefers, either at
compile time or through runtime discovery. The only requirement is that the
``plat_per_cpu_node_base`` and ``plat_per_cpu_base`` hooks translate a node or
CPU identifier into the base address of the corresponding ``.per_cpu`` section.

Platform-defined regions that hold remote per-CPU sections must have
page-aligned bases and sizes so that they can be mapped through the xlat
library, which requires page alignment for mapped entries. The per-CPU section
itself requires only cache writeback granule alignment for its base.

3. Implement Required Platform Hooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide the following platform-specific functions; an illustrative sketch
follows the list:

- ``uintptr_t plat_per_cpu_base(uint64_t cpu)``

  Returns the base address of the ``.per_cpu`` section for the specified CPU.

- ``uintptr_t plat_per_cpu_node_base(uint64_t node)``

  Returns the base address of the ``.per_cpu`` section for the specified node.

- ``uintptr_t plat_per_cpu_dcache_clean(void)``

  Cleans the entire per-CPU section from the data cache. This ensures that any
  modifications made to per-CPU data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. This
  step is especially important on platforms that do not support hardware-managed
  coherency early in the boot process.

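
The framework does not prescribe how the mapping is stored. The sketch below
shows one possible shape for the two base-address hooks using a static
compile-time table; the node count, the CPUs-per-node split, the table contents,
and the ``PLAT_CPU_TO_NODE()`` helper are hypothetical placeholders, not TF-A
definitions.

.. code-block:: c

    #include <stdint.h>

    /*
     * Hypothetical platform layout: two nodes with four CPUs each. A real
     * port would take these values from its own descriptors or discovery.
     */
    #define PLAT_NODE_COUNT         2U
    #define PLAT_CPUS_PER_NODE      4U
    #define PLAT_CPU_TO_NODE(cpu)   ((cpu) / PLAT_CPUS_PER_NODE)

    /* Example per-node .per_cpu base addresses from the platform memory map. */
    static const uintptr_t per_cpu_node_bases[PLAT_NODE_COUNT] = {
            0x00000000f0000000UL,   /* Node 0: section linked into the image. */
            0x00000010f0000000UL,   /* Node 1: replica in remote local memory. */
    };

    uintptr_t plat_per_cpu_node_base(uint64_t node)
    {
            /* Unknown or unpopulated nodes report UINT64_MAX. */
            if (node >= PLAT_NODE_COUNT) {
                    return UINT64_MAX;
            }

            return per_cpu_node_bases[node];
    }

    uintptr_t plat_per_cpu_base(uint64_t cpu)
    {
            /* Resolve the CPU's node, then reuse the per-node lookup. */
            return plat_per_cpu_node_base(PLAT_CPU_TO_NODE(cpu));
    }

``plat_per_cpu_dcache_clean`` is not sketched because it is entirely cache- and
platform-specific; a typical port would clean the section's address range with
the architectural cache helpers (for example ``clean_dcache_range``) once the
remote copies have been initialized.
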
References
----------

- Original presentation:
  https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*