NUMA-Aware PER-CPU Framework
============================

.. contents::
    :local:
    :depth: 2

Introduction
============

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated units of compute and memory. Each node typically has its own
local memory, and CPUs within a node can access this memory with lower latency
than memory located on remote nodes. In TF-A's current implementation, per-cpu
data (such as PSCI context, SPM context, etc.) is stored in a global array or
contiguous region, usually located in the memory of a single node. This
approach introduces two key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-cpu data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per_cpu_numa_numa_disabled.png
     :alt: Storage Problem in Multi-node Systems
     :align: right
     :width: 500px

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency due to interconnect traversal.
  When per-cpu data is centralized on a single node, CPUs on remote nodes must
  access their per-cpu data via the interconnect, leading to increased latency
  for frequent operations such as context switching, exception handling, and
  crash reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, the NUMA-Aware per-cpu framework has been
introduced.
This framework optimizes the allocation and access of per-cpu
objects by allowing platforms to place them in the nodes with the least access
latency.

Design
======

To address these architectural challenges, TF-A introduces the NUMA-aware
per-cpu framework. This framework is designed to give platforms the opportunity
to allocate per-cpu data as close to the calling CPU as possible, ideally
within the same NUMA node, thereby reducing access latency and improving
overall memory scalability.

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-cpu data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

.per_cpu Section
----------------

A dedicated ``.per_cpu`` section is used to **allocate** per-cpu global
variables, ensuring that these objects are placed in the local memory of each
NUMA node. The figure below illustrates how per-cpu objects are allocated in
the local memory of their respective nodes. The necessary linker modifications
to support this layout are shown in the accompanying snippet.

.. figure:: ../resources/diagrams/per_cpu_numa_numa_enabled.png
   :alt: NUMA-Aware PER-CPU Framework Overview
   :align: center
   :width: 2000px

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-cpu NUMA framework is enabled*

.. code-block:: text

    /* The .per_cpu section gets initialised to 0 at runtime. */        \
    .per_cpu (NOLOAD) : ALIGN(CACHE_WRITEBACK_GRANULE) {                \
        __PER_CPU_START__ = .;                                          \
        __PER_CPU_UNIT_START__ = .;                                     \
        *(SORT_BY_ALIGNMENT(.per_cpu*))                                 \
        __PER_CPU_UNIT_UNALIGNED_END_UNIT__ = .;                        \
        . = ALIGN(CACHE_WRITEBACK_GRANULE);                             \
        __PER_CPU_UNIT_END__ = .;                                       \
        __PER_CPU_UNIT_SECTION_SIZE__ =                                 \
            ABSOLUTE(__PER_CPU_UNIT_END__ - __PER_CPU_UNIT_START__);    \
        . = . + (PER_CPU_NODE_CORE_COUNT - 1) *                         \
            __PER_CPU_UNIT_SECTION_SIZE__;                              \
        __PER_CPU_END__ = .;                                            \
    }

The newly introduced linker changes also address a common performance issue in
modern multi-cpu systems: **cache thrashing**.

Cache thrashing arises when multiple CPUs access different addresses that fall
on the same cache line. Although the accessed variables may be logically
independent, their proximity in memory can result in repeated cache
invalidations and reloads. This is because cache coherency mechanisms operate
at the granularity of cache lines (typically 64 bytes). If two CPUs attempt to
write to two different addresses that fall within the same cache line, the
cache line is bounced back and forth between the cores, incurring unnecessary
overhead.

.. figure:: ../resources/diagrams/per_cpu_numa_cache_thrashing.png
   :alt: Illustration of Cache Thrashing from Per-CPU Data Collisions
   :align: center
   :width: 600px

*Figure: Two processors modifying different variables placed too closely in
memory, leading to cache thrashing*

To eliminate cache thrashing, this framework employs **linker-script-based
alignment**. It ensures that:

- All per-cpu variables are placed into a **dedicated, aligned** section:
  ``.per_cpu``.
- That section is aligned to the cache granularity
  (``CACHE_WRITEBACK_GRANULE``).

Definer Interfaces
------------------

The NUMA-Aware PER-CPU framework provides a set of macros to define and declare
per-cpu objects efficiently in multi-node systems.

- **PER_CPU_DECLARE**

  Declares an external per-cpu object.

  .. code-block:: c

      #define PER_CPU_DECLARE(TYPE, NAME) \
              extern typeof(TYPE) NAME

- **PER_CPU_DEFINE**

  Defines a per-cpu object and places it in the ``.per_cpu`` section.

  .. code-block:: c

      #define PER_CPU_DEFINE(TYPE, NAME)  \
              typeof(TYPE) NAME           \
              __section(PER_CPU_SECTION_NAME)

Accessor Interfaces
-------------------

The NUMA-Aware PER-CPU framework provides a set of macros to access per-cpu
objects efficiently in multi-node systems.

- **PER_CPU_BY_INDEX(NAME, CPU)**

  Returns a pointer to the per-cpu object ``NAME`` for the specified CPU.

  .. code-block:: c

      #define PER_CPU_BY_INDEX(NAME, CPU)                           \
              ((__typeof__(&NAME))                                  \
               (per_cpu_by_index_compute((CPU), (void *)&(NAME))))

- **PER_CPU_CUR(NAME)**

  Returns a pointer to the per-cpu object ``NAME`` for the current CPU.

  .. code-block:: c

      #define PER_CPU_CUR(NAME)                                     \
              ((__typeof__(&(NAME)))                                \
               (per_cpu_cur_compute((void *)&(NAME))))

For use in assembly routines, a corresponding macro version is provided:

.. code-block:: text

    .macro per_cpu_cur label, dst=x0, clobber=x1
        /* Safety checks */
        .ifc \dst,\clobber
            .error "per_cpu_cur: dst and clobber must be different"
        .endif

        /* dst = absolute address of label */
        adr_l   \dst, \label

        /* clobber = absolute address of __PER_CPU_START__ */
        adr_l   \clobber, __PER_CPU_START__

        /* dst = (label - __PER_CPU_START__) */
        sub     \dst, \dst, \clobber

        /* clobber = per-cpu base (TPIDR_EL3) */
        mrs     \clobber, tpidr_el3

        /* dst = base + offset */
        add     \dst, \clobber, \dst
    .endm


The accessor interfaces take advantage of the ``tpidr_el3`` system register
(Thread ID Register at EL3). It stores the **base address of the current CPU's
.per_cpu section**. By setting up this register during early CPU
initialization (e.g., in the ``el3_entrypoint_common`` path), TF-A can avoid
repeated calculations or memory lookups when accessing per-cpu objects.
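The idea behind the accessor computation can be sketched with a minimal,
self-contained model. This is not the TF-A implementation: ``units``,
``per_cpu_by_index_model``, and ``plat_per_cpu_base_model`` are hypothetical
stand-ins for the real linker symbols, ``per_cpu_by_index_compute()``, and the
platform hook; the model only shows how an object's link-time offset from
``__PER_CPU_START__`` is preserved when resolving each CPU's own copy.

```c
#include <stdint.h>

/*
 * Minimal model (NOT the TF-A implementation): two CPUs, each owning one
 * copy ("unit") of the .per_cpu section, simulated here with plain arrays.
 */
#define NUM_CPUS  2U
#define UNIT_SIZE 64U

static uint8_t units[NUM_CPUS][UNIT_SIZE];   /* one unit per CPU */
static uint8_t *per_cpu_start = units[0];    /* stand-in for __PER_CPU_START__ */

/* Stand-in for the platform hook returning a given CPU's per-cpu base. */
static uintptr_t plat_per_cpu_base_model(unsigned int cpu)
{
	return (uintptr_t)units[cpu];
}

/*
 * Sketch of what per_cpu_by_index_compute() conceptually does: translate
 * the link-time address of an object into the address of the copy owned by
 * `cpu`, by keeping the object's offset from the start of the section.
 */
void *per_cpu_by_index_model(unsigned int cpu, void *obj)
{
	uintptr_t offset = (uintptr_t)obj - (uintptr_t)per_cpu_start;

	return (void *)(plat_per_cpu_base_model(cpu) + offset);
}
```

In this model, resolving the same object for CPU 0 and CPU 1 yields two
distinct addresses at the same offset within each CPU's unit, which is exactly
the property the accessor macros rely on.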

Instead of computing the per-cpu address dynamically using platform-level
functions (which could involve node discovery, offset arithmetic, and memory
dereferencing), TF-A can simply:

- Read ``tpidr_el3`` to get the base address of the current CPU's per-cpu data.
- Add the relative offset of the desired object within the ``.per_cpu`` section.
- Access the target object directly using this computed address.

This strategy significantly reduces access time by replacing a potentially
expensive memory access path with a single register read and an offset
addition. It improves performance, particularly in hot paths such as PSCI
operations and context switching, by taking advantage of fast-access system
registers instead of traversing interconnects.

Usage Example
=============

Platform Responsibilities
-------------------------

To integrate the NUMA-Aware PER-CPU Framework into a platform, the following
steps must be taken:

1. Enable the Framework
-----------------------

Set ``PLATFORM_NODE_COUNT`` to a value greater than 1 (i.e. >= 2) in the
platform makefile to enable NUMA-aware per-cpu support:

.. code-block:: text

    # >= 2 enables NUMA-aware per-cpu support
    PLATFORM_NODE_COUNT := 2

Platforms that are not multi-node need not do anything, as
``PLATFORM_NODE_COUNT`` defaults to 1 (a single node).
The NUMA framework is not supported for 32-bit images such as the BL32 sp_min.

2. Provide Per-CPU Section Base Address Table
---------------------------------------------

Declare and initialize an array holding the base address of the ``.per_cpu``
section for each node:

.. code-block:: c

    const uintptr_t per_cpu_nodes_base[] = {
        /* Base addresses per node (platform-specific) */
    };

This array allows efficient mapping from logical CPU IDs to physical memory
regions in multi-node systems.

Note: This is one example of how platforms can define ``.per_cpu`` section base
addresses. Platforms are free to determine and provide these addresses using
other methods, such as device tree parsing, platform-specific tables, or
dynamic discovery logic. It is important to note that the platform-defined
regions holding remote per-cpu sections must have a page-aligned base and size
for page table mapping via the xlat library; this is simply because xlat
requires a page-aligned address and size when mapping an entry. The per-cpu
section itself requires only ``CACHE_WRITEBACK_GRANULE`` alignment for its
base.

3. Implement Required Platform Hooks
------------------------------------

Provide the following platform-specific functions:

- **plat_per_cpu_base(int cpu)**

  Returns the base address of the ``.per_cpu`` section for the specified CPU.

- **plat_per_cpu_node_base(void)**

  Returns the node base address of the ``.per_cpu`` section.

- **plat_per_cpu_dcache_clean(void)**

  Cleans the entire per-cpu section from the data cache. This ensures that any
  modifications made to per-cpu data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. It
  is especially important on platforms that do not support hardware-managed
  coherency early in the boot.

References
==========

- Original Presentation: https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*