NUMA-Aware PER-CPU Framework
============================

.. contents::
   :local:
   :depth: 2

Introduction
============

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated units of compute and memory. Each node typically has its own
local memory, and CPUs within a node can access this memory with lower latency
than memory located on remote nodes. In TF-A's current implementation, per-cpu
data (such as PSCI context, SPM context, etc.) is stored in a global array or
contiguous region, usually located in the memory of a single node. This
approach introduces two key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-cpu data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.
  .. figure:: ../resources/diagrams/per_cpu_numa_numa_disabled.png
     :alt: Storage Problem in Multi-node Systems
     :align: right
     :width: 500px

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency due to interconnect traversal.
  When per-cpu data is centralized on a single node, CPUs on remote nodes must
  access their per-cpu data via the interconnect, leading to increased latency
  for frequent operations like context switching, exception handling, and crash
  reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, the NUMA-Aware per-cpu framework has been
introduced. This framework optimizes the allocation and access of per-cpu
objects by allowing platforms to place them in the nodes with the least access
latency.

Design
======

The NUMA-aware per-cpu framework is designed to give platforms the opportunity
to allocate per-cpu data as close to the calling CPU as possible, ideally
within the same NUMA node, thereby reducing access latency and improving
overall memory scalability.
The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-cpu data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

.per_cpu Section
----------------

A dedicated `.per_cpu` section is used to **allocate** per-cpu global
variables, ensuring that these objects can be placed in the local memory of
each NUMA node. The figure below illustrates how per-cpu objects are allocated
in the local memory of their respective nodes, and the necessary linker
modifications to support this layout are shown in the accompanying snippet.

.. figure:: ../resources/diagrams/per_cpu_numa_numa_enabled.png
   :alt: NUMA-Aware PER-CPU Framework Overview
   :align: center
   :width: 2000px

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-cpu NUMA framework is enabled*

.. code-block:: text

    /* The .per_cpu section gets initialised to 0 at runtime. */       \
    .per_cpu (NOLOAD) : ALIGN(CACHE_WRITEBACK_GRANULE) {               \
        __PER_CPU_START__ = .;                                         \
        __PER_CPU_UNIT_START__ = .;                                    \
        *(SORT_BY_ALIGNMENT(.per_cpu*))                                \
        __PER_CPU_UNIT_UNALIGNED_END_UNIT__ = .;                       \
        . = ALIGN(CACHE_WRITEBACK_GRANULE);                            \
        __PER_CPU_UNIT_END__ = .;                                      \
        __PER_CPU_UNIT_SECTION_SIZE__ =                                \
            ABSOLUTE(__PER_CPU_UNIT_END__ - __PER_CPU_UNIT_START__);   \
        . = . + (PER_CPU_NODE_CORE_COUNT - 1) *                        \
            __PER_CPU_UNIT_SECTION_SIZE__;                             \
        __PER_CPU_END__ = .;                                           \
    }

The newly introduced linker changes also address a common performance issue in
modern multi-cpu systems: **cache thrashing**.

Cache thrashing arises when multiple CPUs access different addresses that fall
on the same cache line. Although the accessed variables may be logically
independent, their proximity in memory can result in repeated cache
invalidations and reloads. This is because cache coherency mechanisms operate
at the granularity of cache lines (typically 64 bytes). If two CPUs attempt to
write to two different addresses that fall within the same cache line, the
cache line is bounced back and forth between the cores, incurring unnecessary
overhead.
.. figure:: ../resources/diagrams/per_cpu_numa_cache_thrashing.png
   :alt: Illustration of Cache Thrashing from Per-CPU Data Collisions
   :align: center
   :width: 600px

*Figure: Two processors modifying different variables placed too closely in
memory, leading to cache thrashing*

To eliminate cache thrashing, this framework employs **linker-script-based
alignment**, which ensures:

- All per-cpu variables are placed into a **dedicated, aligned** section:
  `.per_cpu`
- That section is aligned to the cache line granularity
  (`CACHE_WRITEBACK_GRANULE`)

Definer Interfaces
------------------

The NUMA-Aware PER-CPU framework provides a set of macros to define and
declare per-cpu objects efficiently in multi-node systems.

- **PER_CPU_DECLARE**

  Declares an external per-cpu object.

  .. code-block:: c

      #define PER_CPU_DECLARE(TYPE, NAME)     \
              extern typeof(TYPE) NAME

- **PER_CPU_DEFINE**

  Defines a per-cpu object and places it in the `.per_cpu` section.
  .. code-block:: c

      #define PER_CPU_DEFINE(TYPE, NAME)      \
              typeof(TYPE) NAME               \
              __section(PER_CPU_SECTION_NAME)

Accessor Interfaces
-------------------

The NUMA-Aware PER-CPU framework provides a set of macros to access per-cpu
objects efficiently in multi-node systems.

- **PER_CPU_BY_INDEX(NAME, CPU)**

  Returns a pointer to the per-cpu object `NAME` for the specified CPU.

  .. code-block:: c

      #define PER_CPU_BY_INDEX(NAME, CPU)                       \
              ((__typeof__(&NAME))                              \
               (per_cpu_by_index_compute((CPU), (void *)&(NAME))))

- **PER_CPU_CUR(NAME)**

  Returns a pointer to the per-cpu object `NAME` for the current CPU.

  .. code-block:: c

      #define PER_CPU_CUR(NAME)                                 \
              ((__typeof__(&(NAME)))                            \
               (per_cpu_cur_compute((void *)&(NAME))))

For use in assembly routines, a corresponding macro version is provided:
.. code-block:: text

    .macro per_cpu_cur label, dst=x0, clobber=x1
        /* Safety checks */
        .ifc \dst,\clobber
            .error "per_cpu_cur: dst and clobber must be different"
        .endif

        /* dst = absolute address of label */
        adr_l \dst, \label

        /* clobber = absolute address of __PER_CPU_START__ */
        adr_l \clobber, __PER_CPU_START__

        /* dst = (label - __PER_CPU_START__) */
        sub \dst, \dst, \clobber

        /* clobber = per-cpu base (TPIDR_EL3) */
        mrs \clobber, tpidr_el3

        /* dst = base + offset */
        add \dst, \clobber, \dst
    .endm

The accessor interfaces take advantage of the `tpidr_el3` system register
(Thread ID Register at EL3), which stores the **base address of the current
CPU's `.per_cpu` section**. By setting up this register during early CPU
initialization (e.g., in the el3_entrypoint_common path), TF-A can avoid
repeated calculations or memory lookups when accessing per-cpu objects.
Instead of computing the per-cpu address dynamically using platform-level
functions (which could involve node discovery, offset arithmetic, and memory
dereferencing), TF-A can simply:

- Read `tpidr_el3` to get the base address of the current CPU's per-cpu data.
- Add the relative offset of the desired object within the `.per_cpu` section.
- Access the target object directly using this computed address.

This strategy significantly reduces access time by replacing a potentially
expensive memory access path with a single register read and an offset
addition. It improves performance, particularly in hot paths like PSCI
operations and context switching, by taking advantage of fast-access system
registers instead of traversing interconnects.

Usage Example
=============

Platform Responsibilities
-------------------------

To integrate the NUMA-Aware PER-CPU Framework into a platform, the following
steps must be taken:

1. Enable the Framework
-----------------------

Set `PLATFORM_NODE_COUNT` to a value greater than 1 (i.e. >= 2) in the
platform makefile to enable NUMA-aware per-cpu support:
.. code-block:: text

    # Values >= 2 enable NUMA-aware per-cpu support.
    PLATFORM_NODE_COUNT := 2

Platforms that are not multi-node need not do anything, as
`PLATFORM_NODE_COUNT` defaults to 1. The NUMA framework is not supported for
32-bit images such as the BL32 sp_min.

2. Provide Per-CPU Section Base Address Table
---------------------------------------------

Declare and initialize an array holding the base address of the `.per_cpu`
section for each node:

.. code-block:: c

    const uintptr_t per_cpu_nodes_base[] = {
        /* Base addresses per node (platform-specific) */
    };

This array allows efficient mapping from logical CPU IDs to physical memory
regions in multi-node systems. This is only one example of how platforms can
define `.per_cpu` section base addresses; platforms are free to determine and
provide these addresses by other methods, such as device tree parsing,
platform-specific tables, or dynamic discovery logic.

Note that the platform-defined regions holding remote per-cpu sections must
have a page-aligned base and size for page table mapping via the xlat library,
simply because xlat requires page-aligned addresses and sizes when mapping an
entry.
The `.per_cpu` section itself requires only `CACHE_WRITEBACK_GRANULE`
alignment for its base.

3. Implement Required Platform Hooks
------------------------------------

Provide the following platform-specific functions:

- **`plat_per_cpu_base(int cpu)`**
  Returns the base address of the `.per_cpu` section for the specified CPU.

- **`plat_per_cpu_node_base(void)`**
  Returns the base address of the `.per_cpu` section for the calling CPU's
  node.

- **`plat_per_cpu_dcache_clean(void)`**
  Cleans the entire per-cpu section from the data cache. This ensures that any
  modifications made to per-cpu data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. It
  is especially important on platforms that do not support hardware-managed
  coherency early in boot.

References
==========

- Original presentation: https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*