NUMA-Aware PER-CPU Framework ============================ .. contents:: :local: :depth: 2 Introduction ============ Modern System designs increasingly adopt multi-node architectures, where the system is divided into multiple topological units such as chiplet, socket, or any other isolated unit of compute and memory. Each node typically has its own local memory, and CPUs within a node can access this memory with lower latency compared to memory located on remote nodes. In TF-A's current implementation, per-cpu data (such as PSCI context, SPM context, etc.) is stored in a global array or contiguous region, usually located in the memory of a single node. This approach introduces two key issues in multi-node systems: - **Storage Constraints:** As systems scale to include more CPUs and nodes, this centralized allocation becomes a bottleneck. The memory capacity of a single node may be insufficient to hold per-cpu data for all CPUs. This constraint limits scalability in systems where each node has limited local memory. .. figure:: ../resources/diagrams/per_cpu_numa_numa_disabled.png :alt: Storage Problem in Multi-node Systems :align: right :width: 500px *Figure: Typical BL31/BL32 binary storage in local memory* - **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory access across nodes incurs additional latency due to interconnect traversal. When per-cpu data is centralized on a single node, CPUs on remote nodes must access their per-cpu data via the interconnect, leading to increased latency for frequent operations like context switching, exception handling, and crash reporting. This violates NUMA design principles, where data locality is critical to achieving performance and scalability. To address these challenges, the NUMA-Aware per-cpu framework has been introduced. This framework optimizes the allocation and access of per-cpu objects by allowing platforms to place them in nodes with least access latency. Design ====== To address these architectural challenges, TF-A introduces the NUMA-aware per cpu framework. This framework is designed to give platforms the opportunity to allocate per-cpu data as close to the calling CPU as possible, ideally within the same NUMA node, thereby reducing access latency and improving overall memory scalability. The framework provides standardized interfaces and mechanisms for **allocating**, **defining**, and **accessing** per-cpu data in a NUMA-aware environment. This ensures portability and maintainability across different platforms while optimizing for performance in multi-node systems. .per_cpu Section ---------------- A dedicated .per_cpu section to **allocate** per-cpu global variables, ensuring that these objects are allocated in the local memory of each NUMA node. Figure illustrates how per-cpu objects are allocated in the local memory of their respective nodes. The necessary linker modifications to support this layout are shown in the accompanying snippet. .. figure:: ../resources/diagrams/per_cpu_numa_numa_enabled.png :alt: NUMA-Aware PER-CPU Framework Overview :align: center :width: 2000px *Figure: BL31/BL32 binary storage in local memory of per node when per-cpu NUMA framework is enabled* .. code-block:: text /* The .per_cpu section gets initialised to 0 at runtime. */ \ .per_cpu (NOLOAD) : ALIGN(CACHE_WRITEBACK_GRANULE) { \ __PER_CPU_START__ = .; \ __PER_CPU_UNIT_START__ = .; \ *(SORT_BY_ALIGNMENT(.per_cpu*)) \ __PER_CPU_UNIT_UNALIGNED_END_UNIT__ = .; \ . = ALIGN(CACHE_WRITEBACK_GRANULE); \ __PER_CPU_UNIT_END__ = .; \ __PER_CPU_UNIT_SECTION_SIZE__ = \ ABSOLUTE(__PER_CPU_UNIT_END__ - __PER_CPU_UNIT_START__);\ . = . + (PER_CPU_NODE_CORE_COUNT - 1) * \ __PER_CPU_UNIT_SECTION_SIZE__; \ __PER_CPU_END__ = .; \ } The newly introduced linker changes also addresses a common performance issue in modern multi-cpu systems—**cache thrashing**. A performance issue known as **cache thrashing** arises when multiple CPUs access different addresses that are on the same cache line. Although the accessed variables may be logically independent, their proximity in memory can result in repeated cache invalidations and reloads. This is because cache coherency mechanisms operate at the granularity of cache lines (typically 64 bytes). If two CPUs attempt to write to two different addresses that fall within the same cache line, the cache line is bounced back and forth between the cores, incurring unnecessary overhead. .. figure:: ../resources/diagrams/per_cpu_numa_cache_thrashing.png :alt: Illustration of Cache Thrashing from Per-CPU Data Collisions :align: center :width: 600px *Figure: Two processors modifying different variables placed too closely in memory, leading to cache thrashing* To eliminate cache thrashing, this framework employs **linker-script-based alignment**. It ensures: - Placing all per-cpu variables into a **dedicated, aligned** section: `.per_cpu` - Aligning that section using the cache granularity size (`CACHE_WRITEBACK_GRANULE`) Definer Interfaces ------------------ The NUMA-Aware PER-CPU framework provides set of macros to define and declare per-cpu objects efficiently in multi-node systems. - **PER_CPU_DECLARE** Declares an external per-cpu object. .. code-block:: c #define PER_CPU_DECLARE(TYPE, NAME) \ extern typeof(TYPE) NAME - **PER_CPU_DEFINE** Defines a per-cpu object and places it in the `.per_cpu` section. .. code-block:: c #define PER_CPU_DEFINE(TYPE, NAME) \ typeof(TYPE) NAME \ __section(PER_CPU_SECTION_NAME) Accessor Interfaces ------------------- The NUMA-Aware PER-CPU framework provides set of macros to access per-cpu objects efficiently in multi-node systems. - **PER_CPU_BY_INDEX(NAME, CPU)** Returns a pointer to the per-cpu object `NAME` for the specified CPU. .. code-block:: c #define PER_CPU_BY_INDEX(NAME, CPU) \ ((__typeof__(&NAME)) \ (per_cpu_by_index_compute((CPU), (void *)&(NAME)))) - **PER_CPU_CUR(NAME)** Returns a pointer to the per-cpu object `NAME` for the current CPU. .. code-block:: c #define PER_CPU_CUR(NAME) \ ((__typeof__(&(NAME))) \ (per_cpu_cur_compute((void *)&(NAME)))) For use in assembly routines, a corresponding macro version is provided: .. code-block:: text .macro per_cpu_cur label, dst=x0, clobber=x1 /* Safety checks */ .ifc \dst,\clobber .error "per_cpu_cur: dst and clobber must be different" .endif /* dst = absolute address of label */ adr_l \dst, \label /* clobber = absolute address of __PER_CPU_START__ */ adr_l \clobber, __PER_CPU_START__ /* dst = (label - __PER_CPU_START__) */ sub \dst, \dst, \clobber /* clobber = per-cpu base (TPIDR_EL3) */ mrs \clobber, tpidr_el3 /* dst = base + offset */ add \dst, \clobber, \dst .endm The accessor interfaces take advantage of using `tpidr_el3` system register (Thread ID Register at EL3). It stores the **base address of the current CPU's `.per_cpu` section**. By setting up this register during early CPU initialization (e.g., in the el3_entrypoint_common path), TF-A can avoid repeated calculations or memory lookups when accessing per-cpu objects. Instead of computing the per-cpu address dynamically using platform-level functions (which could involve node discovery, offset arithmetic, and memory dereferencing), TF-A can simply: - Read `tpidr_el3` to get the base address of the current CPU's per-cpu data. - Add the relative offset of the desired object within the `.per_cpu` section. - Access the target object directly using this computed address. This strategy significantly reduces access time by replacing a potentially expensive memory access path with a single register read and offset addition. It improves performance—particularly in hot paths like PSCI operations and context switching taking advantage of fast-access system registers instead of traversing interconnects. Usage Example ============= Platform Responsibilities ------------------------- To integrate the NUMA-Aware PER-CPU Framework into a platform, the following steps must be taken: 1. Enable the Framework ------------------------- Set the PLATFORM_NODE_COUNT to greater than 1 (>=2) in the platform makefile to enable NUMA-aware per-cpu support: .. code-block:: text PLATFORM_NODE_COUNT := 1 (>=2 for enabling NUMA-aware per-cpu support) Platforms that are not multi-node needn't do anything as PLATFORM_NODE_COUNT = 1 (NODE COUNT) by default. In the case of 32-bit Images such as BL32 sp_min NUMA framework is not supported. 2. Provide Per-CPU Section Base Address Table --------------------------------------------- Declare and initialize an array holding the base address of the `.per_cpu` section for each node: .. code-block:: c const uintptr_t per_cpu_nodes_base[] = { /* Base addresses per node (platform-specific) */ }; This array allows efficient mapping from logical CPU IDs to physical memory regions in multi-node systems. Note: This is one example of how platforms can define .per_cpu section base addresses. Platforms are free to determine and provide these addresses using other methods, such as device tree parsing, platform-specific tables, or dynamic discovery logic. It is important to note that the platform defined regions for holding remote per-cpu section should have a page aligned base and size for page table mapping via the xlat library. This is simply due to the fact that xlat requires page aligned address and size for mapping an entry. per-cpu section by itself requires only CACHE_WRITEBACK_GRANULE alignment for its base. 3. Implement Required Platform Hooks ------------------------------------ Provide the following platform-specific functions: - **`plat_per_cpu_base(int cpu)`** Returns the base address of the `.per_cpu` section for the specified CPU. - **`plat_per_cpu_node_base(void)`** Returns the node base address of the `.per_cpu` section. - **`plat_per_cpu_dcache_clean(void)`** Cleans the entire per-cpu section from the data cache. This ensures that any modifications made to per-cpu data are written back to memory, making them visible to other CPUs or system components that may access this memory. It is especially important on platforms that do not support hardware managed coherency early in the boot. References ========== - Original Presentation: https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf -------------- *Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*