NUMA-Aware PER-CPU Framework
============================

.. contents::
   :local:
   :depth: 2

Introduction
============

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated units of compute and memory. Each node typically has its own
local memory, and CPUs within a node can access this memory with lower latency
than memory located on remote nodes. In TF-A's current implementation, per-cpu
data (such as PSCI context, SPM context, etc.) is stored in a global array or
contiguous region, usually located in the memory of a single node. This
approach introduces two key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-cpu data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.
  .. figure:: ../resources/diagrams/per_cpu_numa_numa_disabled.png
     :alt: Storage Problem in Multi-node Systems
     :align: right
     :width: 500px

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency due to interconnect traversal.
  When per-cpu data is centralized on a single node, CPUs on remote nodes must
  access their per-cpu data via the interconnect, leading to increased latency
  for frequent operations like context switching, exception handling, and crash
  reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, the NUMA-Aware per-cpu framework has been
introduced. This framework optimizes the allocation and access of per-cpu
objects by allowing platforms to place them in the nodes with the least access
latency.

Design
======

The NUMA-aware per-cpu framework is designed to give platforms the opportunity
to allocate per-cpu data as close to the calling CPU as possible, ideally
within the same NUMA node, thereby reducing access latency and improving
overall memory scalability.
The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-cpu data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

.per_cpu Section
----------------

A dedicated `.per_cpu` section is used to **allocate** per-cpu global
variables, ensuring that these objects can be placed in the local memory of
each NUMA node. The figure below illustrates how per-cpu objects are allocated
in the local memory of their respective nodes, and the necessary linker
modifications to support this layout are shown in the accompanying snippet.

.. figure:: ../resources/diagrams/per_cpu_numa_numa_enabled.png
   :alt: NUMA-Aware PER-CPU Framework Overview
   :align: center
   :width: 2000px

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-cpu NUMA framework is enabled*

.. code-block:: text

    /* The .per_cpu section gets initialised to 0 at runtime. */       \
    .per_cpu (NOLOAD) : ALIGN(CACHE_WRITEBACK_GRANULE) {               \
        __PER_CPU_START__ = .;                                         \
        __PER_CPU_UNIT_START__ = .;                                    \
        *(SORT_BY_ALIGNMENT(.per_cpu*))                                \
        __PER_CPU_UNIT_UNALIGNED_END_UNIT__ = .;                       \
        . = ALIGN(CACHE_WRITEBACK_GRANULE);                            \
        __PER_CPU_UNIT_END__ = .;                                      \
        __PER_CPU_UNIT_SECTION_SIZE__ =                                \
            ABSOLUTE(__PER_CPU_UNIT_END__ - __PER_CPU_UNIT_START__);   \
        . = . + (PER_CPU_NODE_CORE_COUNT - 1) *                        \
            __PER_CPU_UNIT_SECTION_SIZE__;                             \
        __PER_CPU_END__ = .;                                           \
    }

The newly introduced linker changes also address a common performance issue in
modern multi-cpu systems: **cache thrashing**.

Cache thrashing arises when multiple CPUs access different addresses that fall
on the same cache line. Although the accessed variables may be logically
independent, their proximity in memory can result in repeated cache
invalidations and reloads. This is because cache coherency mechanisms operate
at the granularity of cache lines (typically 64 bytes). If two CPUs attempt to
write to two different addresses that fall within the same cache line, the
cache line is bounced back and forth between the cores, incurring unnecessary
overhead.
.. figure:: ../resources/diagrams/per_cpu_numa_cache_thrashing.png
   :alt: Illustration of Cache Thrashing from Per-CPU Data Collisions
   :align: center
   :width: 600px

*Figure: Two processors modifying different variables placed too closely in
memory, leading to cache thrashing*

To eliminate cache thrashing, this framework employs **linker-script-based
alignment**, which ensures:

- All per-cpu variables are placed into a **dedicated, aligned** section:
  `.per_cpu`
- That section is aligned to the cache line granularity
  (`CACHE_WRITEBACK_GRANULE`)

Definer Interfaces
------------------

The NUMA-Aware PER-CPU framework provides a set of macros to define and
declare per-cpu objects efficiently in multi-node systems.

- **PER_CPU_DECLARE**

  Declares an external per-cpu object.

  .. code-block:: c

      #define PER_CPU_DECLARE(TYPE, NAME)     \
              extern typeof(TYPE) NAME

- **PER_CPU_DEFINE**

  Defines a per-cpu object and places it in the `.per_cpu` section.
  .. code-block:: c

      #define PER_CPU_DEFINE(TYPE, NAME)      \
              typeof(TYPE) NAME               \
              __section(PER_CPU_SECTION_NAME)

Accessor Interfaces
-------------------

The NUMA-Aware PER-CPU framework provides a set of macros to access per-cpu
objects efficiently in multi-node systems.

- **PER_CPU_BY_INDEX(NAME, CPU)**

  Returns a pointer to the per-cpu object `NAME` for the specified CPU.

  .. code-block:: c

      #define PER_CPU_BY_INDEX(NAME, CPU)                       \
              ((__typeof__(&NAME))                              \
               (per_cpu_by_index_compute((CPU), (void *)&(NAME))))

- **PER_CPU_CUR(NAME)**

  Returns a pointer to the per-cpu object `NAME` for the current CPU.

  .. code-block:: c

      #define PER_CPU_CUR(NAME)                                 \
              ((__typeof__(&(NAME)))                            \
               (per_cpu_cur_compute((void *)&(NAME))))

For use in assembly routines, a corresponding macro version is provided:
.. code-block:: text

    .macro per_cpu_cur label, dst=x0, clobber=x1
        /* Safety checks */
        .ifc \dst,\clobber
            .error "per_cpu_cur: dst and clobber must be different"
        .endif

        /* dst = absolute address of label */
        adr_l \dst, \label

        /* clobber = absolute address of __PER_CPU_START__ */
        adr_l \clobber, __PER_CPU_START__

        /* dst = (label - __PER_CPU_START__) */
        sub \dst, \dst, \clobber

        /* clobber = per-cpu base (TPIDR_EL3) */
        mrs \clobber, tpidr_el3

        /* dst = base + offset */
        add \dst, \clobber, \dst
    .endm

The accessor interfaces take advantage of the `tpidr_el3` system register
(Thread ID Register at EL3), which stores the **base address of the current
CPU's `.per_cpu` section**. By setting up this register during early CPU
initialization (e.g., in the el3_entrypoint_common path), TF-A can avoid
repeated calculations or memory lookups when accessing per-cpu objects.
Instead of computing the per-cpu address dynamically using platform-level
functions (which could involve node discovery, offset arithmetic, and memory
dereferencing), TF-A can simply:

- Read `tpidr_el3` to get the base address of the current CPU's per-cpu data.
- Add the relative offset of the desired object within the `.per_cpu` section.
- Access the target object directly using this computed address.

This strategy significantly reduces access time by replacing a potentially
expensive memory access path with a single register read and an offset
addition. It improves performance, particularly in hot paths like PSCI
operations and context switching, by taking advantage of fast-access system
registers instead of traversing interconnects.

Usage Example
=============

Platform Responsibilities
-------------------------

To integrate the NUMA-Aware PER-CPU Framework into a platform, the following
steps must be taken:

1. Enable the Framework
-----------------------

Set `PLATFORM_NODE_COUNT` to a value greater than 1 (i.e. >= 2) in the
platform makefile to enable NUMA-aware per-cpu support:
.. code-block:: text

    # Values >= 2 enable NUMA-aware per-cpu support.
    PLATFORM_NODE_COUNT := 2

Platforms that are not multi-node need not do anything, as
`PLATFORM_NODE_COUNT` defaults to 1. The NUMA framework is not supported for
32-bit images such as the BL32 sp_min.

2. Provide Per-CPU Section Base Address Table
---------------------------------------------

Declare and initialize an array holding the base address of the `.per_cpu`
section for each node:

.. code-block:: c

    const uintptr_t per_cpu_nodes_base[] = {
        /* Base addresses per node (platform-specific) */
    };

This array allows efficient mapping from logical CPU IDs to physical memory
regions in multi-node systems. This is only one example of how platforms can
define `.per_cpu` section base addresses; platforms are free to determine and
provide these addresses by other methods, such as device tree parsing,
platform-specific tables, or dynamic discovery logic.

Note that the platform-defined regions holding remote per-cpu sections must
have a page-aligned base and size for page table mapping via the xlat library,
simply because xlat requires page-aligned addresses and sizes when mapping an
entry.
The `.per_cpu` section itself requires only `CACHE_WRITEBACK_GRANULE`
alignment for its base.

3. Implement Required Platform Hooks
------------------------------------

Provide the following platform-specific functions:

- **`plat_per_cpu_base(int cpu)`**
  Returns the base address of the `.per_cpu` section for the specified CPU.

- **`plat_per_cpu_node_base(void)`**
  Returns the base address of the `.per_cpu` section for the calling CPU's
  node.

- **`plat_per_cpu_dcache_clean(void)`**
  Cleans the entire per-cpu section from the data cache. This ensures that any
  modifications made to per-cpu data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. It
  is especially important on platforms that do not support hardware-managed
  coherency early in boot.

References
==========

- Original presentation: https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*