NUMA-Aware Per-CPU Framework
============================

.. contents::
    :local:
    :depth: 2

Introduction
------------

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated compute and memory units. Each node typically has its own local
memory, and CPUs within a node can access this memory with lower latency than
CPUs on remote nodes. In TF-A's current implementation, per-CPU data (for
example, PSCI or SPM context) is stored in a global array or contiguous region,
usually located in the memory of a single node. This approach introduces two
key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-CPU data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per-cpu-numa-disabled.png
     :alt: Diagram showing the BL31 binary section layout in TF-A within local
           memory. From bottom to top: \`.text\`, \`.rodata\`, \`.data\`,
           \`.stack\`, \`.bss\`, and \`xlat\` sections. The \`.text\`,
           \`.rodata\`, and \`.data\` segments are \`PROGBITS\` sections, while
           \`.stack\`, \`.bss\`, and \`xlat\` form the \`NOBITS\` sections at
           the top. The memory extends from the local memory start address at
           the bottom to the end address at the top.

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency because of interconnect
  traversal. When per-CPU data is centralized on a single node, CPUs on remote
  nodes must access that data via the interconnect, leading to increased
  latency for frequent operations such as context switching, exception
  handling, and crash reporting. This violates NUMA design principles, where
  data locality is critical to achieving performance and scalability.

To address these challenges, TF-A provides the NUMA-Aware Per-CPU Framework.
The framework optimizes the allocation and access of per-CPU objects by letting
platforms place them in the nodes with the lowest access latency.

Design
------

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-CPU data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

``.per_cpu`` Section
~~~~~~~~~~~~~~~~~~~~

The framework dedicates a zero-initialized, cache-aligned ``.per_cpu`` section
to **allocate** per-CPU global variables and ensure that these objects reside
in the local memory of each NUMA node. The figure below illustrates how per-CPU
objects are allocated in the local memory of their respective nodes.

.. figure:: ../resources/diagrams/per-cpu-numa-enabled.png
   :align: center
   :alt: Diagram comparing the TF-A BL31 memory layout with NUMA disabled
         versus NUMA enabled. When NUMA is disabled, Node 0 contains a local
         memory layout with the \`.text\`, \`.rodata\`, \`.data\`, \`.stack\`,
         \`.bss\`, and \`xlat\` sections stacked vertically. When NUMA is
         enabled, Node 0 includes an additional \`.per_cpu\` section between
         \`.bss\` and \`xlat\` to represent per-CPU data allocation, while
         remote nodes (Node 1 through Node N) each contain their own local
         per-CPU memory regions.

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-CPU NUMA framework is enabled*

At link time, TF-A linker scripts carve out this section and publish its bounds
and per-object stride via internal symbols, so that the section can be
replicated and initialized in the local memory of non-primary nodes.
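
For illustration, the fragment below sketches how a remote node's copy of the
section could be prepared from those published bounds. It is a minimal sketch
under stated assumptions: the ``__PER_CPU_START__`` and ``__PER_CPU_END__``
symbol names and the initialization routine are illustrative, not the literal
TF-A implementation, and ``plat_per_cpu_node_base`` is the platform hook
described later in this document.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>

    /* Section bounds published by the linker script; the exact symbol names
     * used here are an assumption made for this sketch. */
    extern char __PER_CPU_START__[];
    extern char __PER_CPU_END__[];

    /* Platform hook described under "Platform Responsibilities" below. */
    uintptr_t plat_per_cpu_node_base(uint64_t node);

    /* Prepare the .per_cpu replica held in a remote node's local memory by
     * zero-initialising it, matching the zero-initialised primary copy. */
    static void per_cpu_init_node_sketch(uint64_t node)
    {
            size_t size = (size_t)(__PER_CPU_END__ - __PER_CPU_START__);
            uintptr_t base = plat_per_cpu_node_base(node);

            if (base != UINT64_MAX) { /* UINT64_MAX: node not populated */
                    memset((void *)base, 0, size);
            }
    }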

This linker section also addresses a common performance issue in modern
multi-CPU systems known as **false sharing**. This issue arises when multiple
CPUs access different addresses that lie on the same cache line. Although the
accessed variables may be logically independent, their proximity in memory can
result in repeated cache invalidations and reloads. Cache-coherency mechanisms
operate at the granularity of cache lines (typically 64 bytes). If two CPUs
write to different addresses within the same cache line, the line bounces
between cores and incurs unnecessary overhead.

.. figure:: ../resources/diagrams/per-cpu-false-sharing.png
   :align: center
   :alt: Diagram showing three CPUs (CPU 1, CPU 2, and CPU 3) each with their
         own cache, connected through a shared interconnect to main memory. At
         address 0x1000, CPU 2's cache holds data values D1, D2, D3, and D4
         representing per-CPU data objects, while CPU 1 and CPU 3 have that
         cache line marked as invalid. CPU 3 is attempting to read from its
         own per-CPU data object, triggering a coherence transaction over the
         interconnect.

*Figure: Two processors modifying different variables placed too closely in
memory, leading to false sharing*

To eliminate false sharing, this framework employs **linker-script-based
alignment**, which:

- Places all per-CPU variables into a **dedicated, aligned** section
  (``.per_cpu``).
- Aligns that section to the cache writeback granule
  (``CACHE_WRITEBACK_GRANULE``).

Definer Interfaces
~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework provides a set of macros to define and declare
per-CPU objects efficiently in multi-node systems.

- ``PER_CPU_DECLARE(TYPE, NAME)``

  Declares an external per-CPU object so that other translation units can refer
  to it without allocating storage.

- ``PER_CPU_DEFINE(TYPE, NAME)``

  Defines a per-CPU object and assigns it to ``PER_CPU_SECTION_NAME`` so the
  linker emits it into the ``.per_cpu`` section that the framework manages.

Accessor Interfaces
~~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework also provides macros to access per-CPU objects
efficiently in multi-node systems.

- ``PER_CPU_BY_INDEX(NAME, CPU)``

  Returns a pointer to the per-CPU object ``NAME`` for the specified CPU by
  combining the per-node base with the object's offset within ``.per_cpu``.

- ``PER_CPU_CUR(NAME)``

  Returns a pointer to the per-CPU object ``NAME`` for the current CPU.

In assembly routines, the ``per_cpu_cur`` helper macro performs the same
calculation. It accepts the label of the per-CPU object and optional register
arguments (destination and clobber) to materialize the per-CPU pointer without
duplicating addressing logic in assembly files.
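
The snippet below is a minimal usage sketch of these macros. The
``wakeup_count`` object and the surrounding functions are hypothetical examples
introduced for illustration, and the header that provides the macros is not
shown; only the macro names themselves come from the framework description
above.

.. code-block:: c

    #include <stdint.h>

    /* In one translation unit: define the object; PER_CPU_DEFINE places it
     * in the .per_cpu section managed by the framework. */
    PER_CPU_DEFINE(uint64_t, wakeup_count);

    /* In any other translation unit: refer to the same object without
     * allocating storage for it. */
    PER_CPU_DECLARE(uint64_t, wakeup_count);

    void record_wakeup(void)
    {
            /* Pointer to the calling CPU's own instance. */
            uint64_t *cnt = PER_CPU_CUR(wakeup_count);

            (*cnt)++;
    }

    uint64_t read_wakeup_count(uint64_t cpu)
    {
            /* Pointer to another CPU's instance, selected by CPU index. */
            return *PER_CPU_BY_INDEX(wakeup_count, cpu);
    }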

Platform Responsibilities (NUMA-only)
-------------------------------------

When NUMA is enabled, the platform must meet some additional requirements so
that the runtime can correctly set up per-CPU sections on remote nodes:

1. Enable the Framework
~~~~~~~~~~~~~~~~~~~~~~~

Set ``PLATFORM_NODE_COUNT`` to a value greater than 1 in the platform makefile
to enable NUMA-aware per-CPU support:

.. code-block:: make

    PLATFORM_NODE_COUNT := 2 # >= 2 enables NUMA-aware per-CPU support

Platforms that are not multi-node do not need to modify this value because the
default ``PLATFORM_NODE_COUNT`` is 1. The NUMA framework is not supported in
32-bit images such as BL32 SP_MIN.

2. Provide Per-CPU Section Base Address Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure that the platform can supply the base address of the ``.per_cpu``
section for each node and CPU when implementing ``plat_per_cpu_node_base`` and
``plat_per_cpu_base``. The framework does not mandate how this information is
obtained, only that each hook returns a valid base address. Platforms may:

- derive the base addresses from platform descriptors or firmware configuration
  data;
- read them from device tree nodes or other runtime discovery mechanisms; or
- encode them in platform-specific tables compiled into the image.

If a node described in platform data is not populated at runtime, the hooks may
return ``UINT64_MAX`` to signal that no per-CPU section exists for that node.

The platform is free to maintain this mapping however it prefers, either at
compile time or through runtime discovery. The only requirement is that the
``plat_per_cpu_node_base`` and ``plat_per_cpu_base`` hooks translate a node or
CPU identifier into the base address of the corresponding ``.per_cpu`` section
(see the sketch at the end of this step).

Platform-defined regions that hold remote per-CPU sections must have
page-aligned bases and sizes, because they are mapped through the xlat library,
which requires page alignment for mapped entries. The per-CPU section itself
requires only cache writeback granule alignment for its base.
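
As an illustration of the compile-time table approach, the sketch below shows
one possible shape for the two address hooks (their signatures match the hook
list in the next step). The table contents, the ``plat_cpu_to_node`` helper,
and the assumption that ``PLATFORM_NODE_COUNT`` is visible to C code are all
hypothetical details of this example, not requirements of the framework.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical helper mapping a CPU index to its node; not part of the
     * framework. */
    uint64_t plat_cpu_to_node(uint64_t cpu);

    /* Hypothetical compile-time table of per-node .per_cpu base addresses;
     * the values are placeholders for this sketch. */
    static const uintptr_t per_cpu_node_bases[PLATFORM_NODE_COUNT] = {
            [0] = 0xff000000ULL,            /* node 0: primary local memory */
            [1] = 0x1ff000000ULL,           /* node 1: remote local memory  */
    };

    uintptr_t plat_per_cpu_node_base(uint64_t node)
    {
            if (node >= PLATFORM_NODE_COUNT) {
                    return UINT64_MAX;      /* no per-CPU section for node */
            }

            return per_cpu_node_bases[node];
    }

    uintptr_t plat_per_cpu_base(uint64_t cpu)
    {
            /* This sketch simply returns the base of the CPU's node; how a
             * CPU-specific address within that region is derived is left to
             * the platform's contract with the framework. */
            return plat_per_cpu_node_base(plat_cpu_to_node(cpu));
    }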

3. Implement Required Platform Hooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide the following platform-specific functions:

- ``uintptr_t plat_per_cpu_base(uint64_t cpu)``

  Returns the base address of the ``.per_cpu`` section for the specified CPU.

- ``uintptr_t plat_per_cpu_node_base(uint64_t node)``

  Returns the base address of the ``.per_cpu`` section for the specified node.

- ``uintptr_t plat_per_cpu_dcache_clean(void)``

  Cleans the entire per-CPU section from the data cache. This ensures that any
  modifications made to per-CPU data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. This
  step is especially important on platforms that do not support
  hardware-managed coherency early in the boot process.

References
----------

- Original presentation:
  https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*