NUMA-Aware Per-CPU Framework
============================

.. contents::
   :local:
   :depth: 2

Introduction
------------

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated compute and memory units. Each node typically has its own local
memory, and CPUs within a node can access this memory with lower latency than
CPUs on remote nodes. In TF-A's current implementation, per-CPU data (for
example, PSCI or SPM context) is stored in a global array or contiguous region,
usually located in the memory of a single node. This approach introduces two key
issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-CPU data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per-cpu-numa-disabled.png
     :alt: Diagram showing the BL31 binary section layout in TF-A within local
           memory. From bottom to top: \`.text\`, \`.rodata\`, \`.data\`,
           \`.stack\`, \`.bss\`, and \`xlat\` sections. The \`.text\`,
           \`.rodata\`, and \`.data\` segments are \`PROGBITS\` sections, while
           \`.stack\`, \`.bss\`, and \`xlat\` form the \`NOBITS\` sections at
           the top. The memory extends from the local memory start address at
           the bottom to the end address at the top.

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency because of interconnect
  traversal. When per-CPU data is centralized on a single node, CPUs on remote
  nodes must access that data via the interconnect, leading to increased latency
  for frequent operations such as context switching, exception handling, and
  crash reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, TF-A provides the NUMA-Aware Per-CPU Framework. The
framework optimizes the allocation and access of per-CPU objects by letting
platforms place them in the nodes with the lowest access latency.

Design
------

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-CPU data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

``.per_cpu`` Section
~~~~~~~~~~~~~~~~~~~~

The framework dedicates a zero-initialized, cache-aligned ``.per_cpu`` section
to **allocate** per-CPU global variables and ensure that these objects reside in
the local memory of each NUMA node. The figure below illustrates how per-CPU
objects are allocated in the local memory of their respective nodes.

.. figure:: ../resources/diagrams/per-cpu-numa-enabled.png
   :align: center
   :alt: Diagram comparing the TF-A BL31 memory layout with NUMA disabled versus
         NUMA enabled. When NUMA is disabled, Node 0 contains a local memory
         layout with the \`.text\`, \`.rodata\`, \`.data\`, \`.stack\`,
         \`.bss\`, and \`xlat\` sections stacked vertically. When NUMA is
         enabled, Node 0 includes an additional \`.per_cpu\` section between
         \`.bss\` and \`xlat\` to represent per-CPU data allocation, while
         remote nodes (Node 1 through Node N) each contain their own local
         per-CPU memory regions.

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-CPU NUMA framework is enabled*

At link time, TF-A linker scripts carve out this section and publish section
bounds and per-object stride via internal symbols so that they can be replicated
and initialized across the non-primary nodes.
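
For illustration, the sketch below shows how an accessor could translate an
object's address in the primary ``.per_cpu`` copy into the address of the same
object in another copy. The symbol names and the ``example_per_cpu_ptr`` helper
are placeholders, not the framework's actual linker symbols or API.

.. code-block:: c

   #include <stdint.h>

   /* Placeholder names for the linker-provided section bounds. */
   extern char __PER_CPU_START__[], __PER_CPU_END__[];

   /*
    * Hypothetical helper: an object keeps the same offset from the section
    * start in every replicated copy, so its address in a copy based at
    * 'replica_base' is that base plus the object's offset in the primary
    * copy.
    */
   static inline uintptr_t example_per_cpu_ptr(uintptr_t obj_addr,
                                               uintptr_t replica_base)
   {
           uintptr_t offset = obj_addr - (uintptr_t)__PER_CPU_START__;

           return replica_base + offset;
   }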

This linker section also addresses a common performance issue in modern
multi-CPU systems known as **false sharing**. This issue arises when multiple
CPUs access different addresses that lie on the same cache line. Although the
accessed variables may be logically independent, their proximity in memory can
result in repeated cache invalidations and reloads. Cache-coherency mechanisms
operate at the granularity of cache lines (typically 64 bytes). If two CPUs
write to different addresses within the same cache line, the line bounces
between cores and incurs unnecessary overhead.

.. figure:: ../resources/diagrams/per-cpu-false-sharing.png
   :align: center
   :alt: Diagram showing three CPUs (CPU 1, CPU 2, and CPU 3) each with their
         own cache, connected through a shared interconnect to main memory. At
         address 0x1000, CPU 2's cache holds data values D1, D2, D3, and D4
         representing per-CPU data objects, while CPU 1 and CPU 3 have that
         cache line marked as invalid. CPU 3 is attempting to read from its own
         per-CPU data object, triggering a coherence transaction over the
         interconnect.

*Figure: Two processors modifying different variables placed too closely in
memory, leading to false sharing*

To eliminate false sharing, this framework employs **linker-script-based
alignment** (see the sketch after this list), which:

- Places all per-CPU variables into a **dedicated, aligned** section
  (``.per_cpu``).
- Aligns that section using the cache granularity size
  (``CACHE_WRITEBACK_GRANULE``).
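
A minimal sketch of what this placement can look like in C is shown below. It
is illustrative only: ``EXAMPLE_PER_CPU_DEFINE`` is a hypothetical name, and the
framework's real definer macro and linker-script input may differ.

.. code-block:: c

   /*
    * Illustrative only: emit an object into the dedicated ".per_cpu" input
    * section so that the linker script can gather all per-CPU objects into
    * one output section whose base is aligned to CACHE_WRITEBACK_GRANULE.
    */
   #define EXAMPLE_PER_CPU_DEFINE(type, name) \
           type name __attribute__((section(".per_cpu"), used))

   /* Per-CPU objects therefore do not sit in .bss or .data next to
    * unrelated globals, and the aligned section keeps them from starting
    * in the middle of a cache line used by other data. */
   EXAMPLE_PER_CPU_DEFINE(unsigned long, example_counter);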

Definer Interfaces
~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework provides a set of macros to define and declare
per-CPU objects efficiently in multi-node systems (see the usage sketch after
the list below).

- ``PER_CPU_DECLARE(TYPE, NAME)``

  Declares an external per-CPU object so that other translation units can refer
  to it without allocating storage.

- ``PER_CPU_DEFINE(TYPE, NAME)``

  Defines a per-CPU object and assigns it to ``PER_CPU_SECTION_NAME`` so the
  linker emits it into the ``.per_cpu`` section that the framework manages.
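
The sketch below shows how these macros are intended to be used, assuming the
framework header that provides them is included; the ``cpu_stats_t`` type, its
fields, and the file split are hypothetical.

.. code-block:: c

   #include <stdint.h>

   /* stats.c: define the per-CPU object; the macro places it in the
    * .per_cpu section managed by the framework. */
   typedef struct cpu_stats {
           uint64_t irq_count;
           uint64_t abort_count;
   } cpu_stats_t;

   PER_CPU_DEFINE(cpu_stats_t, cpu_stats);

   /* another_file.c: refer to the same object without allocating storage. */
   PER_CPU_DECLARE(cpu_stats_t, cpu_stats);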

Accessor Interfaces
~~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework also provides macros to access per-CPU objects
efficiently in multi-node systems (see the sketch after the list below).

- ``PER_CPU_BY_INDEX(NAME, CPU)``

  Returns a pointer to the per-CPU object ``NAME`` for the specified CPU by
  combining the per-node base with the object's offset within ``.per_cpu``.

- ``PER_CPU_CUR(NAME)``

  Returns a pointer to the per-CPU object ``NAME`` for the current CPU.
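
A usage sketch is shown below. It reuses the hypothetical ``cpu_stats`` object
from the definer example and assumes the accessor macros evaluate to typed
pointers; the surrounding function and CPU index handling are illustrative only.

.. code-block:: c

   PER_CPU_DECLARE(cpu_stats_t, cpu_stats);

   void example_record_irq(uint64_t remote_cpu)
   {
           /* Update the calling CPU's own copy. */
           PER_CPU_CUR(cpu_stats)->irq_count++;

           /* Inspect another CPU's copy by its index. */
           if (PER_CPU_BY_INDEX(cpu_stats, remote_cpu)->abort_count != 0U) {
                   /* ...platform-specific handling... */
           }
   }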

For use in assembly routines, a corresponding helper macro, ``per_cpu_cur``, is
provided. It performs the same calculation, accepting the label of the per-CPU
object and optional register arguments (destination and clobber) to materialize
the per-CPU pointer without duplicating addressing logic in assembly files.

Platform Responsibilities (NUMA-only)
-------------------------------------

When NUMA is enabled, the platform must meet the following additional
requirements so that the runtime can correctly set up per-CPU sections on
remote nodes:

1. Enable the Framework
~~~~~~~~~~~~~~~~~~~~~~~

Set ``PLATFORM_NODE_COUNT`` to a value greater than 1 (>=2) in the platform
makefile to enable NUMA-aware per-CPU support:

.. code-block:: make

   PLATFORM_NODE_COUNT := 2  # >= 2 enables NUMA-aware per-CPU support

Platforms that are not multi-node do not need to modify this value because the
default ``PLATFORM_NODE_COUNT`` is 1. The NUMA framework is not supported in
32-bit images such as BL32 SP_MIN.

2. Provide Per-CPU Section Base Address Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure that the platform can supply the base address of the ``.per_cpu`` section
for each node and CPU when implementing ``plat_per_cpu_node_base`` and
``plat_per_cpu_base``. The framework does not mandate how this information is
obtained, only that each hook returns a valid base address. Platforms may:

- derive the base addresses from platform descriptors or firmware configuration
  data;
- read them from device tree nodes or other runtime discovery mechanisms; or
- encode them in platform-specific tables compiled into the image.

If a node described in platform data is not populated at runtime, the hooks may
return ``UINT64_MAX`` to signal that no per-CPU section exists for that node.

The platform is free to maintain this mapping however it prefers, either at
compile time or through runtime discovery. The only requirement is that the
``plat_per_cpu_node_base`` and ``plat_per_cpu_base`` hooks translate a node or
CPU identifier into the base address of the corresponding ``.per_cpu`` section.

Platform-defined regions that hold remote per-CPU sections must have
page-aligned bases and sizes for page table mapping through the xlat library,
which requires page alignment for mapped entries. The per-CPU section itself
requires only cache writeback granule alignment for its base.
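
As an example, a platform could register a remote node's per-CPU region with
the xlat library as sketched below. The region base, size, and function name
are hypothetical; only the page-alignment requirement comes from the framework,
and the call assumes the ``xlat_tables_v2`` interface with the region added
before the translation tables are initialized.

.. code-block:: c

   #include <lib/xlat_tables/xlat_tables_v2.h>

   /* Hypothetical, platform-chosen location of node 1's per-CPU region.
    * Both the base and the size must be multiples of PAGE_SIZE. */
   #define NODE1_PER_CPU_BASE   0x80000000ULL
   #define NODE1_PER_CPU_SIZE   0x10000U

   static void example_map_node1_per_cpu(void)
   {
           /* Identity-map the page-aligned region as secure read-write
            * memory so the remote .per_cpu copy can be initialized. */
           mmap_add_region(NODE1_PER_CPU_BASE, NODE1_PER_CPU_BASE,
                           NODE1_PER_CPU_SIZE,
                           MT_MEMORY | MT_RW | MT_SECURE);
   }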

3. Implement Required Platform Hooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide the following platform-specific functions (a table-based sketch follows
the list):

- ``uintptr_t plat_per_cpu_base(uint64_t cpu)``

  Returns the base address of the ``.per_cpu`` section for the specified CPU.

- ``uintptr_t plat_per_cpu_node_base(uint64_t node)``

  Returns the base address of the ``.per_cpu`` section for the specified node.

- ``uintptr_t plat_per_cpu_dcache_clean(void)``

  Cleans the entire per-CPU section from the data cache. This ensures that any
  modifications made to per-CPU data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. This
  step is especially important on platforms that do not support hardware-managed
  coherency early in the boot process.
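
A minimal sketch of the base-address hooks is shown below, assuming a two-node
platform that keeps the per-node bases in a compile-time table; the
``EXAMPLE_*`` names, addresses, and CPU-to-node mapping are hypothetical and
not mandated by the framework.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical values for a two-node platform. */
   #define EXAMPLE_NODE_COUNT      2U
   #define EXAMPLE_CPUS_PER_NODE   4U

   /* Per-node .per_cpu base addresses; UINT64_MAX would mark a node that
    * is not populated at runtime. */
   static const uint64_t example_per_cpu_node_bases[EXAMPLE_NODE_COUNT] = {
           0x04000000ULL,
           0x84000000ULL,
   };

   uintptr_t plat_per_cpu_node_base(uint64_t node)
   {
           return (uintptr_t)example_per_cpu_node_bases[node];
   }

   uintptr_t plat_per_cpu_base(uint64_t cpu)
   {
           /* Simplification: return the owning node's base; a real platform
            * may apply a per-CPU offset within the node's region. */
           return plat_per_cpu_node_base(cpu / EXAMPLE_CPUS_PER_NODE);
   }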

References
----------

- Original presentation:
  https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*