NUMA-Aware PER-CPU Framework
============================

.. contents::
   :local:
   :depth: 2

Introduction
============

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated units of compute and memory. Each node typically has its own
local memory, and CPUs within a node can access this memory with lower latency
than memory located on remote nodes. In TF-A's current implementation, per-cpu
data (such as PSCI context, SPM context, etc.) is stored in a global array or
contiguous region, usually located in the memory of a single node. This
approach introduces two key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-cpu data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per_cpu_numa_numa_disabled.png
     :alt: Storage Problem in Multi-node Systems
     :align: right
     :width: 500px

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency due to interconnect traversal.
  When per-cpu data is centralized on a single node, CPUs on remote nodes must
  access their per-cpu data via the interconnect, leading to increased latency
  for frequent operations like context switching, exception handling, and crash
  reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, the NUMA-Aware per-cpu framework has been
introduced. This framework optimizes the allocation and access of per-cpu
objects by allowing platforms to place them in the nodes with the least access
latency.

Design
======

To address these architectural challenges, TF-A introduces the NUMA-aware
per-cpu framework. This framework is designed to give platforms the opportunity
to allocate per-cpu data as close to the calling CPU as possible, ideally within
the same NUMA node, thereby reducing access latency and improving overall memory
scalability.

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-cpu data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

.per_cpu Section
----------------

A dedicated .per_cpu section is used to **allocate** per-cpu global variables,
ensuring that these objects reside in the local memory of each NUMA node. The
figure below illustrates how per-cpu objects are allocated in the local memory
of their respective nodes, and the necessary linker modifications to support
this layout are shown in the accompanying snippet.

.. figure:: ../resources/diagrams/per_cpu_numa_numa_enabled.png
   :alt: NUMA-Aware PER-CPU Framework Overview
   :align: center
   :width: 2000px

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-cpu NUMA framework is enabled*

.. code-block:: text

	/* The .per_cpu section gets initialised to 0 at runtime. */	\
	.per_cpu (NOLOAD) : ALIGN(CACHE_WRITEBACK_GRANULE) {		\
		__PER_CPU_START__ = .;					\
		__PER_CPU_UNIT_START__ = .;				\
		*(SORT_BY_ALIGNMENT(.per_cpu*))				\
		__PER_CPU_UNIT_UNALIGNED_END_UNIT__ = .;		\
		. = ALIGN(CACHE_WRITEBACK_GRANULE);			\
		__PER_CPU_UNIT_END__ = .;				\
		__PER_CPU_UNIT_SECTION_SIZE__ =				\
		ABSOLUTE(__PER_CPU_UNIT_END__ - __PER_CPU_UNIT_START__);\
		. = . + (PER_CPU_NODE_CORE_COUNT - 1) *			\
		__PER_CPU_UNIT_SECTION_SIZE__;				\
		__PER_CPU_END__ = .;					\
	}

The newly introduced linker changes also address a common performance issue in
modern multi-CPU systems: **cache thrashing**.

A performance issue known as **cache thrashing** arises when multiple CPUs
access different addresses that are on the same cache line. Although the
accessed variables may be logically independent, their proximity in memory can
result in repeated cache invalidations and reloads. This is because cache
coherency mechanisms operate at the granularity of cache lines (typically 64
bytes). If two CPUs attempt to write to two different addresses that fall within
the same cache line, the cache line is bounced back and forth between the cores,
incurring unnecessary overhead.

.. figure:: ../resources/diagrams/per_cpu_numa_cache_thrashing.png
   :alt: Illustration of Cache Thrashing from Per-CPU Data Collisions
   :align: center
   :width: 600px

*Figure: Two processors modifying different variables placed too closely in
memory, leading to cache thrashing*

To eliminate cache thrashing, this framework employs **linker-script-based
alignment**, which ensures that:

- All per-cpu variables are placed into a **dedicated, aligned** section:
  `.per_cpu`
- That section is aligned to the cache line granularity
  (`CACHE_WRITEBACK_GRANULE`)

Definer Interfaces
------------------

The NUMA-Aware PER-CPU framework provides a set of macros to define and declare
per-cpu objects efficiently in multi-node systems.

- **PER_CPU_DECLARE**

  Declares an external per-cpu object.

  .. code-block:: c

      #define PER_CPU_DECLARE(TYPE, NAME) \
          extern typeof(TYPE) NAME

- **PER_CPU_DEFINE**

  Defines a per-cpu object and places it in the `.per_cpu` section.

  .. code-block:: c

      #define PER_CPU_DEFINE(TYPE, NAME) \
          typeof(TYPE) NAME \
          __section(PER_CPU_SECTION_NAME)

Accessor Interfaces
-------------------

The NUMA-Aware PER-CPU framework provides a set of macros to access per-cpu
objects efficiently in multi-node systems.

- **PER_CPU_BY_INDEX(NAME, CPU)**

  Returns a pointer to the per-cpu object `NAME` for the specified CPU.

  .. code-block:: c

      #define PER_CPU_BY_INDEX(NAME, CPU)		\
          ((__typeof__(&NAME))				\
          (per_cpu_by_index_compute((CPU), (void *)&(NAME))))

- **PER_CPU_CUR(NAME)**

  Returns a pointer to the per-cpu object `NAME` for the current CPU.

  .. code-block:: c

      #define PER_CPU_CUR(NAME)				\
          ((__typeof__(&(NAME)))			\
          (per_cpu_cur_compute((void *)&(NAME))))

For use in assembly routines, a corresponding macro version is provided:

.. code-block:: text

   .macro  per_cpu_cur label, dst=x0, clobber=x1
       /* Safety checks */
       .ifc \dst,\clobber
       .error "per_cpu_cur: dst and clobber must be different"
       .endif

       /* dst = absolute address of label */
       adr_l   \dst, \label

       /* clobber = absolute address of __PER_CPU_START__ */
       adr_l   \clobber, __PER_CPU_START__

       /* dst = (label - __PER_CPU_START__) */
       sub     \dst, \dst, \clobber

       /* clobber = per-cpu base (TPIDR_EL3) */
       mrs     \clobber, tpidr_el3

       /* dst = base + offset */
       add     \dst, \clobber, \dst
   .endm

The accessor interfaces take advantage of the `tpidr_el3` system register
(Thread ID Register at EL3), which stores the **base address of the current
CPU's `.per_cpu` section**. By setting up this register during early CPU
initialization (e.g., in the el3_entrypoint_common path), TF-A can avoid
repeated calculations or memory lookups when accessing per-cpu objects.

Instead of computing the per-cpu address dynamically using platform-level
functions (which could involve node discovery, offset arithmetic, and memory
dereferencing), TF-A can simply:

- Read `tpidr_el3` to get the base address of the current CPU's per-cpu data.
- Add the relative offset of the desired object within the `.per_cpu` section.
- Access the target object directly using this computed address.

This strategy significantly reduces access time by replacing a potentially
expensive memory access path with a single register read and offset addition.
It improves performance, particularly in hot paths like PSCI operations and
context switching, by taking advantage of fast-access system registers instead
of traversing interconnects.

Usage Example
=============

Platform Responsibilities
-------------------------

To integrate the NUMA-Aware PER-CPU Framework into a platform, the following
steps must be taken:

1. Enable the Framework
-----------------------

Set PLATFORM_NODE_COUNT to a value greater than 1 (>= 2) in the platform
makefile to enable NUMA-aware per-cpu support:

.. code-block:: text

    # Values >= 2 enable NUMA-aware per-cpu support
    PLATFORM_NODE_COUNT := 2

Platforms that are not multi-node need not do anything, as PLATFORM_NODE_COUNT
(the node count) defaults to 1. The NUMA framework is not supported for 32-bit
images such as the BL32 sp_min.

2. Provide Per-CPU Section Base Address Table
---------------------------------------------

Declare and initialize an array holding the base address of the `.per_cpu`
section for each node:

.. code-block:: c

    const uintptr_t per_cpu_nodes_base[] = {
        /* Base addresses per node (platform-specific) */
    };

This array allows efficient mapping from logical CPU IDs to physical memory
regions in multi-node systems. Note that this is only one example of how
platforms can define .per_cpu section base addresses; platforms are free to
determine and provide these addresses using other methods, such as device tree
parsing, platform-specific tables, or dynamic discovery logic. The
platform-defined regions holding remote per-cpu sections must have a
page-aligned base and size, because the xlat library requires page-aligned
addresses and sizes when mapping an entry. The per-cpu section itself requires
only CACHE_WRITEBACK_GRANULE alignment for its base.

3. Implement Required Platform Hooks
------------------------------------

Provide the following platform-specific functions:

- **`plat_per_cpu_base(int cpu)`**

  Returns the base address of the `.per_cpu` section for the specified CPU.

- **`plat_per_cpu_node_base(void)`**

  Returns the node base address of the `.per_cpu` section.

- **`plat_per_cpu_dcache_clean(void)`**

  Cleans the entire per-cpu section from the data cache. This ensures that any
  modifications made to per-cpu data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. It
  is especially important on platforms that do not support hardware-managed
  coherency early in boot.

References
==========

- Original Presentation: https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*