NUMA-Aware PER-CPU Framework
============================

.. contents::
   :local:
   :depth: 2

Introduction
============

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as a chiplet, a socket,
or any other isolated unit of compute and memory. Each node typically has its
own local memory, and CPUs within a node can access this memory with lower
latency than memory located on remote nodes. In TF-A's current implementation,
per-cpu data (such as PSCI context and SPM context) is stored in a global
array or contiguous region, usually located in the memory of a single node. This
approach introduces two key issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-cpu data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per_cpu_numa_numa_disabled.png
     :alt: Storage Problem in Multi-node Systems
     :align: right
     :width: 500px

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency due to interconnect traversal.
  When per-cpu data is centralized on a single node, CPUs on remote nodes must
  access their per-cpu data via the interconnect, leading to increased latency
  for frequent operations like context switching, exception handling, and crash
  reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, the NUMA-Aware per-cpu framework has been
introduced. This framework optimizes the allocation and access of per-cpu
objects by allowing platforms to place them in the nodes with the lowest access
latency.

Design
======

The NUMA-aware per-cpu framework is designed to give platforms the opportunity
to allocate per-cpu data as close to the calling CPU as possible, ideally within
the same NUMA node, thereby reducing access latency and improving overall memory
scalability.

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-cpu data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

.per_cpu Section
----------------

A dedicated `.per_cpu` section is used to **allocate** per-cpu global variables,
ensuring that these objects are allocated in the local memory of each NUMA node.
The figure below illustrates how per-cpu objects are allocated in the local
memory of their respective nodes. The linker modifications needed to support
this layout are shown in the snippet that follows the figure.

.. figure:: ../resources/diagrams/per_cpu_numa_numa_enabled.png
   :alt: NUMA-Aware PER-CPU Framework Overview
   :align: center
   :width: 2000px

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-cpu NUMA framework is enabled*

.. code-block:: text

    /* The .per_cpu section gets initialised to 0 at runtime. */    \
    .per_cpu (NOLOAD) : ALIGN(CACHE_WRITEBACK_GRANULE) {            \
        __PER_CPU_START__ = .;                                      \
        __PER_CPU_UNIT_START__ = .;                                 \
        *(SORT_BY_ALIGNMENT(.per_cpu*))                             \
        __PER_CPU_UNIT_UNALIGNED_END_UNIT__ = .;                    \
        . = ALIGN(CACHE_WRITEBACK_GRANULE);                         \
        __PER_CPU_UNIT_END__ = .;                                   \
        __PER_CPU_UNIT_SECTION_SIZE__ =                             \
        ABSOLUTE(__PER_CPU_UNIT_END__ - __PER_CPU_UNIT_START__);    \
        . = . + (PER_CPU_NODE_CORE_COUNT - 1) *                     \
        __PER_CPU_UNIT_SECTION_SIZE__;                              \
        __PER_CPU_END__ = .;                                        \
    }

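The size arithmetic performed by the linker snippet above can be sketched in C.
This is only an illustration of the layout maths, not TF-A code: the function
names are hypothetical, and the granule and core count are passed in as plain
parameters instead of coming from `CACHE_WRITEBACK_GRANULE` and
`PER_CPU_NODE_CORE_COUNT`.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Round size up to the next multiple of granule (a power of two). */
    size_t example_align_up(size_t size, size_t granule)
    {
        assert(granule != 0U && (granule & (granule - 1U)) == 0U);
        return (size + granule - 1U) & ~(granule - 1U);
    }

    /*
     * Size of one CPU's unit. Mirrors __PER_CPU_UNIT_SECTION_SIZE__: the
     * unaligned end of the unit is rounded up to the granule.
     */
    size_t example_unit_size(size_t unaligned_size, size_t granule)
    {
        return example_align_up(unaligned_size, granule);
    }

    /*
     * Size of the whole node-local section. Mirrors
     * __PER_CPU_END__ - __PER_CPU_START__: one unit, plus (cores - 1)
     * further units of the same size.
     */
    size_t example_section_size(size_t unaligned_size, size_t granule,
                                size_t cores_per_node)
    {
        size_t unit = example_unit_size(unaligned_size, granule);

        return unit + (cores_per_node - 1U) * unit;
    }

With 200 bytes of per-cpu variables, a 64-byte granule, and four cores per node,
each unit occupies 256 bytes and the section occupies 1024 bytes.
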
The newly introduced linker changes also address a common performance issue in
modern multi-cpu systems: **cache thrashing** (often referred to as false
sharing).

Cache thrashing arises when multiple CPUs access different addresses that are
on the same cache line. Although the accessed variables may be logically
independent, their proximity in memory can result in repeated cache
invalidations and reloads. This is because cache coherency mechanisms operate at
the granularity of cache lines (typically 64 bytes). If two CPUs attempt to
write to two different addresses that fall within the same cache line, the cache
line is bounced back and forth between the cores, incurring unnecessary
overhead.

.. figure:: ../resources/diagrams/per_cpu_numa_cache_thrashing.png
   :alt: Illustration of Cache Thrashing from Per-CPU Data Collisions
   :align: center
   :width: 600px

*Figure: Two processors modifying different variables placed too closely in
memory, leading to cache thrashing*

To eliminate cache thrashing, this framework employs **linker-script-based
alignment**. It does so by:

- Placing all per-cpu variables into a **dedicated, aligned** section:
  `.per_cpu`
- Aligning that section, and the end of each CPU's unit within it, to the cache
  granularity size (`CACHE_WRITEBACK_GRANULE`)

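The effect of this alignment can be illustrated with a small, hypothetical C
sketch that compares a packed array of counters with a layout in which every
CPU's counter is padded to a full cache line. The 64-byte line size and all
names here are assumptions made for the example; TF-A achieves the separation
in the linker script, not with per-variable padding.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed line size; stands in for CACHE_WRITEBACK_GRANULE. */
    #define EXAMPLE_CACHE_LINE 64U

    /* Padded layout: each CPU's counter owns a complete cache line. */
    struct example_padded_counter {
        _Alignas(EXAMPLE_CACHE_LINE) uint64_t count;
    };

    /* Index of the cache line that contains a given byte offset. */
    size_t example_cache_line_of(size_t offset)
    {
        return offset / EXAMPLE_CACHE_LINE;
    }

    /* Byte offset of a CPU's counter in a packed uint64_t array. */
    size_t example_packed_offset(size_t cpu)
    {
        return cpu * sizeof(uint64_t);
    }

    /* Byte offset of a CPU's counter in an array of padded counters. */
    size_t example_padded_offset(size_t cpu)
    {
        return cpu * sizeof(struct example_padded_counter);
    }

In the packed layout the counters of CPU 0 and CPU 1 fall on the same cache
line, so writes from both CPUs contend for it. In the padded layout they fall on
different lines.
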
Definer Interfaces
------------------

The NUMA-Aware PER-CPU framework provides a set of macros to define and declare
per-cpu objects efficiently in multi-node systems.

- **PER_CPU_DECLARE**

  Declares an external per-cpu object.

  .. code-block:: c

      #define PER_CPU_DECLARE(TYPE, NAME) \
          extern typeof(TYPE) NAME

- **PER_CPU_DEFINE**

  Defines a per-cpu object and places it in the `.per_cpu` section.

  .. code-block:: c

      #define PER_CPU_DEFINE(TYPE, NAME) \
          typeof(TYPE) NAME \
          __section(PER_CPU_SECTION_NAME)

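As a hedged usage sketch, the two macros might be combined as shown below. To
keep the example buildable on a host toolchain, it defines local stand-ins for
the macros (using `__typeof__` and a GCC/Clang section attribute on ELF
targets). The `struct example_stats` type and the `example_stats` object are
hypothetical and do not exist in TF-A.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Host-side stand-ins for the framework macros (assumptions). */
    #if defined(__ELF__)
    #define EXAMPLE_SECTION(name) __attribute__((section(name)))
    #else
    #define EXAMPLE_SECTION(name) /* section placement skipped on non-ELF */
    #endif

    #define PER_CPU_SECTION_NAME ".per_cpu"
    #define PER_CPU_DECLARE(TYPE, NAME) \
        extern __typeof__(TYPE) NAME
    #define PER_CPU_DEFINE(TYPE, NAME) \
        __typeof__(TYPE) NAME EXAMPLE_SECTION(PER_CPU_SECTION_NAME)

    /* Hypothetical per-cpu object type. */
    struct example_stats {
        uint64_t smc_count;
        uint32_t last_fid;
    };

    /* Would normally live in a header shared by all users. */
    PER_CPU_DECLARE(struct example_stats, example_stats);

    /* Would normally live in exactly one source file. */
    PER_CPU_DEFINE(struct example_stats, example_stats);

    /* The macros preserve the declared type of the object. */
    static_assert(sizeof(example_stats) == sizeof(struct example_stats),
                  "per-cpu object keeps its declared type");

In TF-A itself the object would then be reached through the accessor
interfaces described in the next section, not by naming the symbol directly.
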
Accessor Interfaces
-------------------

The NUMA-Aware PER-CPU framework provides a set of macros to access per-cpu
objects efficiently in multi-node systems.

- **PER_CPU_BY_INDEX(NAME, CPU)**

  Returns a pointer to the per-cpu object `NAME` for the specified CPU.

  .. code-block:: c

      #define PER_CPU_BY_INDEX(NAME, CPU)                       \
          ((__typeof__(&NAME))                                  \
          (per_cpu_by_index_compute((CPU), (void *)&(NAME))))

- **PER_CPU_CUR(NAME)**

  Returns a pointer to the per-cpu object `NAME` for the current CPU.

  .. code-block:: c

      #define PER_CPU_CUR(NAME)                         \
      ((__typeof__(&(NAME)))                            \
      (per_cpu_cur_compute((void *)&(NAME))))

For use in assembly routines, a corresponding macro version is provided:

.. code-block:: text

   .macro  per_cpu_cur label, dst=x0, clobber=x1
       /* Safety checks */
       .ifc \dst,\clobber
       .error "per_cpu_cur: dst and clobber must be different"
       .endif

       /* dst = absolute address of label */
       adr_l   \dst, \label

       /* clobber = absolute address of __PER_CPU_START__ */
       adr_l   \clobber, __PER_CPU_START__

       /* dst = (label - __PER_CPU_START__) */
       sub     \dst, \dst, \clobber

       /* clobber = per-cpu base (TPIDR_EL3) */
       mrs     \clobber, tpidr_el3

       /* dst = base + offset */
       add     \dst, \clobber, \dst
   .endm

The accessor interfaces take advantage of the `tpidr_el3` system register (the
EL3 software thread ID register), which stores the **base address** of the
current CPU's `.per_cpu` section. By setting up this register during early CPU
initialization (e.g., in the `el3_entrypoint_common` path), TF-A can avoid
repeated calculations or memory lookups when accessing per-cpu objects.

Instead of computing the per-cpu address dynamically using platform-level
functions (which could involve node discovery, offset arithmetic, and memory
dereferencing), TF-A can simply:

- Read `tpidr_el3` to get the base address of the current CPU's per-cpu data.
- Add the relative offset of the desired object within the `.per_cpu` section.
- Access the target object directly using this computed address.

This strategy significantly reduces access time by replacing a potentially
expensive memory access path with a single register read and offset addition. It
improves performance, particularly in hot paths like PSCI operations and context
switching, by taking advantage of fast-access system registers instead of
traversing interconnects.

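The base-plus-offset computation described above can be modelled on a host in
C. This is a simulation under stated assumptions, not the TF-A implementation:
an ordinary array stands in for the per-CPU units, `example_cpu_base()` stands
in for the value that would be held in `tpidr_el3`, and every name in the block
is hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    #define EXAMPLE_CPU_COUNT 4U
    #define EXAMPLE_UNIT_SIZE 64U /* hypothetical size of one CPU's unit */

    /* Simulated per-CPU units; unit 0 doubles as the link-time section. */
    static _Alignas(64) uint8_t
        example_units[EXAMPLE_CPU_COUNT][EXAMPLE_UNIT_SIZE];

    /* Stand-in for the per-CPU base that tpidr_el3 would hold. */
    uintptr_t example_cpu_base(unsigned int cpu)
    {
        assert(cpu < EXAMPLE_CPU_COUNT);
        return (uintptr_t)&example_units[cpu][0];
    }

    /*
     * Same arithmetic as the per_cpu_cur assembly macro:
     *   offset = link-time address of the object - section start
     *   result = per-CPU base + offset
     */
    void *example_per_cpu_ptr(uintptr_t cpu_base, uintptr_t section_start,
                              uintptr_t link_addr)
    {
        uintptr_t offset = link_addr - section_start;

        return (void *)(cpu_base + offset);
    }

For an object linked 16 bytes into the section, the pointer computed for CPU 2
is the base of CPU 2's unit plus 16, whichever node that unit resides on.
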
Usage Example
=============

Platform Responsibilities
-------------------------

To integrate the NUMA-Aware PER-CPU Framework into a platform, the following
steps must be taken:

1. Enable the Framework
-----------------------

Set `PLATFORM_NODE_COUNT` to a value of 2 or greater in the platform makefile to
enable NUMA-aware per-cpu support, for example:

.. code-block:: text

    # The default is 1; any value >= 2 enables NUMA-aware per-cpu support.
    PLATFORM_NODE_COUNT := 2

Platforms that are not multi-node need not do anything, because
`PLATFORM_NODE_COUNT` defaults to 1. The NUMA framework is not supported for
32-bit images such as BL32 sp_min.

2. Provide Per-CPU Section Base Address Table
---------------------------------------------

Declare and initialize an array holding the base address of the `.per_cpu`
section for each node:

.. code-block:: c

    const uintptr_t per_cpu_nodes_base[] = {
        /* Base addresses per node (platform-specific) */
    };

This array allows efficient mapping from logical CPU IDs to physical memory
regions in multi-node systems.

Note that this is only one example of how a platform can define the `.per_cpu`
section base addresses. Platforms are free to determine and provide these
addresses using other methods, such as device tree parsing, platform-specific
tables, or dynamic discovery logic.

The platform-defined regions that hold remote per-cpu sections should have a
page-aligned base and size, because the xlat library requires a page-aligned
address and size to map an entry. The per-cpu section itself requires only
`CACHE_WRITEBACK_GRANULE` alignment for its base.

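A minimal sketch of such a table, together with a check of the page-alignment
requirement, is shown below. The addresses, the 4 KiB page size, and all
`example_` names are invented for illustration. The sketch uses `uint64_t`
instead of `uintptr_t` only so that the 64-bit example addresses also fit when
it is built on a 32-bit host.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Assumes a 4 KiB translation granule. */
    #define EXAMPLE_PAGE_SIZE 0x1000ULL

    /* Hypothetical base of the per-cpu region on each node. */
    const uint64_t example_per_cpu_nodes_base[] = {
        0x0000000004400000ULL, /* node 0 (made-up address) */
        0x0000080004400000ULL, /* node 1 (made-up address) */
    };

    /* True when base and size are both multiples of the page size. */
    bool example_region_is_mappable(uint64_t base, uint64_t size)
    {
        assert(size != 0U);
        return ((base | size) & (EXAMPLE_PAGE_SIZE - 1U)) == 0U;
    }

A platform could run a check of this kind on every table entry before handing
the regions to the xlat library.
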
3. Implement Required Platform Hooks
------------------------------------

Provide the following platform-specific functions:

- **plat_per_cpu_base(int cpu)**

  Returns the base address of the `.per_cpu` section for the specified CPU.

- **plat_per_cpu_node_base(void)**

  Returns the node base address of the `.per_cpu` section.

- **plat_per_cpu_dcache_clean(void)**

  Cleans the entire per-cpu section from the data cache. This ensures that any
  modifications made to per-cpu data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. It is
  especially important on platforms that do not support hardware managed
  coherency early in the boot.

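One possible shape for the first hook is sketched below as a host-side
simulation. It assumes that CPUs are numbered linearly node by node and that
every node holds the same number of cores; neither assumption comes from the
framework. The `example_` names, the `uintptr_t` return type, and the unit size
are likewise illustrative, and static arrays stand in for node-local memory.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    #define EXAMPLE_NODE_COUNT     2
    #define EXAMPLE_CORES_PER_NODE 2
    #define EXAMPLE_UNIT_SIZE      128U /* hypothetical unit size */

    /* Simulated node-local memory, one region per node. */
    static _Alignas(64) uint8_t
        example_node_mem[EXAMPLE_NODE_COUNT]
                        [EXAMPLE_CORES_PER_NODE * EXAMPLE_UNIT_SIZE];

    /* Base of the per-cpu region on a given node. */
    uintptr_t example_node_base(int node)
    {
        assert(node >= 0 && node < EXAMPLE_NODE_COUNT);
        return (uintptr_t)&example_node_mem[node][0];
    }

    /* Sketch of a plat_per_cpu_base()-style hook for a linear CPU index. */
    uintptr_t example_plat_per_cpu_base(int cpu)
    {
        assert(cpu >= 0 &&
               cpu < EXAMPLE_NODE_COUNT * EXAMPLE_CORES_PER_NODE);

        int node = cpu / EXAMPLE_CORES_PER_NODE; /* node owning this CPU */
        int slot = cpu % EXAMPLE_CORES_PER_NODE; /* unit index in node */

        return example_node_base(node) +
               (uintptr_t)slot * EXAMPLE_UNIT_SIZE;
    }

With two cores per node, CPU 2 is the first core of node 1, so its base is the
start of node 1's region, and CPU 3 follows one unit later.
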
References
==========

- Original Presentation: https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*