NUMA-Aware Per-CPU Framework
============================

.. contents::
   :local:
   :depth: 2

Introduction
------------

Modern system designs increasingly adopt multi-node architectures, where the
system is divided into multiple topological units such as chiplets, sockets, or
other isolated compute and memory units. Each node typically has its own local
memory, and CPUs within a node can access this memory with lower latency than
CPUs on remote nodes. In TF-A's current implementation, per-CPU data (for
example, PSCI or SPM context) is stored in a global array or contiguous region,
usually located in the memory of a single node. This approach introduces two key
issues in multi-node systems:

- **Storage Constraints:** As systems scale to include more CPUs and nodes, this
  centralized allocation becomes a bottleneck. The memory capacity of a single
  node may be insufficient to hold per-CPU data for all CPUs. This constraint
  limits scalability in systems where each node has limited local memory.

  .. figure:: ../resources/diagrams/per-cpu-numa-disabled.png
     :alt: Diagram showing the BL31 binary section layout in TF-A within local
           memory. From bottom to top: \`.text\`, \`.rodata\`, \`.data\`,
           \`.stack\`, \`.bss\`, and \`xlat\` sections. The \`.text\`,
           \`.rodata\`, and \`.data\` segments are \`PROGBITS\` sections, while
           \`.stack\`, \`.bss\`, and \`xlat\` form the \`NOBITS\` sections at
           the top. The memory extends from the local memory start address at
           the bottom to the end address at the top.

  *Figure: Typical BL31/BL32 binary storage in local memory*

- **Non-Uniform Memory Access (NUMA) Latency:** In multi-node systems, memory
  access across nodes incurs additional latency because of interconnect
  traversal. When per-CPU data is centralized on a single node, CPUs on remote
  nodes must access that data via the interconnect, leading to increased latency
  for frequent operations such as context switching, exception handling, and
  crash reporting. This violates NUMA design principles, where data locality is
  critical to achieving performance and scalability.

To address these challenges, TF-A provides the NUMA-Aware Per-CPU Framework. The
framework optimizes the allocation and access of per-CPU objects by letting
platforms place them in the nodes with the lowest access latency.

Design
------

The framework provides standardized interfaces and mechanisms for
**allocating**, **defining**, and **accessing** per-CPU data in a NUMA-aware
environment. This ensures portability and maintainability across different
platforms while optimizing for performance in multi-node systems.

``.per_cpu`` Section
~~~~~~~~~~~~~~~~~~~~

The framework dedicates a zero-initialized, cache-aligned ``.per_cpu`` section
to **allocate** per-CPU global variables and ensure that these objects reside in
the local memory of each NUMA node. The figure below illustrates how per-CPU
objects are allocated in the local memory of their respective nodes.

.. figure:: ../resources/diagrams/per-cpu-numa-enabled.png
   :align: center
   :alt: Diagram comparing the TF-A BL31 memory layout with NUMA disabled versus
         NUMA enabled. When NUMA is disabled, Node 0 contains a local memory
         layout with the \`.text\`, \`.rodata\`, \`.data\`, \`.stack\`,
         \`.bss\`, and \`xlat\` sections stacked vertically. When NUMA is
         enabled, Node 0 includes an additional \`.per_cpu\` section between
         \`.bss\` and \`xlat\` to represent per-CPU data allocation, while
         remote nodes (Node 1 through Node N) each contain their own local
         per-CPU memory regions.

*Figure: BL31/BL32 binary storage in the local memory of each node when the
per-CPU NUMA framework is enabled*

At link time, TF-A linker scripts carve out this section and publish its bounds
and per-object stride via internal symbols so that the section can be replicated
and initialized in the local memory of the non-primary nodes.
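
The exact symbol names are internal to the linker scripts. As a hedged sketch of
how such bounds might be consumed from C, assuming hypothetical
``__PER_CPU_START__`` and ``__PER_CPU_END__`` symbols and using TF-A's
``IMPORT_SYM`` helper:

.. code-block:: c

   #include <stdint.h>

   #include <lib/utils_def.h>

   /*
    * Hypothetical linker symbols bounding the primary node's copy of the
    * .per_cpu section; the real symbol names are internal to the TF-A
    * linker scripts.
    */
   IMPORT_SYM(uintptr_t, __PER_CPU_START__, PER_CPU_START);
   IMPORT_SYM(uintptr_t, __PER_CPU_END__, PER_CPU_END);

   /* Size of one node's copy, used when replicating it to other nodes. */
   #define PER_CPU_COPY_SIZE (PER_CPU_END - PER_CPU_START)
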

This linker section also addresses a common performance issue in modern
multi-CPU systems known as **false sharing**. This issue arises when multiple
CPUs access different addresses that lie on the same cache line. Although the
accessed variables may be logically independent, their proximity in memory can
result in repeated cache invalidations and reloads. Cache-coherency mechanisms
operate at the granularity of cache lines (typically 64 bytes). If two CPUs
write to different addresses within the same cache line, the line bounces
between cores and incurs unnecessary overhead.

.. figure:: ../resources/diagrams/per-cpu-false-sharing.png
   :align: center
   :alt: Diagram showing three CPUs (CPU 1, CPU 2, and CPU 3) each with their
         own cache, connected through a shared interconnect to main memory. At
         address 0x1000, CPU 2's cache holds data values D1, D2, D3, and D4
         representing per-CPU data objects, while CPU 1 and CPU 3 have that
         cache line marked as invalid. CPU 3 is attempting to read from its own
         per-CPU data object, triggering a coherence transaction over the
         interconnect.

*Figure: Two processors modifying different variables placed too closely in
memory, leading to false sharing*
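
As a minimal code-level illustration of the same pattern (the counters and
function below are hypothetical), consider two logically independent counters
that an ordinary ``.bss`` placement may put on the same 64-byte cache line:

.. code-block:: c

   #include <stdint.h>

   /* Two counters that the linker is free to place on one cache line. */
   static uint64_t cpu0_events; /* written frequently by CPU 0 */
   static uint64_t cpu1_events; /* written frequently by CPU 1 */

   /*
    * Each CPU only ever touches its own counter, yet every write forces the
    * shared cache line to migrate between the two CPUs' caches.
    */
   void count_event(unsigned int cpu)
   {
           if (cpu == 0U) {
                   cpu0_events++;
           } else {
                   cpu1_events++;
           }
   }
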

To eliminate false sharing, this framework employs **linker-script-based
alignment** (see the sketch after the list below), which:

- Places all per-CPU variables into a **dedicated, aligned** section
  (``.per_cpu``).
- Aligns that section using the cache granularity size
  (``CACHE_WRITEBACK_GRANULE``).
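
Revisiting the counters from the earlier illustration, the same data expressed
through the framework's definer and accessor macros (described in the following
sections) lands in the aligned ``.per_cpu`` section, so each CPU updates its own
copy instead of contending for a shared line. This is a hedged sketch and the
counter name is hypothetical:

.. code-block:: c

   #include <stdint.h>

   /* One counter per CPU, allocated in the dedicated .per_cpu section. */
   PER_CPU_DEFINE(uint64_t, events);

   void count_event(void)
   {
           /* Pointer to the calling CPU's own copy, in local node memory. */
           uint64_t *my_events = PER_CPU_CUR(events);

           (*my_events)++;
   }
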

Definer Interfaces
~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework provides a set of macros to define and declare
per-CPU objects efficiently in multi-node systems; a usage sketch follows the
list below.

- ``PER_CPU_DECLARE(TYPE, NAME)``

  Declares an external per-CPU object so that other translation units can refer
  to it without allocating storage.

- ``PER_CPU_DEFINE(TYPE, NAME)``

  Defines a per-CPU object and assigns it to ``PER_CPU_SECTION_NAME`` so the
  linker emits it into the ``.per_cpu`` section that the framework manages.
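
A hedged usage sketch of the definer macros follows; ``crash_buf_t`` and
``cpu_crash_buf`` are hypothetical names, and the include for the framework
header itself is omitted:

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical per-CPU crash buffer type. */
   typedef struct crash_buf {
           uint64_t gpregs[31];
           uint64_t sp;
   } crash_buf_t;

   /* In exactly one translation unit: allocate the object in .per_cpu. */
   PER_CPU_DEFINE(crash_buf_t, cpu_crash_buf);

   /* In any other translation unit: refer to it without allocating storage. */
   PER_CPU_DECLARE(crash_buf_t, cpu_crash_buf);
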

Accessor Interfaces
~~~~~~~~~~~~~~~~~~~

The NUMA-Aware Per-CPU Framework also provides macros to access per-CPU objects
efficiently in multi-node systems; an example follows the list below.

- ``PER_CPU_BY_INDEX(NAME, CPU)``

  Returns a pointer to the per-CPU object ``NAME`` for the specified CPU by
  combining the per-node base with the object's offset within ``.per_cpu``.

- ``PER_CPU_CUR(NAME)``

  Returns a pointer to the per-CPU object ``NAME`` for the current CPU.
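
A hedged sketch of the accessors, reusing the hypothetical ``cpu_crash_buf``
object from the previous example:

.. code-block:: c

   void record_crash_sp(uint64_t sp, unsigned int target_cpu)
   {
           /* The calling CPU's copy, resident in this node's local memory. */
           crash_buf_t *mine = PER_CPU_CUR(cpu_crash_buf);

           mine->sp = sp;

           /* Another CPU's copy, e.g. when reporting a remote CPU's state. */
           crash_buf_t *other = PER_CPU_BY_INDEX(cpu_crash_buf, target_cpu);

           other->sp = 0ULL;
   }
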

For use in assembly routines, the framework provides a corresponding helper
macro: ``per_cpu_cur`` performs the same calculation as ``PER_CPU_CUR``. It
accepts the label of the per-CPU object and optional register arguments
(destination and clobber) to materialize the per-CPU pointer without duplicating
addressing logic in assembly files.

Platform Responsibilities (NUMA-only)
-------------------------------------

When NUMA is enabled, the platform must meet some additional requirements so
that the runtime can correctly set up per-CPU sections on remote nodes:

1. Enable the Framework
~~~~~~~~~~~~~~~~~~~~~~~

Set ``PLATFORM_NODE_COUNT`` to a value greater than 1 in the platform makefile
to enable NUMA-aware per-CPU support:

.. code-block:: make

   PLATFORM_NODE_COUNT := 2  # >= 2 enables NUMA-aware per-CPU support

Platforms that are not multi-node do not need to modify this value because the
default ``PLATFORM_NODE_COUNT`` is 1. The NUMA framework is not supported in
32-bit images such as BL32 SP_MIN.

2. Provide Per-CPU Section Base Address Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure that the platform can supply the base address of the ``.per_cpu`` section
for each node and CPU when implementing ``plat_per_cpu_node_base`` and
``plat_per_cpu_base``. The framework does not mandate how this information is
obtained, only that each hook returns a valid base address. Platforms may:

- derive the base addresses from platform descriptors or firmware configuration
  data;
- read them from device tree nodes or other runtime discovery mechanisms; or
- encode them in platform-specific tables compiled into the image.

If a node described in platform data is not populated at runtime, the hooks may
return ``UINT64_MAX`` to signal that no per-CPU section exists for that node.

The platform is free to maintain this mapping however it prefers, either at
compile time or through runtime discovery. The only requirement is that the
``plat_per_cpu_node_base`` and ``plat_per_cpu_base`` hooks translate a node or
CPU identifier into the base address of the corresponding ``.per_cpu`` section.

Platform-defined regions that hold remote per-CPU sections must have
page-aligned bases and sizes for page table mapping through the xlat library,
which requires page alignment for mapped entries. The per-CPU section itself
requires only cache writeback granule alignment for its base.
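
As one possible arrangement, a platform could describe a remote node's region
with compile-time memory-map macros and assert the alignment rules above at
build time. All names and values below are hypothetical:

.. code-block:: c

   #include <lib/cassert.h>
   #include <lib/utils_def.h>
   #include <lib/xlat_tables/xlat_tables_defs.h>

   /* Hypothetical location and size of node 1's copy of .per_cpu. */
   #define NODE1_PER_CPU_BASE ULL(0x880000000)
   #define NODE1_PER_CPU_SIZE ULL(0x10000)

   /*
    * Remote per-CPU regions are mapped through the xlat library, so both the
    * base and the size must be page-aligned.
    */
   CASSERT((NODE1_PER_CPU_BASE % PAGE_SIZE) == 0U, assert_node1_base_aligned);
   CASSERT((NODE1_PER_CPU_SIZE % PAGE_SIZE) == 0U, assert_node1_size_aligned);
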

3. Implement Required Platform Hooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide the following platform-specific functions (a minimal sketch follows the
list):

- ``uintptr_t plat_per_cpu_base(uint64_t cpu)``

  Returns the base address of the ``.per_cpu`` section for the specified CPU.

- ``uintptr_t plat_per_cpu_node_base(uint64_t node)``

  Returns the base address of the ``.per_cpu`` section for the specified node.

- ``uintptr_t plat_per_cpu_dcache_clean(void)``

  Cleans the entire per-CPU section from the data cache. This ensures that any
  modifications made to per-CPU data are written back to memory, making them
  visible to other CPUs or system components that may access this memory. This
  step is especially important on platforms that do not support hardware-managed
  coherency early in the boot process.
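
A minimal sketch of these hooks for an imaginary two-node platform follows. It
reuses the hypothetical ``NODE1_PER_CPU_BASE`` macro and
``__PER_CPU_START__``/``__PER_CPU_END__`` linker symbols from the earlier
sketches; the topology helper and the return value of the cache-clean hook are
likewise assumptions, not part of the framework's contract:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #include <arch_helpers.h>
   #include <platform_def.h>

   /* Hypothetical linker symbols bounding the local copy of .per_cpu. */
   extern char __PER_CPU_START__[], __PER_CPU_END__[];

   /* Illustrative helper mapping a CPU index to the node it belongs to. */
   static uint64_t plat_cpu_to_node(uint64_t cpu)
   {
           return cpu / (PLATFORM_CORE_COUNT / PLATFORM_NODE_COUNT);
   }

   uintptr_t plat_per_cpu_node_base(uint64_t node)
   {
           /* Node 0 uses the copy linked into the image itself. */
           if (node == 0U) {
                   return (uintptr_t)__PER_CPU_START__;
           }

           /* NODE1_PER_CPU_BASE is the hypothetical memory-map macro. */
           if (node == 1U) {
                   return NODE1_PER_CPU_BASE;
           }

           /* Unpopulated nodes report that no per-CPU section exists. */
           return UINT64_MAX;
   }

   uintptr_t plat_per_cpu_base(uint64_t cpu)
   {
           return plat_per_cpu_node_base(plat_cpu_to_node(cpu));
   }

   uintptr_t plat_per_cpu_dcache_clean(void)
   {
           size_t size = (size_t)(__PER_CPU_END__ - __PER_CPU_START__);

           /* Write back every populated node's copy to main memory. */
           for (uint64_t node = 0U; node < PLATFORM_NODE_COUNT; node++) {
                   uintptr_t base = plat_per_cpu_node_base(node);

                   if (base != UINT64_MAX) {
                           clean_dcache_range(base, size);
                   }
           }

           /* The return value convention is not covered by this sketch. */
           return 0U;
   }
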

References
----------

- Original presentation:
  https://www.trustedfirmware.org/docs/NUMA-aware-PER-CPU-framework-18Jul24.pdf

--------------

*Copyright (c) 2025, Arm Limited and Contributors. All rights reserved.*