1Reliability, Availability, and Serviceability (RAS) Extensions
2**************************************************************
3
4This document describes |TF-A| support for Arm Reliability, Availability, and
5Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
6later CPUs, and also an optional extension to the base Armv8.0 architecture.
7
For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and its terminology.
12
**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present, then FEAT_IESB is also implemented.
15
There are two philosophies for handling RAS errors from the Non-secure world's
point of view:
18
19- :ref:`Firmware First Handling (FFH)`
20- :ref:`Kernel First Handling (KFH)`
21
22.. _Firmware First Handling (FFH):
23
24Firmware First Handling (FFH)
25=============================
26
27Introduction
28------------
29
EAs and error interrupts corresponding to NS nodes are handled first in firmware:

-  Errors are signaled back to the NS world via a suitable mechanism.
-  The kernel is prohibited from accessing the RAS error records directly.
-  Firmware creates CPER records for the kernel to navigate and process.
-  Firmware signals the error back to the kernel via SDEI.
36
37Overview
38--------
39
FFH works in conjunction with the `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault Handling
and Error Recovery interrupts.
The RAS framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.
47
48
49.. __: `Standard Error Record helpers`_
50
51.. _Kernel First Handling (KFH):
52
53Kernel First Handling (KFH)
54===========================
55
56Introduction
57------------
58
EAs originating from, or attributed to, the NS world are handled first in NS, and the kernel
navigates the standard error records directly.

-  KFH is the default handling mode if the platform does not explicitly enable FFH mode.
-  KFH mode does not need any EL3 involvement except for the reflection of errors back
   to the lower EL. This happens when there is an error (EA) in the system that has not yet
   been signaled to the PE while executing at the lower EL. During entry into EL3, the errors
   (EAs) are synchronized, causing the async EA to pend at EL3.
67
Error Synchronization at EL3 entry
==================================
70
During entry to EL3 from a lower EL, any pending async EAs are either
reflected back to the lower EL (KFH) or handled in EL3 itself (FFH).
73
74|Image 1|
75
76Limitation in KFH Mode
77----------------------
78
79When handling asynchronous External Aborts (EAs) synchronized at EL3 entry in Kernel First Handling
80(KFH) mode, there is a limitation in the current implementation:
81
82* The handler reflects pending async EAs back to the lower EL if the EA routing model is KFH
83* However, if the asynchronous EA is masked at the target exception level, or if its priority
84  relative to an EL3/secure interrupt is lower, repeated back-and-forth transitions between
85  lower EL and EL3 can occur.
86
87To prevent infinite cycling between EL3 and lower EL, a loop counter (``CTX_NESTED_EA_FLAG``) and
88the previously saved ELR (``CTX_SAVED_ELR_EL3``) are used to detect this condition. If a loop is
89detected, EL3 will trigger a panic (label ``check_loop_ctr``) to indicate a problem.
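The loop detection described above can be modelled in C roughly as follows.
This is a simplified sketch and not the TF-A implementation: the real check
lives at the ``check_loop_ctr`` label and tracks ``CTX_SAVED_ELR_EL3`` and
``CTX_NESTED_EA_FLAG``. The structure, the function name, and the replay
threshold ``DEMO_REPLAY_LIMIT`` are hypothetical and chosen only for
illustration.

.. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical threshold: replays at the same ELR that count as a loop. */
    #define DEMO_REPLAY_LIMIT 2U

    /* Hypothetical state, standing in for the saved ELR and the loop counter. */
    struct demo_ea_state {
        uint64_t saved_elr;   /* return address seen on the previous reflection */
        unsigned int replays; /* consecutive reflections at that same address */
    };

    /*
     * Called each time a pending async EA is about to be reflected to the
     * lower EL. Returns true when the same return address keeps recurring,
     * that is, when the lower EL makes no progress and EL3 should give up.
     */
    static bool demo_reflection_loops(struct demo_ea_state *st, uint64_t elr)
    {
        if (st->saved_elr != elr) {
            /* New return address: progress was made, restart counting. */
            st->saved_elr = elr;
            st->replays = 0U;
            return false;
        }

        st->replays++;
        return st->replays >= DEMO_REPLAY_LIMIT;
    }

In this model, a change in the return address resets the counter, so only
back-to-back reflections to the same instruction are treated as a loop.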
90
91Future Plan: Delegated SError Injection (FEAT_E3DSE)
92----------------------------------------------------
93
94In future revisions, this limitation can be mitigated by utilizing **FEAT_E3DSE** — the
95**Delegated SError exception injection** feature introduced for EL3.
96
97FEAT_E3DSE provides a mechanism for EL3 to inject a virtual SError into lower exception levels.
98Once this capability is supported in TF-A, EL3 will be able to handle the original exception
99and then inject the delegated SError to the appropriate lower EL before returning, thereby
100eliminating the need for panic handling in this scenario.
101
102This planned enhancement will improve robustness and correctness of asynchronous error handling
103in KFH mode.
104
105TF-A build options
106==================
107
108- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
109- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH
110- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
111- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
112  HANDLE_EA_EL3_FIRST_NS put together.
113
RAS internal macros:
115
116- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.
117
The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform
122
123TF-A Tests
124==========
125
RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has multiple
configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as an NS-EL2 payload.
130
131- **FFH without RAS extension**
132
  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3.
  The tests inject External Aborts (sync/async) that trap to EL3. FVP has a handler that gracefully
  handles these errors and returns to TF-A Tests.

  Build Configs: **HANDLE_EA_EL3_FIRST_NS**, **PLATFORM_TEST_EA_FFH**
140
141- **FFH with RAS extension**
142
  Three tests:
144
145  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*
146
147    Inject an unrecoverable RAS error, which gets handled in EL3.
148
149  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*
150
    Inject uncontainable RAS errors, which cause the platform to panic.
152
153  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*
154
    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower EL
    which remains pending until EL3 is entered through an SMC call. On encountering a pending async EA
    at EL3 entry, EL3 handles the async EA first (nested exception) before handling the original SMC call.
158
- **KFH with RAS extension**

  A couple of tests in the group:

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests: two tests, one each for reflection
    in the IRQ and SMC paths.
171
172RAS Framework
173=============
174
175
176.. _ras-figure:
177
178.. image:: ../resources/diagrams/draw.io/ras.svg
179
180Platform APIs
181-------------
182
The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.
186
187Registering RAS error records
188-----------------------------
189
RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.
197
198The platform should enumerate the error records providing for each of them:
199
200-  A handler to probe error records for errors;
201-  When the probing identifies an error, a handler to handle it;
-  For a memory-mapped error record, its base address and size in KB; for a system
   register-accessed record, the start index of the record and the number of
   continuous records from that index;
205-  Any node-specific auxiliary data.
206
With this information supplied, when the run-time firmware receives one of these
notifications, the RAS framework can iterate through the error records, probe
them for errors, and invoke the appropriate handler to handle any error found.
210
The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its arguments,
which is later passed to the probe and error handlers.
215
216For memory-mapped error records:
217
218.. code:: c
219
220    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)
221
222And, for system register ones:
223
224.. code:: c
225
226    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)
227
228The probe handler must have the following prototype:
229
230.. code:: c
231
232    typedef int (*err_record_probe_t)(const struct err_record_info *info,
233                    int *probe_data);
234
235The probe handler must return a non-zero value if an error was detected, or 0
236otherwise. The ``probe_data`` output parameter can be used to pass any useful
237information resulting from probe to the error handler (see `below`__). For
238example, it could return the index of the record.
239
240.. __: `Standard Error Record helpers`_
241
242The error handler must have the following prototype:
243
244.. code:: c
245
246    typedef int (*err_record_handler_t)(const struct err_record_info *info,
247               int probe_data, const struct err_handler_data *const data);
248
249The ``data`` constant parameter describes the various properties of the error,
250including the reason for the error, exception syndrome, and also ``flags``,
251``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
252<EL3 interrupts>`.
253
The platform is expected to populate an array using the macros above, and register
it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.
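The probe-then-handle flow described in this section can be modelled with a
small standalone program. The sketch below does not use the TF-A headers or
macros: ``struct demo_record``, ``demo_probe``, ``demo_handler`` and
``demo_dispatch`` are hypothetical stand-ins that only mirror the shape of the
prototypes above, and the ``status`` array is an invented substitute for real
error record registers.

.. code:: c

    #include <stddef.h>

    struct demo_record;

    /* Same shape as the probe and handler prototypes, with stand-in types. */
    typedef int (*demo_probe_t)(const struct demo_record *info, int *probe_data);
    typedef int (*demo_handler_t)(const struct demo_record *info, int probe_data);

    struct demo_record {
        demo_probe_t probe;
        demo_handler_t handler;
        const int *status; /* hypothetical: one status word per error record */
        int num_records;
    };

    /* Probe: report the index of the first record whose status is non-zero. */
    static int demo_probe(const struct demo_record *info, int *probe_data)
    {
        for (int i = 0; i < info->num_records; i++) {
            if (info->status[i] != 0) {
                *probe_data = i;
                return 1;
            }
        }
        return 0;
    }

    static int demo_last_handled = -1;

    /* Handler: a real one would log, recover or escalate; this one records. */
    static int demo_handler(const struct demo_record *info, int probe_data)
    {
        (void)info;
        demo_last_handled = probe_data;
        return 0;
    }

    /* Probe every entry and run the handler of each entry that reports an error. */
    static int demo_dispatch(const struct demo_record *recs, size_t n)
    {
        int handled = 0;

        for (size_t i = 0; i < n; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) != 0) {
                (void)recs[i].handler(&recs[i], probe_data);
                handled++;
            }
        }
        return handled;
    }

The value written through ``probe_data`` is the only channel from the probe to
the handler in this model, which is why the probe stores the record index there.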
258
259Standard Error Record helpers
260~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
261
262The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
263both memory-mapped and System Register accesses:
264
265.. code:: c
266
267    int ras_err_ser_probe_memmap(const struct err_record_info *info,
268                int *probe_data);
269
270    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
271                int *probe_data);
272
When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of writing custom probe
handlers. Both helpers above:
276
277-  Return non-zero value when an error is detected in a Standard Error Record;
278-  Set ``probe_data`` to the index of the error record upon detecting an error.
279
280Registering RAS interrupts
281--------------------------
282
RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.
287
For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:
290
291-  Interrupt number;
292-  The associated error record information (pointer to the corresponding
293   ``struct err_record_info``);
294-  Optionally, a cookie.
295
296The platform is expected to define an array of ``struct ras_interrupt``, and
297register it with the RAS framework using the macro
298``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
299macro must be used in the same file where the array is defined.
300
The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.
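The reason for the sorting requirement can be seen in a standalone binary
search. The sketch below is illustrative only: ``struct demo_interrupt``, its
fields, the interrupt numbers, and ``demo_find`` are hypothetical and are not
the TF-A definitions.

.. code:: c

    #include <stddef.h>

    /* Hypothetical stand-in for an interrupt descriptor. */
    struct demo_interrupt {
        unsigned int intr_number;
        const char *record_name; /* stands in for the error record pointer */
    };

    /* Must stay sorted by ascending intr_number for the bisection to work. */
    static const struct demo_interrupt demo_interrupts[] = {
        { 32U, "node-a" },
        { 47U, "node-b" },
        { 90U, "node-c" },
    };

    /* Bisect the sorted array; returns NULL for an unregistered interrupt. */
    static const struct demo_interrupt *demo_find(unsigned int intr)
    {
        size_t lo = 0U;
        size_t hi = sizeof(demo_interrupts) / sizeof(demo_interrupts[0]);

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2U;

            if (demo_interrupts[mid].intr_number == intr)
                return &demo_interrupts[mid];
            if (demo_interrupts[mid].intr_number < intr)
                lo = mid + 1U;
            else
                hi = mid;
        }
        return NULL;
    }

If the array were not sorted, the search could skip over a registered
interrupt and report it as unknown.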
304
305Double-fault handling
306---------------------
307
A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.
312
313The RAS extensions part of Armv8.4 introduced new architectural features to deal
314with Double Fault conditions, specifically, the introduction of ``NMEA`` and
315``EASE`` bits to ``SCR_EL3`` register. These were introduced to assist EL3
316software which runs part of its entry/exit routines with exceptions momentarily
317masked—meaning, in such systems, External Aborts/SErrors are not immediately
318handled when they occur, but only after the exceptions are unmasked again.
319
|TF-A|, for legacy reasons, executes all EL3 code with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
thus is able to detect Double Fault conditions in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.
324
Double faults are fatal: they terminate at the platform double fault handler,
which does not return.
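One way to picture software detection of a Double Fault is an "error handling
in progress" marker that is checked whenever a new error arrives. The sketch
below is a conceptual model under that assumption and does not reproduce the
TF-A entry code; every name in it is hypothetical, and the terminal case
returns a value only so that the model can be exercised.

.. code:: c

    #include <stdbool.h>
    #include <stddef.h>

    enum demo_outcome { DEMO_HANDLED, DEMO_DOUBLE_FAULT };

    typedef void (*demo_handler_fn)(void);

    static bool demo_handling; /* true while an error is being handled */
    static enum demo_outcome demo_nested_outcome = DEMO_HANDLED;

    static enum demo_outcome demo_on_error(demo_handler_fn handler)
    {
        if (demo_handling) {
            /*
             * A second error arrived before the first was fully handled.
             * A real platform handler would shut down and never return.
             */
            return DEMO_DOUBLE_FAULT;
        }

        demo_handling = true;
        if (handler != NULL)
            handler();
        demo_handling = false;
        return DEMO_HANDLED;
    }

    /* Handler that itself hits another error, simulating the nested case. */
    static void demo_faulting_handler(void)
    {
        demo_nested_outcome = demo_on_error(NULL);
    }

The model relies on errors being delivered immediately, which matches the
observation above that exceptions routed to EL3 are not left masked.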
327
328Engaging the RAS framework
329--------------------------
330
Enabling RAS support is a platform choice.
332
The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.
339
Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.
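The shape of such an override can be sketched without the TF-A headers. In the
standalone model below, ``demo_ras_ea_handler`` stands in for
``ras_ea_handler()`` and is assumed to return non-zero when a registered error
record claimed the error. The parameter list, the fallback function, and the
syndrome values are illustrative assumptions, not the framework's definitions.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    static int demo_fallback_calls;

    /* Stand-in for the RAS framework entry point (assumed behaviour). */
    static int demo_ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                                   void *cookie, void *handle, uint64_t flags)
    {
        (void)ea_reason;
        (void)cookie;
        (void)handle;
        (void)flags;
        /* Pretend that only syndrome 0x1 maps to a registered error record. */
        return (syndrome == 0x1U) ? 1 : 0;
    }

    /* Platform-specific last resort for errors the framework did not claim. */
    static void demo_platform_fallback(uint64_t syndrome)
    {
        (void)syndrome;
        demo_fallback_calls++;
    }

    /* Override shape: try the RAS framework first, then fall back. */
    static void demo_plat_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                                     void *cookie, void *handle, uint64_t flags)
    {
        if (demo_ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
            return;

        demo_platform_fallback(syndrome);
    }

Calling the framework first keeps registered error records authoritative, and
the platform-specific path only runs for errors that no record claimed.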
343
Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.
350
351Interaction with Exception Handling Framework
352---------------------------------------------
353
354As mentioned in earlier sections, RAS framework interacts with the |EHF| to
355arbitrate handling of RAS exceptions with others that are routed to EL3. This
356means that the platform must partition a :ref:`priority level <Partitioning
357priority levels>` for handling RAS exceptions. The platform must then define
358the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
359Platforms would typically want to allocate the highest secure priority for
360RAS handling.
361
Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, priority management is implicit for interrupts, but
explicit for non-interrupt exceptions, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.
367
368--------------
369
370*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*
371
372.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
373.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master
374
375.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png
376