Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and terminology.

There are two philosophies for handling RAS errors from the Non-secure world's
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and error interrupts corresponding to NS nodes are
handled first in firmware:

- Errors are signalled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with `Exception Handling Framework`. Exceptions
resulting from errors in the Non-secure world are routed to and handled in EL3.
These errors are reported as Synchronous External Aborts (SEAs), Asynchronous
External Aborts (signalled as SErrors), and Fault Handling and Error Recovery
interrupts.
The RAS framework in TF-A allows the platform to define an external abort
handler and to register RAS nodes and interrupts. It also provides `helpers`__
for accessing Standard Error Records as introduced by the RAS extensions.


.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating in or attributed to the NS world are handled first in NS, and
the kernel navigates the standard error records directly.

**KFH can be supported in a platform without TF-A being aware of it, but there
are a few corner cases where TF-A needs special handling, which is currently
missing and will be added in the future.**

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

RAS Framework
=============


.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format.
The RAS architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of
these notifications, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle each one.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments, which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
            int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
            int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
            int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
            int *probe_data);

When the platform enumerates error records, these helpers may be used for
records in the Standard Error Record format, instead of the platform writing
its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not handled immediately
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect a Double Fault condition in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which does not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.
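
As a rough illustration of the priority allocation described above, the
following sketch shows how a platform header might place ``PLAT_RAS_PRI`` ahead
of other EL3 priority levels. The numeric values, the ``PLAT_EXAMPLE_PRI_A`` and
``PLAT_EXAMPLE_PRI_B`` macros, and the ``pri_is_higher()`` and
``check_ras_pri_layout()`` helpers are hypothetical names introduced only for
this example; none of them is part of the RAS framework. The sketch also
assumes the GIC convention that a numerically lower value denotes a higher
priority.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical priority partitioning, for illustration only. A real
     * platform picks values that match its own EHF priority partitioning.
     */
    #define PLAT_RAS_PRI            0x10U   /* level reserved for RAS */
    #define PLAT_EXAMPLE_PRI_A      0x60U   /* hypothetical: another EL3 client */
    #define PLAT_EXAMPLE_PRI_B      0x70U   /* hypothetical: another EL3 client */

    /* Assumed GIC convention: a lower numeric value is a higher priority. */
    static inline bool pri_is_higher(uint8_t a, uint8_t b)
    {
        return a < b;
    }

    /* Sanity check that RAS outranks the other levels listed in this sketch. */
    static inline void check_ras_pri_layout(void)
    {
        assert(pri_is_higher(PLAT_RAS_PRI, PLAT_EXAMPLE_PRI_A));
        assert(pri_is_higher(PLAT_RAS_PRI, PLAT_EXAMPLE_PRI_B));
    }

With a layout like this, a RAS exception would be able to preempt handling
that runs at either of the other two levels, but not the other way round.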

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit;
for non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest