Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and its terminology.

From the point of view of the Non-secure world, there are two philosophies for
handling RAS errors:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and error interrupts corresponding to NS nodes are handled
first in firmware:

- Errors are signalled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with the Exception Handling Framework (|EHF|).
Exceptions resulting from errors in the Non-secure world are routed to and
handled in EL3. These errors are Synchronous External Aborts (SEA), Asynchronous
External Aborts (signalled as SErrors), and Fault Handling and Error Recovery
interrupts.
The RAS framework in TF-A allows the platform to define an external abort
handler and to register RAS nodes and interrupts. It also provides `helpers`__
for accessing Standard Error Records as introduced by the RAS extensions.


.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating in, or attributed to, the NS world are handled first in the NS
world, and the kernel navigates the standard error records directly.

**KFH can be supported in a platform without TF-A being aware of it, but there
are a few corner cases where TF-A needs special handling, which is currently
missing and will be added in the future.**

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Manage the FEAT_RAS extension when switching worlds.
- **RAS_FFH_SUPPORT**: Pull in the necessary framework and platform hooks for
  Firmware First Handling (FFH) of RAS errors.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access to RAS error record
  registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  RAS_FFH_SUPPORT put together.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP
  platform

RAS Framework
=============


.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error.
Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the runtime firmware receives a
notification through one of these mechanisms, the RAS framework can iterate
through the error records, probe them for errors, and invoke the appropriate
handler for each error found.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments, which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                   int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, it may use these helpers for
records in the Standard Error Record format instead of writing its own probe
handlers. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.
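
Whichever notification arrives, the framework ends up running the same
probe-then-handle sequence over the registered error records. The following is
a minimal, self-contained sketch of that contract. It is not the |TF-A|
implementation: every ``demo_*`` name is hypothetical, the structure is a
simplified stand-in for ``struct err_record_info``, and the handler omits the
``struct err_handler_data`` argument.

.. code:: c

    #include <stddef.h>

    struct demo_record;

    /* Simplified stand-ins for err_record_probe_t and err_record_handler_t. */
    typedef int (*demo_probe_t)(const struct demo_record *rec, int *probe_data);
    typedef int (*demo_handler_t)(const struct demo_record *rec, int probe_data);

    /* Hypothetical stand-in for struct err_record_info. */
    struct demo_record {
        demo_probe_t probe;
        demo_handler_t handler;
        const int *status;  /* Models the error status of each record */
        int num_idx;        /* Number of records in this group */
    };

    /* Index most recently passed to demo_handler(), or -1 if none. */
    static int demo_last_index = -1;

    /* Return non-zero and report the record index if any record shows an error. */
    static int demo_probe(const struct demo_record *rec, int *probe_data)
    {
        for (int i = 0; i < rec->num_idx; i++) {
            if (rec->status[i] != 0) {
                *probe_data = i;
                return 1;
            }
        }
        return 0;
    }

    /* Note which index was handled; a real handler would process the error. */
    static int demo_handler(const struct demo_record *rec, int probe_data)
    {
        (void)rec;
        demo_last_index = probe_data;
        return 0;
    }

    /*
     * Probe every entry and call its handler where an error is found.
     * Returns the number of entries whose handler was invoked.
     */
    static int demo_dispatch(const struct demo_record *recs, size_t num)
    {
        int handled = 0;

        for (size_t i = 0; i < num; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) != 0) {
                (void)recs[i].handler(&recs[i], probe_data);
                handled++;
            }
        }
        return handled;
    }

In this sketch, the value the probe writes to ``probe_data`` is the only state
carried over to the handler, in the same way that the Standard Error Record
helpers report the index of the record in error.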

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in the
``SCR_EL3`` register. These were introduced to assist EL3 software which runs
part of its entry/exit routines with exceptions momentarily masked. In such
systems, External Aborts/SErrors are not handled immediately when they occur,
but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of its EL3 code with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect Double Fault conditions in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.
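
Detecting a Double Fault in software comes down to noticing that a new error
arrives while an earlier one is still being handled. The sketch below models
that idea with a per-CPU flag. The flag is an assumption made for illustration,
not a description of how |TF-A| tracks this state, and all ``demo_*`` names are
hypothetical.

.. code:: c

    #include <stdbool.h>

    /* Hypothetical per-CPU state: true while a RAS error is being handled. */
    struct demo_cpu_state {
        bool ras_handling_active;
    };

    enum demo_action {
        DEMO_HANDLE_ERROR,  /* First error: proceed to normal handling */
        DEMO_DOUBLE_FAULT   /* Error during handling: treat as fatal */
    };

    /* Called on entry to error handling; classifies the incoming error. */
    static enum demo_action demo_error_entry(struct demo_cpu_state *cpu)
    {
        if (cpu->ras_handling_active) {
            return DEMO_DOUBLE_FAULT;
        }
        cpu->ras_handling_active = true;
        return DEMO_HANDLE_ERROR;
    }

    /* Called once the earlier error has been fully handled. */
    static void demo_error_exit(struct demo_cpu_state *cpu)
    {
        cpu->ras_handling_active = false;
    }

A caller that receives ``DEMO_DOUBLE_FAULT`` would hand over to the platform
double fault handler rather than attempt recovery.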

Double faults are fatal: they terminate at the platform double fault handler,
which doesn't return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_FFH_SUPPORT``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.
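
As an illustration of allocating the highest priority to RAS, the fragment
below gives ``PLAT_RAS_PRI`` a numerically lower value than two other priority
levels, since a lower GIC priority value means a higher priority. The values
and the two ``DEMO_*`` macros are assumptions chosen for this sketch and are
not taken from any platform port.

.. code:: c

    /* Illustrative values only; a real port picks values to suit its GIC. */
    #define PLAT_RAS_PRI          0x10
    #define DEMO_OTHER_EL3_PRI_A  0x20  /* Hypothetical lower-priority level */
    #define DEMO_OTHER_EL3_PRI_B  0x30  /* Hypothetical lower-priority level */

    /* A lower GIC priority value means a higher priority. */
    _Static_assert(PLAT_RAS_PRI < DEMO_OTHER_EL3_PRI_A,
                   "RAS must outrank level A");
    _Static_assert(PLAT_RAS_PRI < DEMO_OTHER_EL3_PRI_B,
                   "RAS must outrank level B");

The compile-time checks make the intended ordering explicit, so a later change
to any of the values that breaks it fails the build.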

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, priority management is implicit; for
non-interrupt exceptions, it is explicit, using the :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest