Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extensions enables a
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. These errors are Synchronous External
Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault
Handling and Error Recovery interrupts. The |EHF| document mentions various
`error handling use-cases`__.

.. __: exception-handling.rst#delegation-use-cases

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and its terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.
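
As a rough illustration of the flow that the following sections describe, the
standalone sketch below models a framework walking a table of error records,
calling each record's probe, and invoking the matching handler. Every name in
it (``demo_record``, ``demo_probe``, ``demo_handler``, ``demo_dispatch``) is
hypothetical and simplified for illustration; it does not include any |TF-A|
header, does not reproduce the real ``struct err_record_info`` layout, and
stopping at the first record that reports an error is an assumption of the
sketch rather than documented framework behaviour.

.. code:: c

    #include <stddef.h>

    /* Hypothetical, simplified stand-in for one error record entry. */
    struct demo_record;

    typedef int (*demo_probe_t)(const struct demo_record *rec, int *probe_data);
    typedef int (*demo_handler_t)(const struct demo_record *rec, int probe_data);

    struct demo_record {
        demo_probe_t probe;     /* returns non-zero when an error is pending */
        demo_handler_t handler; /* handles the error found by the probe */
        const int *status;      /* illustrative per-index "error pending" flags */
        int num_idx;            /* number of entries behind status */
    };

    /* Illustrative probe: report the index of the first pending entry. */
    static int demo_probe(const struct demo_record *rec, int *probe_data)
    {
        for (int i = 0; i < rec->num_idx; i++) {
            if (rec->status[i] != 0) {
                *probe_data = i;
                return 1;
            }
        }
        return 0;
    }

    /* Illustrative handler: remember which index was handled. */
    static int demo_last_handled = -1;

    static int demo_handler(const struct demo_record *rec, int probe_data)
    {
        (void)rec;
        demo_last_handled = probe_data;
        return 0;
    }

    /*
     * Walk the records in order. Returns 1 if a probe reported an error and
     * its handler was invoked, or 0 if no record reported an error.
     */
    static int demo_dispatch(const struct demo_record *recs, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) != 0) {
                recs[i].handler(&recs[i], probe_data);
                return 1;
            }
        }
        return 0;
    }

A caller would build an array of ``struct demo_record`` entries and pass it to
``demo_dispatch()`` whenever an error notification arrives. The value written
through ``probe_data`` is how the probe tells the handler which entry it found.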

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to the porting guide for the `RAS platform API descriptions`__.

.. __: ../getting_started/porting-guide.rst#external-abort-handling-and-ras-support

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of the
notification mechanisms, the RAS framework can iterate through and probe the
error records for errors, and invoke the appropriate handler to handle each
one.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments; that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the `top-level exception handler`__.

.. __: interrupt-framework-design.rst#el3-interrupts

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, these helpers may be used for
records in the Standard Error Record format, instead of the platform writing
its own probe handlers. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
When a Double Fault 176condition arises, the Arm RAS extensions only require for handler to perform 177orderly shutdown of the system, as recovery may be impossible. 178 179The RAS extensions part of Armv8.4 introduced new architectural features to deal 180with Double Fault conditions, specifically, the introduction of ``NMEA`` and 181``EASE`` bits to ``SCR_EL3`` register. These were introduced to assist EL3 182software which runs part of its entry/exit routines with exceptions momentarily 183masked—meaning, in such systems, External Aborts/SErrors are not immediately 184handled when they occur, but only after the exceptions are unmasked again. 185 186|TF-A|, for legacy reasons, executes entire EL3 with all exceptions unmasked. 187This means that all exceptions routed to EL3 are handled immediately. |TF-A| 188thus is able to detect a Double Fault conditions in software, without needing 189the intended advantages of Armv8.4 Double Fault architecture extensions. 190 191Double faults are fatal, and terminate at the platform double fault handler, and 192doesn't return. 193 194Engaging the RAS framework 195-------------------------- 196 197Enabling RAS support is a platform choice constructed from three distinct, but 198related, build options: 199 200- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware; 201 202- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See 203 `Interaction with Exception Handling Framework`_; 204 205- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to 206 EL3. 207 208The RAS support in |TF-A| introduces a default implementation of 209``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION`` 210is set to ``1``, it'll first call ``ras_ea_handler()`` function, which is the 211top-level RAS exception handler. 
``ras_ea_handler`` is responsible for iterating through the platform-supplied
error records, probing them, and, when an error is identified, looking up and
invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a `priority level`__ for handling RAS
exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.

.. __: exception-handling.rst#partitioning-priority-levels

Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
sequences outlined in the |EHF| documentation. That is, for interrupts, the
priority management is implicit; for non-interrupt exceptions, it is explicit,
using `EHF APIs`__.

.. __: exception-handling.rst#interrupt-flow
.. __: exception-handling.rst#non-interrupt-flow
.. __: exception-handling.rst#activating-and-deactivating-priorities

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*