Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

.. contents::
    :depth: 2

.. |EHF| replace:: Exception Handling Framework
.. |TF-A| replace:: Trusted Firmware-A

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extensions enables a
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. These errors are Synchronous External
Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault
Handling and Error Recovery interrupts. The |EHF| document mentions various
`error handling use-cases`__.

.. __: exception-handling.rst#delegation-use-cases

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.

.. _ras-figure:

.. image:: ../draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to the porting guide for the `RAS platform API descriptions`__.

.. __: ../getting_started/porting-guide.rst#external-abort-handling-and-ras-support

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives a
notification through one of these mechanisms, the RAS framework can iterate
through and probe the error records for errors, and invoke the appropriate
handler for each error found.

The RAS framework provides macros to populate error record information.
The macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments; that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the `top-level exception handler`__.

.. __: interrupt-framework-design.rst#el3-interrupts

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, it may use these helpers for
records in the Standard Error Record format instead of writing its own probe
handlers. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with the |EHF|. See `Interaction
with Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.
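
As a rough illustration of the description above, the following self-contained
sketch models a table of RAS interrupts and a lookup by interrupt number. The
structure definitions are simplified stand-ins written for this example so that
it compiles outside |TF-A|; the field names, the interrupt numbers ``35`` and
``92``, and the helpers ``ras_interrupts_sorted()`` and
``find_ras_interrupt()`` are assumptions, not framework APIs. A real port would
use the framework's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types (assumed layout, for illustration only). */
struct err_record_info {
    const char *name;                    /* hypothetical label for the node */
};

struct ras_interrupt {
    unsigned int intr_number;            /* interrupt number */
    struct err_record_info *err_record;  /* associated error record info */
    void *cookie;                        /* optional; NULL when unused */
};

static struct err_record_info dmc_record = { "dmc" };
static struct err_record_info cpu_record = { "cpu" };

/* Hypothetical table, kept in increasing order of interrupt number. */
static struct ras_interrupt plat_ras_interrupts[] = {
    { 35U, &dmc_record, NULL },
    { 92U, &cpu_record, NULL },
};

#define NUM_RAS_INTERRUPTS \
    (sizeof(plat_ras_interrupts) / sizeof(plat_ras_interrupts[0]))

/* Return 1 if the table is strictly increasing by interrupt number. */
int ras_interrupts_sorted(const struct ras_interrupt *tbl, size_t n)
{
    assert(tbl != NULL);
    for (size_t i = 1; i < n; i++) {
        if (tbl[i - 1].intr_number >= tbl[i].intr_number)
            return 0;
    }
    return 1;
}

/* Binary search over the sorted table; NULL when no entry matches. */
const struct ras_interrupt *find_ras_interrupt(const struct ras_interrupt *tbl,
                                               size_t n, unsigned int intr)
{
    size_t lo = 0, hi = n;               /* search the half-open range [lo, hi) */

    assert(tbl != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].intr_number == intr)
            return &tbl[mid];
        if (tbl[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

A binary search like this is only valid while the table stays ordered by
interrupt number, which is why the sketch includes a separate ordering check.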

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions in Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in the
``SCR_EL3`` register. These were introduced to assist EL3 software which runs
part of its entry/exit routines with exceptions momentarily masked. In such
systems, External Aborts/SErrors are not handled immediately when they occur,
but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of its EL3 code with all exceptions
unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect a Double Fault condition in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which does not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.
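
The probe-then-handle flow described above can be sketched as follows. This is
a simplified, self-contained model rather than the |TF-A| implementation: the
structure layouts, ``sketch_ea_dispatch()``, the stub probes and handler, and
the record index ``3`` are all hypothetical, and only the two callback
prototypes are taken from this document.

```c
#include <assert.h>
#include <stddef.h>

struct err_record_info;

/* Stand-in for the error description passed to handlers (assumed fields). */
struct err_handler_data {
    unsigned int ea_reason;
    unsigned long syndrome;
};

/* Callback prototypes as documented for error records. */
typedef int (*err_record_probe_t)(const struct err_record_info *info,
                                  int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
                                    int probe_data,
                                    const struct err_handler_data *const data);

/* Stand-in record layout (assumed, for illustration only). */
struct err_record_info {
    err_record_probe_t probe;
    err_record_handler_t handler;
    void *aux_data;
};

/* Remembers what the stub handler saw, so the sketch can be checked. */
int last_handled_index = -1;

/* Stub probe: reports that no error is pending. */
int probe_clean(const struct err_record_info *info, int *probe_data)
{
    (void)info;
    (void)probe_data;
    return 0;
}

/* Stub probe: reports an error and passes a record index to the handler. */
int probe_faulty(const struct err_record_info *info, int *probe_data)
{
    (void)info;
    *probe_data = 3;
    return 1;
}

/* Stub handler: only records the probe data it was given. */
int record_handler(const struct err_record_info *info, int probe_data,
                   const struct err_handler_data *const data)
{
    (void)info;
    (void)data;
    last_handled_index = probe_data;
    return 0;
}

static struct err_record_info plat_records[] = {
    { probe_clean, record_handler, NULL },
    { probe_faulty, record_handler, NULL },
};

/* Probe every record once; invoke the handler for each one reporting an
 * error. Returns the number of records whose handler was invoked. */
int sketch_ea_dispatch(const struct err_record_info *records, size_t n,
                       const struct err_handler_data *data)
{
    int handled = 0;

    for (size_t i = 0; i < n; i++) {
        int probe_data = 0;

        assert(records[i].probe != NULL && records[i].handler != NULL);
        if (records[i].probe(&records[i], &probe_data) == 0)
            continue;                    /* nothing pending on this record */
        (void)records[i].handler(&records[i], probe_data, data);
        handled++;
    }
    return handled;
}
```

In this model the value written to ``probe_data`` by the probe is handed to the
handler unchanged, which mirrors how the two callbacks are meant to cooperate.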

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a `priority level`__ for handling RAS
exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.

.. __: exception-handling.rst#partitioning-priority-levels

Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
sequences outlined in the |EHF| documentation.
That is, for interrupts, the priority management is implicit; but for
non-interrupt exceptions, it is explicit, using `EHF APIs`__.

.. __: exception-handling.rst#interrupt-flow
.. __: exception-handling.rst#non-interrupt-flow
.. __: exception-handling.rst#activating-and-deactivating-priorities

----

*Copyright (c) 2018, Arm Limited and Contributors. All rights reserved.*