Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for RAS extension enables firmware-first
paradigm for handling platform errors: exceptions resulting from errors are
routed to and handled in EL3. Said errors are Synchronous External Abort (SEA),
Asynchronous External Abort (signalled as SErrors), Fault Handling and Error
Recovery interrupts. The |EHF| document mentions various `error handling
use-cases`__.

.. __: exception-handling.rst#delegation-use-cases

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts.
The RAS framework also provides `helpers`__ for accessing Standard Error
Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to the porting guide for the `RAS platform API descriptions`__.

.. __: ../getting_started/porting-guide.rst#external-abort-handling-and-ras-support

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format.
The RAS architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of the
notification mechanisms, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their
arguments, which are later passed to probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the `top-level exception handler`__.

.. __: interrupt-framework-design.rst#el3-interrupts

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts.
For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in the increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions that are part of Armv8.4 introduced new architectural
features to deal with Double Fault conditions, specifically, the introduction of
``NMEA`` and ``EASE`` bits to the ``SCR_EL3`` register. These were introduced to
assist EL3 software which runs part of its entry/exit routines with exceptions
momentarily masked, meaning that, in such systems, External Aborts/SErrors are
not immediately handled when they occur, but only after the exceptions are
unmasked again.

|TF-A|, for legacy reasons, executes entire EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
thus is able to detect a Double Fault condition in software, without needing
the intended advantages of Armv8.4 Double Fault architecture extensions.

Double faults are fatal, and terminate at the platform double fault handler,
which doesn't return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3.
  See `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.
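
As a rough illustration of the interrupt lookup described above, the following
self-contained sketch bisects a sorted interrupt table, then probes and handles
the matching record. It does not use the |TF-A| headers: every ``sketch_``- and
``sample_``-prefixed name, the simplified structure layouts, the interrupt
numbers, and the probe-then-handle sequence are assumptions made for
illustration, not the framework's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an error record description. */
struct sketch_err_record {
    /* Returns non-zero if an error is present; may fill *probe_data. */
    int (*probe)(const struct sketch_err_record *rec, int *probe_data);
    /* Handles the error identified by the probe. */
    int (*handler)(const struct sketch_err_record *rec, int probe_data);
    int id; /* illustrative auxiliary data */
};

/* Hypothetical, simplified stand-in for a RAS interrupt description. */
struct sketch_ras_interrupt {
    unsigned int intr_number;
    const struct sketch_err_record *err_record;
    void *cookie;
};

/*
 * Binary search over a table sorted by increasing interrupt number.
 * Returns the matching entry, or NULL if the number is not registered.
 */
static const struct sketch_ras_interrupt *sketch_find_interrupt(
        const struct sketch_ras_interrupt *tbl, size_t count,
        unsigned int intr)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].intr_number == intr)
            return &tbl[mid];
        if (tbl[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/*
 * Look up the interrupt, probe its record and, if an error is reported,
 * invoke the record's handler. Returns the handler's result, or -1 when
 * the interrupt is unknown or the probe finds no error.
 */
static int sketch_dispatch_interrupt(const struct sketch_ras_interrupt *tbl,
        size_t count, unsigned int intr)
{
    const struct sketch_ras_interrupt *ent;
    int probe_data = 0;

    ent = sketch_find_interrupt(tbl, count, intr);
    if (ent == NULL)
        return -1;

    assert(ent->err_record != NULL);
    if (ent->err_record->probe(ent->err_record, &probe_data) == 0)
        return -1;

    return ent->err_record->handler(ent->err_record, probe_data);
}

/* Sample probe: always reports an error and passes on the record id. */
static int sample_probe(const struct sketch_err_record *rec, int *probe_data)
{
    *probe_data = rec->id;
    return 1;
}

/* Sample handler: simply returns the probe data it was given. */
static int sample_handler(const struct sketch_err_record *rec, int probe_data)
{
    (void)rec;
    return probe_data;
}

static const struct sketch_err_record sample_records[] = {
    { sample_probe, sample_handler, 0 },
    { sample_probe, sample_handler, 1 },
};

/* Must stay sorted by interrupt number for the bisection to work. */
static const struct sketch_ras_interrupt sample_interrupts[] = {
    { 40U, &sample_records[0], NULL },
    { 57U, &sample_records[1], NULL },
};
```

In this sketch, ``sketch_dispatch_interrupt(sample_interrupts, 2, 57)`` finds
the second table entry and runs its handler, while an interrupt number absent
from the table yields ``-1``. The bisection is only correct because the table is
sorted by interrupt number, which is the reason for the ordering requirement on
the array of ``struct ras_interrupt``.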

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a `priority level`__ for handling RAS
exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.

.. __: exception-handling.rst#partitioning-priority-levels

Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
sequences outlined in the |EHF| documentation. That is, for interrupts, the
priority management is implicit; but for non-interrupt exceptions, it is
explicit, using `EHF APIs`__.

.. __: exception-handling.rst#interrupt-flow
.. __: exception-handling.rst#non-interrupt-flow
.. __: exception-handling.rst#activating-and-deactivating-priorities

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*