Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and its terminology.

There are two philosophies for handling RAS errors from the Non-secure world's
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and Error interrupts corresponding to NS nodes are handled
first in firmware:

- Errors are signalled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are
Synchronous External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors),
and Fault Handling and Error Recovery interrupts.
The RAS Framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from, or attributed to, the NS world are handled first in the NS world,
and the kernel navigates the standard error records directly.

**KFH can be supported on a platform without TF-A being aware of it, but there are a few
corner cases where TF-A needs special handling, which is currently missing and
will be added in the future.**

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and the number of
  contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of these
notifications, the RAS framework can iterate through and probe the error
records for errors, and invoke the appropriate handler for each error found.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its arguments;
that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.
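
The probe-then-handle flow described above can be sketched outside of |TF-A| as
follows. This is a minimal, stand-alone illustration and not the framework's
implementation: every ``sketch_`` name, the ``status`` array that stands in for
error record registers, and the two-argument handler type are assumptions made
for this example only. The real framework uses ``struct err_record_info`` and
the handler prototypes documented in this section.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for struct err_record_info (not the TF-A type). */
    struct sketch_err_record;

    typedef int (*sketch_probe_t)(const struct sketch_err_record *info,
                                  int *probe_data);
    typedef int (*sketch_handler_t)(const struct sketch_err_record *info,
                                    int probe_data);

    struct sketch_err_record {
        sketch_probe_t probe;        /* detects whether a record has an error */
        sketch_handler_t handler;    /* handles the error found by the probe */
        const unsigned int *status;  /* assumption: one status word per record */
        int num_records;
        void *aux_data;              /* node-specific auxiliary data */
    };

    /* Probe: non-zero if any record reports an error; index via probe_data. */
    static int sketch_probe(const struct sketch_err_record *info, int *probe_data)
    {
        assert(probe_data != NULL);
        for (int i = 0; i < info->num_records; i++) {
            if (info->status[i] != 0U) {
                *probe_data = i;
                return 1;
            }
        }
        return 0;
    }

    /* Records which index was handled, so the sketch can be checked. */
    static int sketch_last_handled = -1;

    static int sketch_handler(const struct sketch_err_record *info, int probe_data)
    {
        (void)info;
        sketch_last_handled = probe_data;
        return 0;
    }

    /* Probe each entry in turn; handle the first error found. Returns 1 if an
     * error was found and handled, 0 if every probe came back clean. */
    static int sketch_dispatch(const struct sketch_err_record *records, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int probe_data = 0;

            if (records[i].probe(&records[i], &probe_data) != 0) {
                (void)records[i].handler(&records[i], probe_data);
                return 1;
            }
        }
        return 0;
    }

In this sketch the probe reports the index of the failing record through
``probe_data``, which is the usage suggested above; the dispatcher simply
forwards that value to the handler.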

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and register
it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, it may use these helpers for records
in the Standard Error Record format instead of writing its own probe handlers.
Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type ``struct
ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.
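
As a rough illustration of how such a table can be searched, the stand-alone
sketch below keeps the entries sorted by interrupt number and bisects them, in
the spirit of the lookup this document describes for RAS interrupts. It is not
the framework's code: the ``sketch_`` types and functions, the interrupt numbers,
and the use of a string in place of a ``struct err_record_info`` pointer are all
assumptions made for this example.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for struct ras_interrupt (not the TF-A type). */
    struct sketch_ras_interrupt {
        unsigned int intr_number;
        const char *err_record_name;  /* stands in for the err_record_info pointer */
        void *cookie;                 /* optional */
    };

    /* Made-up interrupt numbers, listed in increasing order. */
    static const struct sketch_ras_interrupt sketch_interrupts[] = {
        { 32U, "node0", NULL },
        { 45U, "node1", NULL },
        { 71U, "node2", NULL },
    };

    /* Returns 1 if the table is strictly increasing by interrupt number. */
    static int sketch_is_sorted(const struct sketch_ras_interrupt *table,
                                size_t count)
    {
        for (size_t i = 1U; i < count; i++) {
            if (table[i - 1U].intr_number >= table[i].intr_number) {
                return 0;
            }
        }
        return 1;
    }

    /* Binary search over the sorted table; NULL if the interrupt is unknown. */
    static const struct sketch_ras_interrupt *
    sketch_find_interrupt(const struct sketch_ras_interrupt *table, size_t count,
                          unsigned int intr)
    {
        size_t lo = 0U;
        size_t hi = count;

        assert((table != NULL) || (count == 0U));
        while (lo < hi) {
            size_t mid = lo + ((hi - lo) / 2U);

            if (table[mid].intr_number == intr) {
                return &table[mid];
            }
            if (table[mid].intr_number < intr) {
                lo = mid + 1U;
            } else {
                hi = mid;
            }
        }
        return NULL;
    }

A bisection like this only gives correct results on a sorted table, which is
why a sortedness check is included in the sketch.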

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions in Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in the
``SCR_EL3`` register. These were introduced to assist EL3 software which runs
part of its entry/exit routines with exceptions momentarily masked. In such
systems, External Aborts/SErrors are not handled immediately when they occur,
but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes the whole of EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect a Double Fault condition in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which doesn't return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it'll first call the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3.
The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, priority management is implicit; for
non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest