Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and its terminology.

**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present, then FEAT_IESB is also implemented.

There are two philosophies for handling RAS errors from the Non-secure world's
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and error interrupts corresponding to NS nodes are handled
first in firmware:

- Errors are signaled back to the NS world via a suitable mechanism.
- The kernel is prohibited from accessing the RAS error records directly.
- Firmware creates CPER records for the kernel to navigate and process.
- Firmware signals the error back to the kernel via SDEI.

Overview
--------

FFH works in conjunction with the `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault Handling
and Error Recovery interrupts.
The RAS framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.


.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from or attributed to the NS world are handled first in NS, and
the kernel navigates the standard error records directly.

- KFH is the default handling mode if the platform does not explicitly enable FFH mode.
- KFH mode does not need any EL3 involvement except for the reflection of errors back
  to the lower EL. This happens when there is an error (EA) in the system which is not yet
  signaled to the PE while executing at the lower EL. During entry into EL3 the errors (EA) are
  synchronized, causing the async EA to pend at EL3.

Error Synchronization at EL3 entry
==================================

During entry to EL3 from a lower EL, any pending async EAs are either
reflected back to the lower EL (KFH) or handled in EL3 itself (FFH).

|Image 1|

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

TF-A Tests
==========

RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has multiple
configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as an NS-EL2 payload.

- **FFH without RAS extension**

  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3.
  They inject External Aborts (sync/async) which trap to EL3; FVP has a handler which gracefully
  handles these errors and returns to TF-A Tests.

  Build Configs : **HANDLE_EA_EL3_FIRST_NS** , **PLATFORM_TEST_EA_FFH**

- **FFH with RAS extension**

  Three Tests :

  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject an unrecoverable RAS error, which gets handled in EL3.

  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*

    Inject uncontainable RAS errors, which cause the platform to panic.

  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*

    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower EL
    which remains pending until EL3 is entered through an SMC call. At EL3 entry, on encountering a
    pending async EA, EL3 handles the async EA first (nested exception) before handling the
    original SMC call.

- **KFH with RAS extension**

  A couple of tests in the group :

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests; two tests, one each for reflecting
    in the IRQ and SMC paths.

RAS Framework
=============


.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and number of
  contiguous records from that index;
- Any node-specific auxiliary data.
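
As an illustration of the list above, the following is a minimal, self-contained
sketch of how per-record information and a probe-then-handle pass could fit
together. It is not the |TF-A| implementation: every ``demo_*`` name is
hypothetical, the structure is a simplified stand-in for the framework's own
types, and stopping at the first reported error is an assumption of the sketch.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    struct demo_record;

    /* Returns non-zero when the record holds an error; may fill *probe_data. */
    typedef int (*demo_probe_fn)(const struct demo_record *rec, int *probe_data);

    /* Handles an error previously reported by the probe. */
    typedef int (*demo_handler_fn)(const struct demo_record *rec, int probe_data);

    /* Hypothetical, simplified per-record information. */
    struct demo_record {
        demo_probe_fn probe;
        demo_handler_fn handler;
        uintptr_t base_addr;    /* memory-mapped record: base address */
        unsigned int size_kb;   /* memory-mapped record: size in KB */
        const void *aux;        /* node-specific auxiliary data */
    };

    /* Last value passed from a probe to a handler, kept for demonstration. */
    int demo_last_probe_data = -1;

    /* Demo probe: aux points to an int that is non-zero when an error exists. */
    int demo_probe(const struct demo_record *rec, int *probe_data)
    {
        const int *flag = rec->aux;

        if (flag == NULL || *flag == 0)
            return 0;
        *probe_data = *flag;
        return 1;
    }

    /* Demo handler: only records what the probe passed along. */
    int demo_handler(const struct demo_record *rec, int probe_data)
    {
        (void)rec;
        demo_last_probe_data = probe_data;
        return 0;
    }

    /*
     * Probe each record in turn and handle the first error found.
     * Returns 1 if an error was handled, 0 if no record reported one.
     */
    int demo_dispatch(const struct demo_record *recs, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) != 0) {
                (void)recs[i].handler(&recs[i], probe_data);
                return 1;
            }
        }
        return 0;
    }

Because each entry pairs a probe with a handler, a single loop can service any
mix of nodes without knowing how each node's records are accessed.
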

With this information supplied, when the run time firmware receives a
notification through one of these mechanisms, the RAS framework can iterate
through and probe error records for errors, and invoke the appropriate handler
to handle them.

The RAS framework provides the macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their arguments,
which is later passed to probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.
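
A minimal sketch of a probe handler that follows this contract is shown below.
It is written to build stand-alone, so ``struct err_record_info`` is left
opaque, and the ``demo_status`` array and ``DEMO_STATUS_VALID`` bit are
hypothetical stand-ins for real error record state; an actual probe would read
the node's registers instead.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Opaque in this sketch; the real type comes from the RAS framework. */
    struct err_record_info;

    /* Probe prototype as documented above. */
    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

    /* Hypothetical stand-in for hardware state: one status word per record. */
    #define DEMO_NUM_RECORDS    4
    #define DEMO_STATUS_VALID   (UINT64_C(1) << 0)

    uint64_t demo_status[DEMO_NUM_RECORDS];

    /*
     * Return non-zero and report the record index through probe_data when a
     * record is marked valid; return 0 and leave probe_data untouched otherwise.
     */
    int demo_probe(const struct err_record_info *info, int *probe_data)
    {
        (void)info;

        for (int i = 0; i < DEMO_NUM_RECORDS; i++) {
            if ((demo_status[i] & DEMO_STATUS_VALID) != 0U) {
                *probe_data = i;
                return 1;
            }
        }
        return 0;
    }

    /* Compile-time check that the function matches the documented type. */
    err_record_probe_t demo_probe_ptr = demo_probe;

Reporting the index through ``probe_data`` means the error handler does not
need to rescan the records to find out which one signalled the error.
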

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.
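
The three items above can be pictured with the following stand-alone sketch,
which also shows a lookup by interrupt number. The ``demo_*`` names, the field
layout, and the interrupt numbers are all hypothetical and are not taken from
the framework headers; the lookup assumes the table is sorted by ascending
interrupt number.

.. code:: c

    #include <stddef.h>

    /* Opaque in this sketch; the real type comes from the RAS framework. */
    struct err_record_info;

    /* Hypothetical mirror of the per-interrupt information listed above. */
    struct demo_ras_interrupt {
        unsigned int intr_number;
        struct err_record_info *err_record;
        void *cookie;
    };

    /*
     * Binary search over a table sorted by ascending intr_number.
     * Returns the matching entry, or NULL when the number is not registered.
     */
    const struct demo_ras_interrupt *demo_find_interrupt(
            const struct demo_ras_interrupt *table, size_t count,
            unsigned int intr_number)
    {
        size_t lo = 0;
        size_t hi = count;    /* search window is [lo, hi) */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr_number)
                return &table[mid];
            if (table[mid].intr_number < intr_number)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

    /* Example table; the interrupt numbers are made up and sorted ascending. */
    const struct demo_ras_interrupt demo_interrupts[] = {
        { 40U, NULL, NULL },
        { 41U, NULL, NULL },
        { 57U, NULL, NULL },
    };

A sorted table keeps the lookup logarithmic in the number of registered
interrupts, which matters because it runs on every RAS interrupt.
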

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in the increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not immediately
handled when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes the entirety of EL3 with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| thus is able to detect a Double Fault condition in software, without
needing the intended advantages of Armv8.4 Double Fault architecture extensions.

Double faults are fatal; they terminate at the platform double fault handler and
do not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master

.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png