Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and terminology.

There are two philosophies for handling RAS errors from the Non-secure world's
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and error interrupts corresponding to NS nodes are handled
first in firmware:

-  Errors are signalled back to the NS world via a suitable mechanism.
-  The kernel is prohibited from accessing the RAS error records directly.
-  Firmware creates CPER records for the kernel to navigate and process.
-  Firmware signals the error back to the kernel via SDEI.

Overview
--------

FFH works in conjunction with the `Exception Handling Framework`. Exceptions
resulting from errors in the Non-secure world are routed to and handled in EL3.
These errors are Synchronous External Aborts (SEA), Asynchronous External Aborts
(signalled as SErrors), and Fault Handling and Error Recovery interrupts.

The RAS framework in TF-A allows the platform to define an external abort
handler and to register RAS nodes and interrupts. It also provides `helpers`__
for accessing Standard Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from or attributed to the NS world are handled first in the NS
world, and the kernel navigates the standard error records directly.

**KFH can be supported in a platform without TF-A being aware of it, but there
are a few corner cases where TF-A needs special handling, which is currently
missing and will be added in future.**

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable the RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record
  registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP
  platform.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

-  A handler to probe error records for errors;
-  When the probing identifies an error, a handler to handle it;
-  For a memory-mapped error record, its base address and size in KB; for a
   system register-accessed record, the start index of the record and the number
   of contiguous records from that index;
-  Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of
these notifications, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their
arguments, which is later passed to probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.
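
As a rough illustration of the probe-then-handle flow described in this section,
the following self-contained sketch models how a table of records might be
walked. The structure layouts, ``dispatch_error()``, and the example probe and
handler functions are simplified, hypothetical stand-ins and not the definitions
from the |TF-A| headers; only the two function pointer prototypes are taken from
this document.

.. code:: c

    #include <stddef.h>

    /* Simplified stand-in: the real structure carries more fields. */
    struct err_handler_data {
        unsigned int ea_reason;
    };

    struct err_record_info;

    /* Prototypes as given in this document. */
    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

    /* Simplified stand-in: only the members this sketch needs. */
    struct err_record_info {
        err_record_probe_t probe;
        err_record_handler_t handler;
        void *aux_data;
    };

    /* Example probe: reports that no error is pending. */
    int probe_clean(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        (void)probe_data;
        return 0;
    }

    /* Example probe: reports an error in (hypothetical) record index 3. */
    int probe_faulty(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        *probe_data = 3;
        return 1;
    }

    /* Example handler: stores the probed index through the auxiliary pointer. */
    int store_index_handler(const struct err_record_info *info, int probe_data,
                    const struct err_handler_data *const data)
    {
        (void)data;
        *(int *)info->aux_data = probe_data;
        return 0;
    }

    /*
     * Hypothetical dispatcher: probe each record in order and run the handler
     * of the first record that reports an error. Returns 1 if a handler ran,
     * or 0 if no record reported an error.
     */
    int dispatch_error(const struct err_record_info *records, size_t count,
                    const struct err_handler_data *data)
    {
        for (size_t i = 0; i < count; i++) {
            int probe_data = 0;

            if (records[i].probe(&records[i], &probe_data) != 0) {
                (void)records[i].handler(&records[i], probe_data, data);
                return 1;
            }
        }
        return 0;
    }

In a real platform the array would be built with the ``ERR_RECORD_*_V1`` macros
and registered with ``REGISTER_ERR_RECORD_INFO()`` rather than walked by hand.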

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

-  Return a non-zero value when an error is detected in a Standard Error Record;
-  Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``, containing:

-  Interrupt number;
-  The associated error record information (pointer to the corresponding
   ``struct err_record_info``);
-  Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.
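
The reason for the sorting requirement is that a sorted array can be searched
by bisection. The sketch below shows such a lookup on a simplified stand-in for
``struct ras_interrupt``; the member names and the ``find_ras_interrupt()``
helper are assumptions made for illustration and are not the framework's own
code.

.. code:: c

    #include <stddef.h>

    struct err_record_info;    /* opaque in this sketch */

    /* Simplified stand-in; member names are illustrative. */
    struct ras_interrupt {
        unsigned int intr_number;
        struct err_record_info *err_record;
        void *cookie;
    };

    /*
     * Binary search over an array sorted by increasing intr_number.
     * Returns the matching entry, or NULL if the interrupt is not listed.
     */
    const struct ras_interrupt *find_ras_interrupt(
            const struct ras_interrupt *intrs, size_t count,
            unsigned int intr_number)
    {
        size_t lo = 0;
        size_t hi = count;    /* search the half-open range [lo, hi) */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (intrs[mid].intr_number == intr_number)
                return &intrs[mid];
            if (intrs[mid].intr_number < intr_number)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

If the array were not sorted, a search like this could miss entries that are
present, which is why the ordering is a requirement rather than a suggestion.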

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not immediately handled
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, runs all of its EL3 code with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect a Double Fault condition in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which doesn't return.
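
Software detection of this kind can be pictured with a small model: a per-CPU
marker records that error handling is in progress, and a second error arriving
while the marker is set is classified as a Double Fault. This is only a
conceptual sketch and is not how |TF-A| implements the check; every name in it
is hypothetical.

.. code:: c

    #include <stdbool.h>

    enum ea_outcome {
        EA_HANDLED,         /* normal error handling may proceed */
        EA_DOUBLE_FAULT,    /* would go to the fatal double fault handler */
    };

    /* Hypothetical per-CPU state. */
    struct cpu_ras_state {
        bool handling_error;
    };

    /* Called on entry to error handling; classifies the incoming error. */
    enum ea_outcome ea_enter(struct cpu_ras_state *state)
    {
        if (state->handling_error)
            return EA_DOUBLE_FAULT;

        state->handling_error = true;
        return EA_HANDLED;
    }

    /* Called once the first error has been fully handled. */
    void ea_exit(struct cpu_ras_state *state)
    {
        state->handling_error = false;
    }

In the model the Double Fault case simply returns a value so that it can be
exercised; on a real system that path ends in the platform handler and never
returns.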

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.
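
A platform override might be shaped roughly as follows. The parameter list and
the convention that ``ras_ea_handler()`` returns non-zero when it handled the
error are assumptions here and should be checked against the |TF-A| headers.
The ``ras_ea_handler()`` body and the ``unhandled_aborts`` counter are stand-ins
that exist only so the sketch compiles on its own.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for the framework function: pretends only reason 1 is RAS. */
    int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                    void *handle, uint64_t flags)
    {
        (void)syndrome;
        (void)cookie;
        (void)handle;
        (void)flags;
        return (ea_reason == 1U) ? 1 : 0;
    }

    /* Hypothetical counter standing in for platform-specific fallback. */
    unsigned int unhandled_aborts;

    /* Platform override: give the RAS framework the first chance. */
    void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                    void *handle, uint64_t flags)
    {
        if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
            return;    /* the RAS framework dealt with the error */

        /* Not a RAS error: platform-specific handling would go here. */
        unhandled_aborts++;
    }

What the fallback branch should do is a platform decision; counting is used
here only to keep the sketch observable.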

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest