PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
implementation, using the in-built Performance Measurement Framework (PMF) and
runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has one cluster of 4 x
Cortex-A53 cores and one cluster of 2 x Cortex-A57 cores, running at the
following frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

We used the upstream `TF master as of 31/01/2017`_, building the platform using
the ``ENABLE_RUNTIME_INSTRUMENTATION`` option:

.. code:: shell

    make PLAT=juno ENABLE_RUNTIME_INSTRUMENTATION=1 \
        SCP_BL2=<path/to/scp-fw.bin>                \
        BL33=<path/to/test-fw.bin>                  \
        all fip

When using the debug build of TF, there was no noticeable difference in the
results.

The tests are based on an ARM-internal test framework. The release build of this
framework was used because the results in the debug build became skewed; the
console output prevented some of the tests from executing in parallel.

The tests consist of both parallel and sequential tests, which are broadly
described as follows:

- **Parallel Tests** This type of test powers on all the non-lead CPUs and
  brings them and the lead CPU to a common synchronization point.  The lead CPU
  then initiates the test on all CPUs in parallel.

- **Sequential Tests** This type of test powers on each non-lead CPU in
  sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
  test to complete before proceeding to the next non-lead CPU. The lead CPU then
  executes the test on itself.

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.
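
The two test styles can be modelled on a host machine, which may help when
reading the result tables. The following is only a sketch: Python threads stand
in for CPUs, and ``run_parallel``, ``run_sequential`` and ``test_fn`` are
hypothetical names that are not part of the test framework described here.

.. code:: python

    import threading

    NUM_CPUS = 6   # CPUs 0-3 are little (A53), CPUs 4-5 are big (A57)
    LEAD_CPU = 4   # the lead CPU in every test in this document


    def run_parallel(test_fn):
        """All CPUs meet at a common synchronization point, then run together."""
        barrier = threading.Barrier(NUM_CPUS)
        results = {}

        def worker(cpu):
            barrier.wait()              # common synchronization point
            results[cpu] = test_fn(cpu)

        non_lead = [
            threading.Thread(target=worker, args=(cpu,))
            for cpu in range(NUM_CPUS) if cpu != LEAD_CPU
        ]
        for thread in non_lead:         # stands in for powering on non-lead CPUs
            thread.start()
        worker(LEAD_CPU)                # the lead CPU takes part as well
        for thread in non_lead:
            thread.join()
        return results


    def run_sequential(test_fn):
        """Each non-lead CPU runs to completion in turn; the lead CPU goes last."""
        order = [cpu for cpu in range(NUM_CPUS) if cpu != LEAD_CPU] + [LEAD_CPU]
        return [(cpu, test_fn(cpu)) for cpu in order]

In this model the parallel variant lets all six workers contend at once, while
the sequential variant never has more than one test in flight.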

``PSCI_ENTRY`` refers to the time taken from entering the TF PSCI implementation
to the point the hardware enters the low power state (WFI). Referring to the TF
runtime instrumentation points, this corresponds to:
``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.

``PSCI_EXIT`` refers to the time taken from the point the hardware exits the low
power state to exiting the TF PSCI implementation. This corresponds to:
``(RT_INSTR_EXIT_PSCI - RT_INSTR_EXIT_HW_LOW_PWR)``.

``CFLUSH_OVERHEAD`` refers to the part of ``PSCI_ENTRY`` taken to flush the
caches. This corresponds to: ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.
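
As an illustration of the three definitions above, the sketch below derives the
metrics from one CPU's timestamps. The ``RtInstrTimestamps`` class, the
``psci_metrics`` function and the sample timestamp values are invented for this
example; only the ``RT_INSTR_*`` names in the comments come from TF.

.. code:: python

    from dataclasses import dataclass


    @dataclass
    class RtInstrTimestamps:
        """One CPU's timestamps, assumed already converted to microseconds."""
        enter_psci: int        # RT_INSTR_ENTER_PSCI
        enter_cflush: int      # RT_INSTR_ENTER_CFLUSH
        exit_cflush: int       # RT_INSTR_EXIT_CFLUSH
        enter_hw_low_pwr: int  # RT_INSTR_ENTER_HW_LOW_PWR
        exit_hw_low_pwr: int   # RT_INSTR_EXIT_HW_LOW_PWR
        exit_psci: int         # RT_INSTR_EXIT_PSCI


    def psci_metrics(ts):
        """Apply the three subtractions given in the text."""
        return {
            "PSCI_ENTRY": ts.enter_hw_low_pwr - ts.enter_psci,
            "PSCI_EXIT": ts.exit_psci - ts.exit_hw_low_pwr,
            "CFLUSH_OVERHEAD": ts.exit_cflush - ts.enter_cflush,
        }


    # Made-up timestamps, not measured data.
    sample = RtInstrTimestamps(1000, 1010, 1015, 1027, 5000, 5020)
    metrics = psci_metrics(sample)

With these made-up timestamps the result is a ``PSCI_ENTRY`` of 27 us, a
``PSCI_EXIT`` of 20 us and a ``CFLUSH_OVERHEAD`` of 5 us. Note that the cache
flush interval lies inside the ``PSCI_ENTRY`` interval.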

Note there is very little variance observed in the values given (~1us), although
the values for each CPU are sometimes interchanged, depending on the order in
which locks are acquired. Also, there is very little variance observed between
executing the tests sequentially in a single boot or rebooting between tests.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50MHz on Juno.
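
A 50MHz counter advances once every 20ns, which bounds the resolution of the
timestamps. A minimal sketch of the conversion from raw counter ticks follows;
``ticks_to_ns`` and ``ticks_to_us`` are hypothetical helper names, not PMF
APIs.

.. code:: python

    COUNTER_FREQ_HZ = 50_000_000  # generic counter frequency on Juno


    def ticks_to_ns(ticks, freq_hz=COUNTER_FREQ_HZ):
        """Convert a tick delta to whole nanoseconds using integer arithmetic."""
        return ticks * 1_000_000_000 // freq_hz


    def ticks_to_us(ticks, freq_hz=COUNTER_FREQ_HZ):
        """Convert a tick delta to microseconds."""
        return ticks * 1_000_000 / freq_hz


    tick_period_ns = ticks_to_ns(1)

For example, an interval of 1350 ticks corresponds to 27us.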

Results and Commentary
----------------------

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache of the big cluster is a lot larger (2MB) than that of the
little cluster (1MB).

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+

There is no lock contention in TF generic code at power level 0 but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.
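
One way to see the serialization in the table above is to look at the step
between consecutive little-cluster ``PSCI_ENTRY`` times. The short calculation
below uses only values copied from that table. It assumes the little CPUs were
served in CPU-number order, which is not guaranteed since the values for each
CPU are sometimes interchanged.

.. code:: python

    # PSCI_ENTRY times (us) for little-cluster CPUs 0-3, copied from the table.
    psci_entry_us = [116, 204, 287, 376]

    # Extra time each CPU spends compared with the CPU served before it.
    steps = [b - a for a, b in zip(psci_entry_us, psci_entry_us[1:])]
    mean_step = sum(steps) / len(steps)

The steps are 88, 83 and 89 us, so under this reading each queued power down
command adds a roughly constant wait of just under 90 us.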

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 is flushed (L1).

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead
cluster are large because all other CPUs in the cluster are powered down during
the test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache of the big cluster is a lot larger (2MB) than that of
the little cluster (1MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.

``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the cache to power level 0 (L1). This is the best case
scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
for the CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the previous test because
the cluster remains powered on throughout the test and there is less code to
execute on power on (for example, no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program wake up timer and suspend the lead CPU to the deepest power level.

3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache of the big cluster is a lot larger (2MB) than that of
the little cluster (1MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
for CPUs in the little cluster due to greater CPU performance.  These times
generally are greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
because there is more code to execute in the "on finisher" compared to the
"suspend finisher" (for example, GIC redistributor register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are less than the little CPUs due to greater CPU
performance.

We suspect the time for lead CPU 4 is shorter than CPU 5 due to subtle cache
effects, given that these measurements are at the nanosecond level.
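
For scale, the figures in the table above can be expressed in ticks of the
50MHz generic counter that PMF uses for timestamps. This is a sketch of the
arithmetic only; the variable names are illustrative.

.. code:: python

    TICK_NS = 1_000_000_000 // 50_000_000  # 20 ns per tick at 50 MHz

    # TOTAL TIME (ns) per CPU, copied from the table above.
    total_time_ns = {0: 3020, 1: 2940, 2: 2980, 3: 3060, 4: 520, 5: 720}

    # Express each measurement as a number of counter ticks.
    ticks = {cpu: ns // TICK_NS for cpu, ns in total_time_ns.items()}
    whole_ticks = all(ns % TICK_NS == 0 for ns in total_time_ns.values())
    big_cpu_gap_ticks = ticks[5] - ticks[4]

Every value is a whole number of ticks, and the 200 ns gap between CPU 4 and
CPU 5 amounts to 10 ticks.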

--------------

*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*

.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d