PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
implementation, using the in-built Performance Measurement Framework (PMF) and
runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has a cluster of 4
Cortex-A53 CPUs and a cluster of 2 Cortex-A57 CPUs, running at the following
frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

We used the upstream `TF master as of 31/01/2017`_, building the platform using
the ``ENABLE_RUNTIME_INSTRUMENTATION`` option:

.. code:: shell

    make PLAT=juno ENABLE_RUNTIME_INSTRUMENTATION=1 \
        SCP_BL2=<path/to/scp-fw.bin> \
        BL33=<path/to/test-fw.bin> \
        all fip

When using the debug build of TF, there was no noticeable difference in the
results.

The tests are based on an Arm-internal test framework. The release build of this
framework was used because the results in the debug build became skewed; the
console output prevented some of the tests from executing in parallel.

The tests consist of both parallel and sequential tests, which are broadly
described as follows:

- **Parallel Tests** This type of test powers on all the non-lead CPUs and
  brings them and the lead CPU to a common synchronization point. The lead CPU
  then initiates the test on all CPUs in parallel.

- **Sequential Tests** This type of test powers on each non-lead CPU in
  sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
  test to complete before proceeding to the next non-lead CPU. The lead CPU then
  executes the test on itself.

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.

``PSCI_ENTRY`` refers to the time taken from entering the TF PSCI implementation
to the point the hardware enters the low power state (WFI). Referring to the TF
runtime instrumentation points, this corresponds to:
``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.

``PSCI_EXIT`` refers to the time taken from the point the hardware exits the low
power state to exiting the TF PSCI implementation. This corresponds to:
``(RT_INSTR_EXIT_PSCI - RT_INSTR_EXIT_HW_LOW_PWR)``.

``CFLUSH_OVERHEAD`` refers to the part of ``PSCI_ENTRY`` taken to flush the
caches. This corresponds to: ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.

Very little variance (~1us) was observed in the values given, although the
values for each CPU are sometimes interchanged, depending on the order in which
locks are acquired. Likewise, very little variance was observed between
executing the tests sequentially in a single boot and rebooting between tests.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50MHz on Juno.
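
As a worked illustration of the three definitions above, the following minimal
sketch converts raw counter readings into microseconds. It is not TF-A code:
the dictionary layout, the ``ticks_to_us`` and ``psci_metrics`` helpers and the
sample tick values are hypothetical. Only the timestamp names and the 50MHz
counter frequency are taken from this document.

.. code:: python

    COUNTER_HZ = 50_000_000  # generic counter frequency on Juno (from this document)

    def ticks_to_us(ticks):
        """Convert generic counter ticks to microseconds."""
        return ticks * 1_000_000 / COUNTER_HZ

    def psci_metrics(ts):
        """Derive the three metrics from a dict of raw timestamps (in ticks)."""
        return {
            "PSCI_ENTRY": ticks_to_us(
                ts["RT_INSTR_ENTER_HW_LOW_PWR"] - ts["RT_INSTR_ENTER_PSCI"]),
            "PSCI_EXIT": ticks_to_us(
                ts["RT_INSTR_EXIT_PSCI"] - ts["RT_INSTR_EXIT_HW_LOW_PWR"]),
            "CFLUSH_OVERHEAD": ticks_to_us(
                ts["RT_INSTR_EXIT_CFLUSH"] - ts["RT_INSTR_ENTER_CFLUSH"]),
        }

    # Hypothetical sample readings, chosen only to exercise the arithmetic.
    sample = {
        "RT_INSTR_ENTER_PSCI": 1_000,
        "RT_INSTR_ENTER_CFLUSH": 1_200,
        "RT_INSTR_EXIT_CFLUSH": 1_450,
        "RT_INSTR_ENTER_HW_LOW_PWR": 2_350,
        "RT_INSTR_EXIT_HW_LOW_PWR": 10_000,
        "RT_INSTR_EXIT_PSCI": 11_000,
    }
    metrics = psci_metrics(sample)
    print(metrics)

At 50MHz one tick is 20ns, which is the finest interval these timestamps can
resolve.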

Results and Commentary
----------------------

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache for the big cluster is a lot larger (2MB) than that of
the little cluster (1MB).

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+


There is no lock contention in TF generic code at power level 0 but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 is flushed (L1).
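
To make the serialization effect concrete, the short sketch below takes the
``PSCI_ENTRY`` times for CPUs 0-3 from the table above and computes the
increment from one CPU to the next. The variable names are illustrative only,
and a roughly constant step is consistent with serialized power down commands
rather than proof of it.

.. code:: python

    # PSCI_ENTRY times (us) for CPUs 0-3, copied from the table above.
    little_entry_us = [116, 204, 287, 376]

    # Difference between each CPU and the one before it.
    steps = [b - a for a, b in zip(little_entry_us, little_entry_us[1:])]
    mean_step = sum(steps) / len(steps)

    print(steps)                 # [88, 83, 89]
    print(round(mean_step, 1))   # 86.7

Each additional little CPU adds between 83us and 89us to ``PSCI_ENTRY`` in this
data set.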

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
are large because all other CPUs in the cluster are powered down during the
test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.
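
The sketch below shows how much of ``PSCI_ENTRY`` is accounted for by the cache
flush in this test, using one representative row per case from the table above
(CPU 0 for the little cluster, CPU 4 and CPU 5 for the big cluster). The
variable names are illustrative only.

.. code:: python

    # (PSCI_ENTRY, CFLUSH_OVERHEAD) in us, copied from the table above.
    rows = {0: (114, 94), 4: (195, 180), 5: (21, 6)}

    # Percentage of PSCI_ENTRY spent flushing caches, rounded to a whole number.
    flush_share = {
        cpu: round(100 * flush / entry) for cpu, (entry, flush) in rows.items()
    }

    print(flush_share)  # {0: 82, 4: 92, 5: 29}

When both L1 and L2 are flushed (CPUs 0 and 4), the flush accounts for most of
the entry time. For CPU 5, where ``CFLUSH_OVERHEAD`` is only 6us, it is under a
third.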

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache for the big cluster is a lot larger (2MB) than that
of the little cluster (1MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.

``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the cache to power level 0 (L1). This is the best case
scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
for the CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the previous test because
the cluster remains powered on throughout the test and there is less code to
execute on power on (for example, no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program the wake up timer and suspend the lead CPU to the deepest power
   level.

3. Call ``CPU_ON`` on the non-lead CPUs to get the timestamps from each CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.
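
The reasoning in the two paragraphs above can be sketched as a simplified
model. This is an illustration only, not how TF-A implements state
coordination: it considers just power level 0 (CPU) and power level 1
(cluster), ignores the system level, and the names ``CLUSTERS`` and
``reachable_level`` are hypothetical. The CPU numbering follows this document.

.. code:: python

    # CPU numbering used in this document: 0-3 little (A53), 4-5 big (A57).
    CLUSTERS = {"little": {0, 1, 2, 3}, "big": {4, 5}}

    def reachable_level(cpu, running):
        """Return the deepest level `cpu` can reach in this simplified model.

        `running` is the set of CPUs that are currently powered on, including
        `cpu` itself. The result is 1 (cluster level, L1 and L2 flushed) if no
        other CPU in the same cluster is running, else 0 (CPU level, L1 only).
        """
        cluster = next(c for c in CLUSTERS.values() if cpu in c)
        others_running = (cluster & running) - {cpu}
        return 0 if others_running else 1

    # CPU 5 powers down while lead CPU 4 is still running: CPU level only.
    print(reachable_level(5, {4, 5}))   # 0
    # CPU 0 is the only little CPU powered on: cluster level.
    print(reachable_level(0, {0, 4}))   # 1

Under this model, level 1 corresponds to the rows with a large
``CFLUSH_OVERHEAD`` and level 0 to the rows with a small one.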

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache for the big cluster is a lot larger (2MB) than that
of the little cluster (1MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
for CPUs in the little cluster due to greater CPU performance. These times
are generally greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
because there is more code to execute in the "on finisher" compared to the
"suspend finisher" (for example, GIC redistributor register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are lower than those for the little CPUs due to
greater CPU performance.

We suspect the time for lead CPU 4 is shorter than that for CPU 5 due to subtle
cache effects, given that these measurements are at the nanosecond level.

--------------

*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*

.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d