PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
implementation, using the in-built Performance Measurement Framework (PMF) and
runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has a 4-core Cortex-A53
cluster and a 2-core Cortex-A57 cluster running at the following frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50 MHz on Juno.
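
As a rough illustration of what this counter frequency means for the figures in
this document, the sketch below converts raw counter ticks to microseconds. The
``ticks_to_us`` helper is hypothetical (it is not part of TF-A or TFTF) and
assumes the 50 MHz counter frequency quoted above, so that one tick corresponds
to 0.02 µs.

.. code:: shell

    # Hypothetical helper: convert generic counter ticks to microseconds.
    # $1 = tick count, $2 = counter frequency in Hz (assumed 50 MHz on Juno).
    ticks_to_us() {
        awk -v ticks="$1" -v hz="${2:-50000000}" \
            'BEGIN { printf "%.2f\n", ticks * 1000000 / hz }'
    }

    ticks_to_us 1       # a single tick: prints 0.02
    ticks_to_us 12188   # illustrative tick count: prints 243.76

Under this assumption, no latency in this document can be resolved more finely
than 0.02 µs.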

The following source trees and binaries were used:

- TF-A [`v2.9-rc0`_]
- TFTF [`v2.9-rc0`_]

Please see the Runtime Instrumentation `Testing Methodology`_ page for more
details.

Procedure
---------

#. Build TFTF with runtime instrumentation enabled:

   .. code:: shell

      make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
         TESTS=runtime-instrumentation all

#. Fetch Juno's SCP binary from TF-A's archive:

   .. code:: shell

      curl --fail --connect-timeout 5 --retry 5 -sLS -o scp_bl2.bin \
         https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/juno/release/juno-bl2.bin

#. Build TF-A with the following build options:

   .. code:: shell

      make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
         BL33="/path/to/tftf.bin" SCP_BL2="scp_bl2.bin" \
         ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all fip

#. Load the following images onto the development board: ``fip.bin``,
   ``scp_bl2.bin``.

Results
-------

``CPU_SUSPEND`` to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in parallel

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 243.76    | 239.92  | 6.32        |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 663.5     | 30.32   | 167.82      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 105.12    | 22.84   | 5.88        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 384.16    | 19.06   | 4.7         |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 523.98    | 270.46  | 4.74        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 950.54    | 220.9   | 89.2        |
    +---------+------+-----------+---------+-------------+

.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in serial

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 266.96    | 31.74   | 167.92      |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 266.9     | 31.52   | 167.82      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 279.86    | 23.42   | 87.52       |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 101.38    | 18.8    | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 101.18    | 19.28   | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 101.32    | 19.02   | 4.62        |
    +---------+------+-----------+---------+-------------+

``CPU_SUSPEND`` to power level 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in parallel

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 661.94    | 22.88   | 9.66        |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 801.64    | 23.38   | 9.62        |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 105.56    | 16.02   | 8.12        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 245.42    | 16.26   | 7.78        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 384.42    | 16.1    | 7.84        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 523.74    | 15.4    | 8.02        |
    +---------+------+-----------+---------+-------------+

.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 102.16    | 23.64   | 6.7         |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 101.66    | 23.78   | 6.6         |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 277.74    | 15.96   | 4.66        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 98.0      | 15.88   | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 97.66     | 15.88   | 4.62        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 97.76     | 15.38   | 4.64        |
    +---------+------+-----------+---------+-------------+

``CPU_OFF`` on all non-lead CPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``CPU_OFF`` on all non-lead CPUs in sequence, then ``CPU_SUSPEND`` on the lead
core to the deepest power level.

.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 265.38    | 34.12   | 167.36      |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 265.72    | 33.98   | 167.48      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 185.3     | 23.18   | 87.42       |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 101.58    | 23.46   | 4.48        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 101.66    | 22.02   | 4.72        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 101.48    | 22.22   | 4.52        |
    +---------+------+-----------+---------+-------------+

``PSCI_VERSION`` in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores

    +-------------+--------+--------------+
    | Cluster     | Core   | Latency      |
    +=============+========+==============+
    | 0           | 0      | 1.22         |
    +-------------+--------+--------------+
    | 0           | 1      | 1.2          |
    +-------------+--------+--------------+
    | 1           | 0      | 0.6          |
    +-------------+--------+--------------+
    | 1           | 1      | 1.08         |
    +-------------+--------+--------------+
    | 1           | 2      | 1.04         |
    +-------------+--------+--------------+
    | 1           | 3      | 1.04         |
    +-------------+--------+--------------+

Annotated Historic Results
--------------------------

The following results are based on the upstream `TF master as of 31/01/2017`_.
TF-A was built using the same build instructions as detailed in the procedure
above.

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.

``PSCI_ENTRY`` corresponds to the powerdown latency, ``PSCI_EXIT`` the wakeup
latency, and ``CFLUSH_OVERHEAD`` the latency of the cache flush operation.

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.
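
One way to read the contention out of the table above is to look at how much
later each little-cluster CPU completes ``PSCI_ENTRY`` than the previous one.
The sketch below does that arithmetic. ``entry_deltas`` is a hypothetical helper
written only for this illustration, and its arguments are the ``PSCI_ENTRY``
times (in µs) for CPUs 0-3 taken from the table.

.. code:: shell

    # Hypothetical helper: print the difference between each value and the
    # one before it, one difference per line.
    entry_deltas() {
        printf '%s\n' "$@" | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
    }

    entry_deltas 27 114 202 375   # prints 87, 88 and 173 on separate lines

The first two steps are nearly equal, which is what one would expect if each
CPU waits for roughly one lock hold time. The larger final step belongs to
CPU 3, which also has the largest ``CFLUSH_OVERHEAD`` in the little cluster.
This is a reading of the published figures, not an additional measurement.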

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache for the big cluster is a lot larger (2MB) than that of
the little cluster (1MB).

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+


There is no lock contention in TF generic code at power level 0, but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 is flushed (L1).
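
The contrast between the serialized ``PSCI_ENTRY`` path and the lock-free
``PSCI_EXIT`` path can be made concrete by comparing the spread of each column.
The following sketch uses a hypothetical ``spread`` helper, written only for
this illustration, with the values (in µs) for CPUs 0-5 from the power level 0
parallel table above.

.. code:: shell

    # Hypothetical helper: print the minimum, maximum and range of its
    # integer arguments.
    spread() {
        printf '%s\n' "$@" | awk '
            NR == 1      { min = max = $1 + 0 }
            $1 + 0 < min { min = $1 + 0 }
            $1 + 0 > max { max = $1 + 0 }
            END          { printf "min=%d max=%d spread=%d\n", min, max, max - min }'
    }

    spread 116 204 287 376 29 21   # PSCI_ENTRY: min=21 max=376 spread=355
    spread 14 14 13 13 15 15       # PSCI_EXIT:  min=13 max=15 spread=2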

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
are large because all other CPUs in the cluster are powered down during the
test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache for the big cluster is a lot larger (2MB) than that
of the little cluster (1MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.

``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the cache to power level 0 (L1). This is the best case
scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
for the CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the last test because the
cluster remains powered on throughout the test and there is less code to execute
on power on (for example, no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program the wake-up timer and suspend the lead CPU to the deepest power
   level.

3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache for the big cluster is a lot larger (2MB) than that
of the little cluster (1MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
for CPUs in the little cluster due to greater CPU performance. These times
are generally greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
because there is more code to execute in the "on finisher" compared to the
"suspend finisher" (for example, GIC redistributor register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are less than those for the little CPUs due to
greater CPU performance.

We suspect the time for lead CPU 4 is shorter than that for CPU 5 due to subtle
cache effects, given that these measurements are at the nanosecond level.

--------------

*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*

.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d
.. _v2.9-rc0: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?h=v2.9-rc0
.. _Testing Methodology: ../perf/psci-performance-methodology.html