PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the ARM Trusted Firmware (TF) Power State Coordination Interface
(PSCI) implementation, using the in-built Performance Measurement Framework
(PMF) and runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has a cluster of 4 x
Cortex-A53 CPUs and a cluster of 2 x Cortex-A57 CPUs, running at the following
frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

We used the upstream `TF master as of 31/01/2017`_, building the platform using
the ``ENABLE_RUNTIME_INSTRUMENTATION`` option:

::

    make PLAT=juno ENABLE_RUNTIME_INSTRUMENTATION=1 \
        SCP_BL2=<path/to/scp-fw.bin>                \
        BL33=<path/to/test-fw.bin>                  \
        all fip

When using the debug build of TF, there was no noticeable difference in the
results.

The tests are based on an ARM-internal test framework. The release build of this
framework was used because the results in the debug build became skewed; the
console output prevented some of the tests from executing in parallel.

The tests consist of both parallel and sequential tests, which are broadly
described as follows:

- **Parallel Tests** This type of test powers on all the non-lead CPUs and
  brings them and the lead CPU to a common synchronization point. The lead CPU
  then initiates the test on all CPUs in parallel.

- **Sequential Tests** This type of test powers on each non-lead CPU in
  sequence. The lead CPU initiates the test on a non-lead CPU, then waits for
  the test to complete before proceeding to the next non-lead CPU. The lead CPU
  then executes the test on itself.
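
The orchestration of the two test types can be sketched as follows. This is a
hypothetical Python model for illustration only; the actual ARM-internal test
framework runs on bare metal and its source is not public:

```python
import threading

def run_parallel(cpus, test):
    # Parallel test: all CPUs meet at a common synchronization
    # point, then the test runs on every CPU at the same time.
    barrier = threading.Barrier(len(cpus))

    def worker(cpu):
        barrier.wait()  # common synchronization point
        test(cpu)

    threads = [threading.Thread(target=worker, args=(c,)) for c in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_sequential(cpus, test, lead):
    # Sequential test: the lead CPU drives each non-lead CPU in
    # turn, waiting for completion before moving on, and finally
    # runs the test on itself.
    for cpu in cpus:
        if cpu == lead:
            continue
        t = threading.Thread(target=test, args=(cpu,))
        t.start()
        t.join()  # wait before proceeding to the next CPU
    test(lead)
```

With ``cpus = [4, 0, 1, 2, 3, 5]`` and CPU 4 as the lead, ``run_sequential``
visits CPUs 0, 1, 2, 3 and 5 in turn and then CPU 4 itself, matching the
description above.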

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.

``PSCI_ENTRY`` refers to the time taken from entering the TF PSCI implementation
to the point the hardware enters the low power state (WFI). Referring to the TF
runtime instrumentation points, this corresponds to:
``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.

``PSCI_EXIT`` refers to the time taken from the point the hardware exits the low
power state to exiting the TF PSCI implementation. This corresponds to:
``(RT_INSTR_EXIT_PSCI - RT_INSTR_EXIT_HW_LOW_PWR)``.

``CFLUSH_OVERHEAD`` refers to the part of ``PSCI_ENTRY`` taken to flush the
caches. This corresponds to: ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.

Note that there is very little variance (~1us) in the values observed, although
the values for each CPU are sometimes interchanged, depending on the order in
which locks are acquired. There is also very little variance between executing
the tests sequentially in a single boot and rebooting between tests.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50MHz on Juno.
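
In other words, each metric is the difference of two raw generic counter values,
scaled by the 50MHz counter frequency. A minimal sketch of that arithmetic (the
instrumentation-point names and counter frequency come from this document; the
sample timestamp values are invented for illustration):

```python
CNTFRQ_HZ = 50_000_000  # Juno generic counter frequency (50MHz)

def ticks_to_us(ticks):
    # One tick at 50MHz is 0.02us.
    return ticks * 1_000_000 / CNTFRQ_HZ

def psci_metrics(ts):
    # ts maps an instrumentation-point name to its raw counter value.
    return {
        "PSCI_ENTRY": ticks_to_us(
            ts["RT_INSTR_ENTER_HW_LOW_PWR"] - ts["RT_INSTR_ENTER_PSCI"]),
        "PSCI_EXIT": ticks_to_us(
            ts["RT_INSTR_EXIT_PSCI"] - ts["RT_INSTR_EXIT_HW_LOW_PWR"]),
        "CFLUSH_OVERHEAD": ticks_to_us(
            ts["RT_INSTR_EXIT_CFLUSH"] - ts["RT_INSTR_ENTER_CFLUSH"]),
    }

# Invented raw counter values for a single CPU:
sample = {
    "RT_INSTR_ENTER_PSCI": 1_000,
    "RT_INSTR_ENTER_CFLUSH": 1_200,
    "RT_INSTR_EXIT_CFLUSH": 1_450,
    "RT_INSTR_ENTER_HW_LOW_PWR": 2_000,
    "RT_INSTR_EXIT_HW_LOW_PWR": 9_000,
    "RT_INSTR_EXIT_PSCI": 10_000,
}
```

With these sample values, ``PSCI_ENTRY`` works out to 20us, ``PSCI_EXIT`` to
20us and ``CFLUSH_OVERHEAD`` to 5us.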

Results and Commentary
----------------------

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache for the big cluster (2MB) is a lot larger than that for
the little cluster (1MB).

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+

There is no lock contention in TF generic code at power level 0, but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and more consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 (L1) is flushed.

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead
cluster are large because all other CPUs in the cluster are powered down during
the test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache for the big cluster (2MB) is a lot larger than that
for the little cluster (1MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.

``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the caches to power level 0 (L1). This is the best case
scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
those for CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the last test because the
cluster remains powered on throughout the test and there is less code to execute
on power on (for example, there is no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program the wake up timer and suspend the lead CPU to the deepest power
   level.

3. Call ``CPU_ON`` on each non-lead CPU to retrieve the timestamps from each
   CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache for the big cluster (2MB) is a lot larger than that
for the little cluster (1MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
those for CPUs in the little cluster due to greater CPU performance. These times
are generally greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
because there is more code to execute in the "on finisher" than in the
"suspend finisher" (for example, GIC redistributor register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are less than those for the little CPUs due to
greater CPU performance.

We suspect the time for lead CPU 4 is shorter than that for CPU 5 due to subtle
cache effects, given that these measurements are at the nanosecond level.

.. _Juno R1 platform: https://www.arm.com/files/pdf/Juno_r1_ARM_Dev_datasheet.pdf
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d