PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
implementation, using the in-built Performance Measurement Framework (PMF) and
runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has one cluster of 4 x
Cortex-A53 cores and one cluster of 2 x Cortex-A57 cores, running at the
following frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50 MHz on Juno.

The following source trees and binaries were used:

- TF-A [`v2.9-rc0`_]
- TFTF [`v2.9-rc0`_]

Please see the Runtime Instrumentation `Testing Methodology`_ page for more
details.

Procedure
---------

#. Build TFTF with runtime instrumentation enabled:

    .. code:: shell

        make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
            TESTS=runtime-instrumentation all

#. Fetch Juno's SCP binary from TF-A's archive:

    .. code:: shell

        curl --fail --connect-timeout 5 --retry 5 -sLS -o scp_bl2.bin \
            https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/juno/release/juno-bl2.bin

#. Build TF-A with the following build options:

    .. code:: shell

        make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
            BL33="/path/to/tftf.bin" SCP_BL2="scp_bl2.bin" \
            ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all fip

#. Load the following images onto the development board: ``fip.bin``,
   ``scp_bl2.bin``.

Results
-------

``CPU_SUSPEND`` to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in parallel

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 243.76    | 239.92  | 6.32        |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 663.5     | 30.32   | 167.82      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 105.12    | 22.84   | 5.88        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 384.16    | 19.06   | 4.7         |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 523.98    | 270.46  | 4.74        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 950.54    | 220.9   | 89.2        |
    +---------+------+-----------+---------+-------------+

.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in serial

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 266.96    | 31.74   | 167.92      |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 266.9     | 31.52   | 167.82      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 279.86    | 23.42   | 87.52       |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 101.38    | 18.8    | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 101.18    | 19.28   | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 101.32    | 19.02   | 4.62        |
    +---------+------+-----------+---------+-------------+

``CPU_SUSPEND`` to power level 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in parallel

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 661.94    | 22.88   | 9.66        |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 801.64    | 23.38   | 9.62        |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 105.56    | 16.02   | 8.12        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 245.42    | 16.26   | 7.78        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 384.42    | 16.1    | 7.84        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 523.74    | 15.4    | 8.02        |
    +---------+------+-----------+---------+-------------+

.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 102.16    | 23.64   | 6.7         |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 101.66    | 23.78   | 6.6         |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 277.74    | 15.96   | 4.66        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 98.0      | 15.88   | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 97.66     | 15.88   | 4.62        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 97.76     | 15.38   | 4.64        |
    +---------+------+-----------+---------+-------------+

``CPU_OFF`` on all non-lead CPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``CPU_OFF`` on all non-lead CPUs in sequence, then ``CPU_SUSPEND`` on the lead
core to the deepest power level.

.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 265.38    | 34.12   | 167.36      |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 265.72    | 33.98   | 167.48      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 185.3     | 23.18   | 87.42       |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 101.58    | 23.46   | 4.48        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 101.66    | 22.02   | 4.72        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 101.48    | 22.22   | 4.52        |
    +---------+------+-----------+---------+-------------+

``PSCI_VERSION`` in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores

    +-------------+--------+--------------+
    | Cluster     | Core   | Latency      |
    +=============+========+==============+
    | 0           | 0      | 1.22         |
    +-------------+--------+--------------+
    | 0           | 1      | 1.2          |
    +-------------+--------+--------------+
    | 1           | 0      | 0.6          |
    +-------------+--------+--------------+
    | 1           | 1      | 1.08         |
    +-------------+--------+--------------+
    | 1           | 2      | 1.04         |
    +-------------+--------+--------------+
    | 1           | 3      | 1.04         |
    +-------------+--------+--------------+

Annotated Historic Results
--------------------------

The following results are based on the upstream `TF master as of 31/01/2017`_.
TF-A was built using the same build instructions as detailed in the procedure
above.

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.

``PSCI_ENTRY`` corresponds to the powerdown latency, ``PSCI_EXIT`` the wakeup
latency, and ``CFLUSH_OVERHEAD`` the latency of the cache flush operation.

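
As a rough illustration of how such latencies can be derived, the sketch below
converts pairs of generic counter timestamps into microseconds, using the
50 MHz counter frequency given in the Method section. The field names and the
choice of which timestamps delimit each interval are assumptions made for this
example and are not taken from the PMF sources; the tick values are invented.

.. code:: python

    from dataclasses import dataclass

    # Generic counter frequency on Juno, as stated in the Method section.
    COUNTER_HZ = 50_000_000


    @dataclass
    class Timestamps:
        """Raw counter values around one PSCI call (hypothetical field names)."""
        enter_psci: int
        enter_cflush: int
        exit_cflush: int
        enter_low_power: int
        exit_low_power: int
        exit_psci: int


    def ticks_to_us(ticks: int, hz: int = COUNTER_HZ) -> float:
        """Convert a tick count to microseconds."""
        return ticks * 1_000_000 / hz


    def latencies_us(ts: Timestamps) -> dict:
        """Derive the three reported latencies (assumed interval boundaries)."""
        return {
            "PSCI_ENTRY": ticks_to_us(ts.enter_low_power - ts.enter_psci),
            "PSCI_EXIT": ticks_to_us(ts.exit_psci - ts.exit_low_power),
            "CFLUSH_OVERHEAD": ticks_to_us(ts.exit_cflush - ts.enter_cflush),
        }


    # Invented tick values, for illustration only.
    sample = Timestamps(
        enter_psci=1000,
        enter_cflush=1200,
        exit_cflush=1450,
        enter_low_power=2350,
        exit_low_power=50_000,
        exit_psci=51_000,
    )
    result = latencies_us(sample)
    print(result)

At 50 MHz one tick is 20 ns, so no reported latency can be finer than 0.02 µs.
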

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache size for the big cluster is a lot larger (2 MB) compared to
the little cluster (1 MB).

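
As a quick sanity check of the cache size explanation, the sketch below compares
the ratio of the two ``CFLUSH_OVERHEAD`` values from the table above with the
ratio of the L2 cache sizes. It only restates numbers already given in this
section. The variable names are illustrative.

.. code:: python

    # Values copied from the table above: the last CPU to power down in each
    # cluster, which is the one that flushes the L2 cache as well as the L1.
    cflush_us = {3: 94, 5: 206}
    # L2 cache sizes quoted in the text: little cluster (CPU 3), big cluster (CPU 5).
    l2_size_mb = {3: 1, 5: 2}

    flush_ratio = cflush_us[5] / cflush_us[3]
    size_ratio = l2_size_mb[5] / l2_size_mb[3]

    print(f"flush time ratio: {flush_ratio:.2f}, L2 size ratio: {size_ratio:.2f}")
    # prints "flush time ratio: 2.19, L2 size ratio: 2.00"

The flush time grows roughly in proportion to the L2 size, which is consistent
with the explanation above, although a single pair of measurements cannot
confirm it.
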

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+

There is no lock contention in TF generic code at power level 0 but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 is flushed (L1).

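
One way to see the serialisation of SCP commands described above is to look at
the spacing between successive ``PSCI_ENTRY`` times on the little cluster. If
each CPU waits for the previous command to complete, the times should grow in
roughly equal steps. The sketch below uses only the values from the table above.
The variable names are illustrative.

.. code:: python

    # PSCI_ENTRY times (us) for the little-cluster CPUs, from the table above.
    psci_entry_us = {0: 116, 1: 204, 2: 287, 3: 376}

    ordered = sorted(psci_entry_us.values())
    # Difference between each CPU's entry time and the next one in order.
    steps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_step = sum(steps) / len(steps)

    print(steps, round(mean_step, 1))
    # prints "[88, 83, 89] 86.7"

The near-constant step of about 87 µs is what a fixed per-command cost on a
single shared channel would produce.
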

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
are large because all other CPUs in the cluster are powered down during the
test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache size for the big cluster is a lot larger (2 MB)
compared to the little cluster (1 MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.


``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the cache to power level 0 (L1). This is the best-case
scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
for the CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the last test because the
cluster remains powered on throughout the test and there is less code to execute
on power on (for example, no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program wake up timer and suspend the lead CPU to the deepest power level.

3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache size for the big cluster is a lot larger (2 MB)
compared to the little cluster (1 MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
for CPUs in the little cluster due to greater CPU performance.
These times are generally greater than the ``PSCI_EXIT`` times in the
``CPU_SUSPEND`` tests because there is more code to execute in the "on
finisher" compared to the "suspend finisher" (for example, GIC redistributor
register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are less than the little CPUs due to greater CPU
performance.

We suspect the time for lead CPU 4 is shorter than CPU 5 due to subtle cache
effects, given that these measurements are at the nanosecond level.

--------------

*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*

.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d
.. _v2.9-rc0: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?h=v2.9-rc0
.. _Testing Methodology: ../perf/psci-performance-methodology.html