PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
implementation, using the in-built Performance Measurement Framework (PMF) and
runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has one cluster of 4 x
Cortex-A53 cores and one cluster of 2 x Cortex-A57 cores, running at the
following frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50 MHz on Juno.

The following source trees and binaries were used:

- TF-A [`v2.9-rc0`_]
- TFTF [`v2.9-rc0`_]

Please see the Runtime Instrumentation :ref:`Testing Methodology
<Runtime Instrumentation Methodology>` page for more details.

Procedure
---------

#. Build TFTF with runtime instrumentation enabled:

   .. code:: shell

      make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
          TESTS=runtime-instrumentation all

#. Fetch Juno's SCP binary from TF-A's archive:

   .. code:: shell

      curl --fail --connect-timeout 5 --retry 5 -sLS -o scp_bl2.bin \
          https://downloads.trustedfirmware.org/tf-a/css_scp_2.12.0/juno/release/juno-bl2.bin

#. Build TF-A with the following build options:

   .. code:: shell

      make CROSS_COMPILE=aarch64-none-elf- PLAT=juno \
          BL33="/path/to/tftf.bin" SCP_BL2="scp_bl2.bin" \
          ENABLE_RUNTIME_INSTRUMENTATION=1 fiptool all fip

#. Load the following images onto the development board: ``fip.bin``,
   ``scp_bl2.bin``.

Results
-------

``CPU_SUSPEND`` to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
    parallel

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 243.76    | 239.92  | 6.32        |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 663.5     | 30.32   | 167.82      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 105.12    | 22.84   | 5.88        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 384.16    | 19.06   | 4.7         |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 523.98    | 270.46  | 4.74        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 950.54    | 220.9   | 89.2        |
    +---------+------+-----------+---------+-------------+

.. table:: ``CPU_SUSPEND`` latencies (µs) to deepest power level in
    serial

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 266.96    | 31.74   | 167.92      |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 266.9     | 31.52   | 167.82      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 279.86    | 23.42   | 87.52       |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 101.38    | 18.8    | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 101.18    | 19.28   | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 101.32    | 19.02   | 4.62        |
    +---------+------+-----------+---------+-------------+

``CPU_SUSPEND`` to power level 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in
    parallel

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 661.94    | 22.88   | 9.66        |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 801.64    | 23.38   | 9.62        |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 105.56    | 16.02   | 8.12        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 245.42    | 16.26   | 7.78        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 384.42    | 16.1    | 7.84        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 523.74    | 15.4    | 8.02        |
    +---------+------+-----------+---------+-------------+

.. table:: ``CPU_SUSPEND`` latencies (µs) to power level 0 in serial

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 102.16    | 23.64   | 6.7         |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 101.66    | 23.78   | 6.6         |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 277.74    | 15.96   | 4.66        |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 98.0      | 15.88   | 4.64        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 97.66     | 15.88   | 4.62        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 97.76     | 15.38   | 4.64        |
    +---------+------+-----------+---------+-------------+

``CPU_OFF`` on all non-lead CPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``CPU_OFF`` on all non-lead CPUs in sequence, then ``CPU_SUSPEND`` on the lead
core to the deepest power level.

.. table:: ``CPU_OFF`` latencies (µs) on all non-lead CPUs

    +---------+------+-----------+---------+-------------+
    | Cluster | Core | Powerdown | Wakeup  | Cache Flush |
    +=========+======+===========+=========+=============+
    | 0       | 0    | 265.38    | 34.12   | 167.36      |
    +---------+------+-----------+---------+-------------+
    | 0       | 1    | 265.72    | 33.98   | 167.48      |
    +---------+------+-----------+---------+-------------+
    | 1       | 0    | 185.3     | 23.18   | 87.42       |
    +---------+------+-----------+---------+-------------+
    | 1       | 1    | 101.58    | 23.46   | 4.48        |
    +---------+------+-----------+---------+-------------+
    | 1       | 2    | 101.66    | 22.02   | 4.72        |
    +---------+------+-----------+---------+-------------+
    | 1       | 3    | 101.48    | 22.22   | 4.52        |
    +---------+------+-----------+---------+-------------+

``PSCI_VERSION`` in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: ``PSCI_VERSION`` latency (µs) in parallel on all cores

    +-------------+--------+--------------+
    | Cluster     | Core   | Latency      |
    +=============+========+==============+
    | 0           | 0      | 1.22         |
    +-------------+--------+--------------+
    | 0           | 1      | 1.2          |
    +-------------+--------+--------------+
    | 1           | 0      | 0.6          |
    +-------------+--------+--------------+
    | 1           | 1      | 1.08         |
    +-------------+--------+--------------+
    | 1           | 2      | 1.04         |
    +-------------+--------+--------------+
    | 1           | 3      | 1.04         |
    +-------------+--------+--------------+

Annotated Historic Results
--------------------------

The following results are based on the upstream `TF master as of 31/01/2017`_.
TF-A was built using the same build instructions as detailed in the procedure
above.

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.

``PSCI_ENTRY`` corresponds to the powerdown latency, ``PSCI_EXIT`` the wakeup
latency, and ``CFLUSH_OVERHEAD`` the latency of the cache flush operation.
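
The latencies in this document are derived from generic counter timestamps.
The following is a minimal sketch of the tick-to-time conversion, assuming the
50 MHz counter frequency quoted in the Method section. The name
``ticks_to_us`` and the sample tick values are hypothetical and are not part of
PMF.

.. code:: python

    # Generic counter frequency on Juno, as stated in the Method section.
    COUNTER_FREQ_HZ = 50_000_000


    def ticks_to_us(start_ticks: int, end_ticks: int,
                    freq_hz: int = COUNTER_FREQ_HZ) -> float:
        """Convert a pair of counter timestamps to elapsed microseconds."""
        return (end_ticks - start_ticks) * 1_000_000 / freq_hz


    # One tick at 50 MHz lasts 20 ns, so the measurement resolution is 0.02 us.
    resolution_us = ticks_to_us(0, 1)

    # Hypothetical timestamps taken 61 ticks apart.
    example_us = ticks_to_us(1000, 1061)

    print(resolution_us, example_us)

With a 20 ns tick, every reported value is a multiple of 0.02 µs, so
differences of a few hundredths of a microsecond between rows are at the limit
of what the counter can resolve.
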

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache size for the big cluster is a lot larger (2MB) compared to
the little cluster (1MB).
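
As a rough check of the cache-size explanation above, the flush times in the
table can be compared directly. The sketch below assumes that subtracting the
L1-only flush time of another CPU in the same cluster isolates the L2 part of
the flush; this is a simplification for illustration, not something the
measurements establish.

.. code:: python

    # CFLUSH_OVERHEAD values (us) per CPU, copied from the table above.
    cflush_us = {0: 5, 1: 5, 2: 5, 3: 94, 4: 6, 5: 206}

    # Assumption: last-CPU flush minus an L1-only flush approximates the
    # cost of flushing L2 alone.
    little_l2_us = cflush_us[3] - cflush_us[0]
    big_l2_us = cflush_us[5] - cflush_us[4]

    ratio = big_l2_us / little_l2_us
    print(f"little L2 ~{little_l2_us} us, big L2 ~{big_l2_us} us, "
          f"ratio {ratio:.2f}")

The resulting ratio is in the region of the 2:1 ratio between the L2 cache
sizes quoted above. Other factors, such as the different core frequencies and
the amount of dirty data in each cache, also influence flush time, so only a
rough match should be expected.
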

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+

There is no lock contention in TF generic code at power level 0 but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 is flushed (L1).
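
The serialisation of SCP power down commands described above is visible as a
roughly constant step between the ``PSCI_ENTRY`` times of successive
little-cluster CPUs. A minimal sketch of that arithmetic, using values from the
table above (the variable names are illustrative):

.. code:: python

    # PSCI_ENTRY values (us) for little-cluster CPUs 0-3, from the table above.
    psci_entry_us = [116, 204, 287, 376]

    # Increment between each CPU and the one before it.
    steps_us = [b - a for a, b in zip(psci_entry_us, psci_entry_us[1:])]
    mean_step_us = sum(steps_us) / len(steps_us)

    print(steps_us)
    print(round(mean_step_us, 1))

Each additional CPU adds a little under 90 µs, which is consistent with each
power down command waiting for the previous one to complete rather than with
independent, overlapping requests.
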

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
are large because all other CPUs in the cluster are powered down during the
test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache size for the big cluster is a lot larger (2MB)
compared to the little cluster (1MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.

``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the cache associated with power level 0 (L1). This is
the best-case scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
for the CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the last test because the
cluster remains powered on throughout the test and there is less code to execute
on power on (for example, no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program wake up timer and suspend the lead CPU to the deepest power level.

3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache size for the big cluster is a lot larger (2MB)
compared to the little cluster (1MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
for CPUs in the little cluster due to greater CPU performance.
These times are generally greater than the ``PSCI_EXIT`` times in the
``CPU_SUSPEND`` tests because there is more code to execute in the "on
finisher" compared to the "suspend finisher" (for example, GIC redistributor
register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are less than the little CPUs due to greater CPU
performance.

We suspect the time for lead CPU 4 is shorter than CPU 5 due to subtle cache
effects, given that these measurements are at the nanosecond level.

--------------

*Copyright (c) 2019-2023, Arm Limited and Contributors. All rights reserved.*

.. _Juno R1 platform: https://developer.arm.com/documentation/100122/latest/
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d
.. _v2.9-rc0: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?h=v2.9-rc0