perf-stat(1)
============

NAME
----
perf-stat - Run a command and gather performance counter statistics

SYNOPSIS
--------
[verse]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] <command>
'perf stat' [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
'perf stat' report [-i file]

DESCRIPTION
-----------
This command runs a command and gathers performance counter statistics
from it.


OPTIONS
-------
<command>...::
        Any command you can specify in a shell.

record::
        See STAT RECORD.

report::
        See STAT REPORT.

-e::
--event=::
        Select the PMU event. Selection can be:

        - a symbolic event name (use 'perf list' to list all events)

        - a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a
          hexadecimal event descriptor.

        - a symbolic or raw PMU event followed by an optional colon
          and a list of event modifiers, e.g., cpu-cycles:p.  See the
          linkperf:perf-list[1] man page for details on event modifiers.

        - a symbolically formed event like 'pmu/param1=0x3,param2/' where
          param1 and param2 are defined as formats for the PMU in
          /sys/bus/event_source/devices/<pmu>/format/*

          'percore' is an event qualifier that sums up the event counts for both
          hardware threads in a core. For example:
          perf stat -A -a -e cpu/event,percore=1/,otherevent ...

        - a symbolically formed event like 'pmu/config=M,config1=N,config2=K/'
          where M, N, K are numbers (in decimal, hex, octal format).
          Acceptable values for each of 'config', 'config1' and 'config2'
          parameters are defined by corresponding entries in
          /sys/bus/event_source/devices/<pmu>/format/*

        Note that the last two syntaxes support prefix and glob matching in
        the PMU name to simplify creation of events across multiple instances
        of the same type of PMU in large systems (e.g. memory controller PMUs).
        Multiple PMU instances are typical for uncore PMUs, so the prefix
        'uncore_' is also ignored when performing this match.
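
As a rough illustration of the raw rNNN form described above, the following
shell sketch builds a descriptor from an event select and a unit mask. It
assumes the common x86 core PMU layout (event select in bits 0-7, unit mask
in bits 8-15), which this page does not state and which does not hold for
every PMU. The helper name raw_event is hypothetical.

```shell
# Hypothetical helper: build an rNNN raw event string.
# Assumption (not from this page): x86 core PMU layout,
# config = (umask << 8) | eventsel.
raw_event() {
    # $1 = event select, $2 = unit mask (decimal or 0x-prefixed hex)
    printf 'r%x\n' $(( ($2 << 8) | $1 ))
}

raw_event 0x2e 0x41    # prints r412e
```

The resulting string can be passed to -e, e.g. 'perf stat -e r412e -- <command>'.
Whether a given encoding names a meaningful event depends on the CPU.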


-i::
--no-inherit::
        child tasks do not inherit counters

-p::
--pid=<pid>::
        stat events on existing process id (comma separated list)

-t::
--tid=<tid>::
        stat events on existing thread id (comma separated list)

ifdef::HAVE_LIBPFM[]
--pfm-events events::
Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
including support for event filters. For example '--pfm-events
inst_retired:any_p:u:c=1:i'. More than one event can be passed to the
option using the comma separator. Hardware events and generic hardware
events cannot be mixed together. The latter must be used with the -e
option. The -e option and this one can be mixed and matched.  Events
can be grouped using the {} notation.
endif::HAVE_LIBPFM[]

-a::
--all-cpus::
        system-wide collection from all CPUs (default if no target is specified)

--no-scale::
        Don't scale/normalize counter values

-d::
--detailed::
        print more detailed statistics, can be specified up to 3 times

           -d:          detailed events, L1 and LLC data cache
        -d -d:     more detailed events, dTLB and iTLB events
     -d -d -d:     very detailed events, adding prefetch events

-r::
--repeat=<n>::
        repeat command and print average + stddev (max: 100). 0 means forever.

-B::
--big-num::
        print large numbers with thousands' separators according to locale.
        Enabled by default. Use "--no-big-num" to disable.
        Default setting can be changed with "perf config stat.big-num=false".

-C::
--cpu=::
Count only on the list of CPUs provided. Multiple CPUs can be provided as a
comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
In per-thread mode, this option is ignored. The -a option is still necessary
to activate system-wide monitoring. Default is to count on all CPUs.

-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs.
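
The CPU list notation accepted by -C above (comma-separated entries, ranges
written with -) can be sketched in plain shell as follows. This only
illustrates the notation and is not how perf itself parses the list. The
helper name expand_cpu_list is hypothetical.

```shell
# Hypothetical helper: expand a -C style CPU list such as "0,2-4".
# The function body runs in a subshell, so IFS and the variables stay local.
expand_cpu_list() (
    IFS=,
    out=
    for item in $1; do
        case $item in
            *-*) lo=${item%-*}; hi=${item#*-} ;;   # range "lo-hi"
            *)   lo=$item; hi=$item ;;             # single CPU
        esac
        while [ "$lo" -le "$hi" ]; do
            out="$out${out:+ }$lo"
            lo=$((lo + 1))
        done
    done
    printf '%s\n' "$out"
)

expand_cpu_list 0,2-4    # prints: 0 2 3 4
```

Combined with -A, a list like this yields one row per listed CPU, e.g.
'perf stat -a -A -C 0,2-4 -e cycles -- sleep 1'.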

-n::
--null::
        null run - don't start any counters

-v::
--verbose::
        be more verbose (show counter open errors, etc)

-x SEP::
--field-separator SEP::
print counts using a CSV-style output to make it easy to import directly into
spreadsheets. Columns are separated by the string specified in SEP.

--table:: Display time for each run (-r option), in a table format, e.g.:

  $ perf stat --null -r 5 --table perf bench sched pipe

   Performance counter stats for 'perf bench sched pipe' (5 runs):

             # Table of individual measurements:
             5.189 (-0.293) #
             5.189 (-0.294) #
             5.186 (-0.296) #
             5.663 (+0.181) ##
             6.186 (+0.703) ####

             # Final result:
             5.483 +- 0.198 seconds time elapsed  ( +-  3.62% )

-G name::
--cgroup name::
monitor only in the container (cgroup) called "name". This option is available only
in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
container "name" are monitored when they run on the monitored CPUs. Multiple cgroups
can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
to first event, second cgroup to second event and so on.
It is possible to provide an empty cgroup (monitor all the time) using,
e.g., -G foo,,bar. Cgroups must have corresponding events, i.e., they always
refer to events defined earlier on the command line. To track multiple events
for a specific cgroup, use '-e e1 -e e2 -G foo,foo' or just
'-e e1 -e e2 -G foo'.

To monitor, say, 'cycles' for a cgroup and also system-wide, this
command line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'.

--for-each-cgroup name::
Expand event list for each cgroup in "name" (allow multiple cgroups separated
by comma).  This has the same effect as repeating the -e and -G options for
each event x name combination.  This option cannot be used with the
-G/--cgroup option.

-o file::
--output file::
Print the output into the designated file.

--append::
Append to the output file designated with the -o option. Ignored if -o is not specified.

--log-fd::

Log output to fd, instead of stderr.  Complementary to --output, and mutually exclusive
with it.  --append may be used here.  Examples:
     3>results  perf stat --log-fd 3          -- $cmd
     3>>results perf stat --log-fd 3 --append -- $cmd

--control=fifo:ctl-fifo[,ack-fifo]::
--control=fd:ctl-fd[,ack-fd]::
ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
Listen on ctl-fd descriptor for command to control measurement ('enable': enable events,
'disable': disable events). Measurements can be started with events disabled using
--delay=-1 option. Optionally send control command completion ('ack\n') to ack-fd descriptor
to synchronize with the controlling process. Example of bash shell script to enable and
disable events during measurements:

 #!/bin/bash

 ctl_dir=/tmp/

 ctl_fifo=${ctl_dir}perf_ctl.fifo
 test -p ${ctl_fifo} && unlink ${ctl_fifo}
 mkfifo ${ctl_fifo}
 exec {ctl_fd}<>${ctl_fifo}

 ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
 test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
 mkfifo ${ctl_ack_fifo}
 exec {ctl_fd_ack}<>${ctl_ack_fifo}

 perf stat -D -1 -e cpu-cycles -a -I 1000       \
           --control fd:${ctl_fd},${ctl_fd_ack} \
           -- sleep 30 &
 perf_pid=$!

 sleep 5  && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
 sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"

 exec {ctl_fd_ack}>&-
 unlink ${ctl_ack_fifo}

 exec {ctl_fd}>&-
 unlink ${ctl_fifo}

 wait -n ${perf_pid}
 exit $?


--pre::
--post::
        Pre and post measurement hooks, e.g.:

perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage

-I msecs::
--interval-print msecs::
Print count deltas every N milliseconds (minimum: 1ms).
The overhead percentage could be high in some cases, for instance with small, sub 100ms intervals.  Use with caution.
        example: 'perf stat -I 1000 -e cycles -a sleep 5'

If a metric exists, it is calculated from the counts generated in this interval and printed after the #.

--interval-count times::
Print count deltas a fixed number of times.
This option should be used together with the "-I" option.
        example: 'perf stat -I 1000 --interval-count 2 -e cycles -a'

--interval-clear::
Clear the screen before the next interval.

--timeout msecs::
Stop the 'perf stat' session and print count deltas after N milliseconds (minimum: 10 ms).
This option is not supported with the "-I" option.
        example: 'perf stat --timeout 2000 -e cycles -a'

--metric-only::
Only print computed metrics. Print them in a single line.
Don't show any raw values. Not supported with --per-thread.

--per-socket::
Aggregate counts per processor socket for system-wide mode measurements.
This is a useful mode to detect imbalance between sockets.  To enable this
mode, use --per-socket in addition to -a (system-wide).  The output includes
the socket number and the number of online processors on that socket. This is
useful to gauge the amount of aggregation.

--per-die::
Aggregate counts per processor die for system-wide mode measurements.  This
is a useful mode to detect imbalance between dies.  To enable this mode,
use --per-die in addition to -a (system-wide).  The output includes the
die number and the number of online processors on that die. This is
useful to gauge the amount of aggregation.

--per-core::
Aggregate counts per physical processor for system-wide mode measurements.  This
is a useful mode to detect imbalance between physical cores.  To enable this mode,
use --per-core in addition to -a (system-wide).  The output includes the
core number and the number of online logical processors on that physical processor.

--per-thread::
Aggregate counts per monitored thread, when monitoring threads (-t option)
or processes (-p option).

--per-node::
Aggregate counts per NUMA node for system-wide mode measurements. This
is a useful mode to detect imbalance between NUMA nodes. To enable this
mode, use --per-node in addition to -a (system-wide).

-D msecs::
--delay msecs::
After starting the program, wait msecs before measuring (-1: start with events
disabled). This is useful to filter out the startup phase of the program,
which is often very different.

-T::
--transaction::

Print statistics of transactional execution if supported.

--metric-no-group::
By default, events to compute a metric are placed in weak groups. The
group tries to enforce scheduling all or none of the events. The
--metric-no-group option places events outside of groups and may
increase the chance of the event being scheduled - leading to more
accuracy. However, as events may not be scheduled together, accuracy
for metrics like instructions per cycle can be lower - as both events
may no longer be measured at the same time.

--metric-no-merge::
By default, metric events in different weak groups can be shared if one
group contains all the events needed by another. In such cases one
group will be eliminated, reducing event multiplexing and making it so
that certain groups of metrics sum to 100%. A downside to sharing a
group is that the group may require multiplexing and so accuracy for a
small group that need not have multiplexing is lowered. This option
forbids the event merging logic from sharing events between groups and
may be used to increase accuracy in this case.

STAT RECORD
-----------
Stores stat data into perf data file.

-o file::
--output file::
Output file name.

STAT REPORT
-----------
Reads and reports stat data from perf data file.

-i file::
--input file::
Input file name.

--per-socket::
Aggregate counts per processor socket for system-wide mode measurements.

--per-die::
Aggregate counts per processor die for system-wide mode measurements.

--per-core::
Aggregate counts per physical processor for system-wide mode measurements.

-M::
--metrics::
Print metrics or metricgroups specified in a comma separated list.
For a group all metrics from the group are added.
The events from the metrics are automatically measured.
See perf list output for the possible metrics and metricgroups.

-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs.

--topdown::
Print top down level 1 metrics if supported by the CPU. This helps
determine bottlenecks in the CPU pipeline for CPU bound workloads,
by breaking the cycles consumed down into frontend bound, backend bound,
bad speculation and retiring.

Frontend bound means that the CPU cannot fetch and decode instructions fast
enough. Backend bound means that computation or memory access is the
bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
mispredictions and similar issues. Retiring means that the CPU computed without
an apparent bottleneck. The bottleneck is only the real bottleneck
if the workload is actually bound by the CPU and not by something else.

For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.

This enables --metric-only, unless overridden with --no-metric-only.

The following restrictions only apply to older Intel CPUs and Atom;
on newer CPUs (IceLake and later) TopDown can be collected for any thread:

The top down metrics are collected per core instead of per
CPU thread. Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
kernel.perf_event_paranoid=-1.

Topdown uses the full Performance Monitoring Unit, and needs
disabling of the NMI watchdog (as root):
echo 0 > /proc/sys/kernel/nmi_watchdog
for best results. Otherwise the bottlenecks may be inconsistent
on workloads with changing phases.

To interpret the results it is usually necessary to know which
CPUs the workload runs on.
If needed, the CPUs can be forced using
taskset.

--no-merge::
Do not merge results from same PMUs.

When multiple events are created from a single event specification,
stat will, by default, aggregate the event counts and show the result
in a single row. This option disables that behavior and shows
the individual events and counts.

Multiple events are created from a single event specification when:
1. Prefix or glob matching is used for the PMU name.
2. Aliases, which are listed immediately after the Kernel PMU events
   by perf list, are used.

--smi-cost::
Measure SMI cost if msr/aperf/ and msr/smi/ events are supported.

During the measurement, /sys/devices/cpu/freeze_on_smi will be set to
freeze core counters on SMI.
The aperf counter will not be affected by the setting.
The cost of SMI can be measured by (aperf - unhalted core cycles).

In practice, the percentage of SMI cycles is very useful for performance
oriented analysis. --metric-only will be applied by default.
The output is SMI cycles%, equal to (aperf - unhalted core cycles) / aperf

Users who want to get the actual value can apply --no-metric-only.

--all-kernel::
Configure all used events to run in kernel space.

--all-user::
Configure all used events to run in user space.
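
For the --smi-cost option described above, the reported percentage follows
directly from the stated formula. The sketch below reproduces the arithmetic
with awk. The counter values are made up for illustration and the helper name
smi_cycles_pct is hypothetical.

```shell
# Hypothetical helper: SMI cycles% = (aperf - unhalted core cycles) / aperf.
# $1 = aperf count, $2 = unhalted core cycles count
smi_cycles_pct() (
    awk -v aperf="$1" -v cycles="$2" 'BEGIN {
        if (aperf <= 0) exit 1          # avoid dividing by zero
        printf "%.1f%%\n", (aperf - cycles) * 100 / aperf
    }'
)

# Made-up counts: 10,000 of 1,000,000 aperf cycles were spent in SMI.
smi_cycles_pct 1000000 990000    # prints 1.0%
```

The raw counts needed as input are what --no-metric-only exposes.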

--percore-show-thread::
The event modifier "percore" sums up the event counts
for all hardware threads in a core and shows the counts per core.

This option, with the event modifier "percore" enabled, also sums up the event
counts for all hardware threads in a core but shows the summed counts per
hardware thread. This is essentially a replacement for the any bit and
convenient for post processing.

--summary::
Print summary for interval mode (-I).

EXAMPLES
--------

$ perf stat -- make

   Performance counter stats for 'make':

        83723.452481      task-clock:u (msec)       #    1.004 CPUs utilized
                   0      context-switches:u        #    0.000 K/sec
                   0      cpu-migrations:u          #    0.000 K/sec
           3,228,188      page-faults:u             #    0.039 M/sec
     229,570,665,834      cycles:u                  #    2.742 GHz
     313,163,853,778      instructions:u            #    1.36  insn per cycle
      69,704,684,856      branches:u                #  832.559 M/sec
       2,078,861,393      branch-misses:u           #    2.98% of all branches

        83.409183620 seconds time elapsed

        74.684747000 seconds user
         8.739217000 seconds sys

TIMINGS
-------
As displayed in the example above we can display 3 types of timings.
We always display the time the counters were enabled/alive:

        83.409183620 seconds time elapsed

For workload sessions we also display the time the workloads spent in
user/system land:

        74.684747000 seconds user
         8.739217000 seconds sys

Those times are the very same as displayed by the 'time' tool.

CSV FORMAT
----------

With -x, perf stat produces not-quite-CSV output: commas in the
output are not put into "". To make it easy to parse, it is
recommended to use a different separator character, like -x \;

The fields are in this order:

        - optional usec time stamp in fractions of second (with -I xxx)
        - optional CPU, core, or socket identifier
        - optional number of logical CPUs aggregated
        - counter value
        - unit of the counter value or empty
        - event name
        - run time of counter
        - percentage of measurement time the counter was running
        - optional variance if multiple values are collected with -r
        - optional metric value
        - optional unit of metric

Additional metrics may be printed with all earlier fields being empty.

SEE ALSO
--------
linkperf:perf-top[1], linkperf:perf-list[1]