*4882a593Smuzhiyunperf-stat(1)
*4882a593Smuzhiyun============
*4882a593Smuzhiyun
*4882a593SmuzhiyunNAME
*4882a593Smuzhiyun----
*4882a593Smuzhiyunperf-stat - Run a command and gather performance counter statistics
*4882a593Smuzhiyun
*4882a593SmuzhiyunSYNOPSIS
*4882a593Smuzhiyun--------
*4882a593Smuzhiyun[verse]
*4882a593Smuzhiyun'perf stat' [-e <EVENT> | --event=EVENT] [-a] <command>
*4882a593Smuzhiyun'perf stat' [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
*4882a593Smuzhiyun'perf stat' [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
*4882a593Smuzhiyun'perf stat' report [-i file]
*4882a593Smuzhiyun
*4882a593SmuzhiyunDESCRIPTION
*4882a593Smuzhiyun-----------
*4882a593SmuzhiyunThis command runs a command and gathers performance counter statistics
*4882a593Smuzhiyunfrom it.
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593SmuzhiyunOPTIONS
*4882a593Smuzhiyun-------
*4882a593Smuzhiyun<command>...::
*4882a593Smuzhiyun	Any command you can specify in a shell.
*4882a593Smuzhiyun
*4882a593Smuzhiyunrecord::
*4882a593Smuzhiyun	See STAT RECORD.
*4882a593Smuzhiyun
*4882a593Smuzhiyunreport::
*4882a593Smuzhiyun	See STAT REPORT.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-e::
*4882a593Smuzhiyun--event=::
*4882a593Smuzhiyun	Select the PMU event. Selection can be:
*4882a593Smuzhiyun
*4882a593Smuzhiyun	- a symbolic event name (use 'perf list' to list all events)
*4882a593Smuzhiyun
*4882a593Smuzhiyun	- a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a
*4882a593Smuzhiyun	  hexadecimal event descriptor.
*4882a593Smuzhiyun
*4882a593Smuzhiyun        - a symbolic or raw PMU event followed by an optional colon
*4882a593Smuzhiyun	  and a list of event modifiers, e.g., cpu-cycles:p.  See the
*4882a593Smuzhiyun	  linkperf:perf-list[1] man page for details on event modifiers.
*4882a593Smuzhiyun
*4882a593Smuzhiyun	- a symbolically formed event like 'pmu/param1=0x3,param2/' where
*4882a593Smuzhiyun	  param1 and param2 are defined as formats for the PMU in
*4882a593Smuzhiyun	  /sys/bus/event_source/devices/<pmu>/format/*
*4882a593Smuzhiyun
	  'percore' is an event qualifier that sums up the event counts for all
	  hardware threads in a core. For example:
*4882a593Smuzhiyun	  perf stat -A -a -e cpu/event,percore=1/,otherevent ...
*4882a593Smuzhiyun
*4882a593Smuzhiyun	- a symbolically formed event like 'pmu/config=M,config1=N,config2=K/'
*4882a593Smuzhiyun	  where M, N, K are numbers (in decimal, hex, octal format).
*4882a593Smuzhiyun	  Acceptable values for each of 'config', 'config1' and 'config2'
*4882a593Smuzhiyun	  parameters are defined by corresponding entries in
*4882a593Smuzhiyun	  /sys/bus/event_source/devices/<pmu>/format/*
*4882a593Smuzhiyun
*4882a593Smuzhiyun	Note that the last two syntaxes support prefix and glob matching in
*4882a593Smuzhiyun	the PMU name to simplify creation of events across multiple instances
*4882a593Smuzhiyun	of the same type of PMU in large systems (e.g. memory controller PMUs).
*4882a593Smuzhiyun	Multiple PMU instances are typical for uncore PMUs, so the prefix
*4882a593Smuzhiyun	'uncore_' is also ignored when performing this match.
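As a rough illustration of the rNNN form, the sketch below builds a raw event string from an event select and a unit mask. It assumes the common x86 layout, with the unit mask in bits 8-15 and the event select in bits 0-7 of the descriptor; other architectures encode raw events differently. `raw_event` is a hypothetical helper, not part of perf.

```python
def raw_event(eventsel: int, umask: int = 0) -> str:
    """Return an rNNN string suitable for perf's -e option.

    Assumption: x86-style layout, umask in bits 8-15, eventsel in bits 0-7.
    """
    if not 0 <= eventsel <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("eventsel and umask must each fit in 8 bits")
    return f"r{(umask << 8) | eventsel:x}"


# Hypothetical values, chosen only to show the bit layout.
print(raw_event(0x2E, 0x41))  # r412e
print(raw_event(0x3C))        # r3c
```

The resulting string would be passed as, for example, `perf stat -e r412e ...`. Whether a given descriptor is meaningful depends on the CPU.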
*4882a593Smuzhiyun
*4882a593Smuzhiyun
-i::
--no-inherit::
        child tasks do not inherit counters

-p::
*4882a593Smuzhiyun--pid=<pid>::
*4882a593Smuzhiyun        stat events on existing process id (comma separated list)
*4882a593Smuzhiyun
*4882a593Smuzhiyun-t::
*4882a593Smuzhiyun--tid=<tid>::
*4882a593Smuzhiyun        stat events on existing thread id (comma separated list)
*4882a593Smuzhiyun
*4882a593Smuzhiyunifdef::HAVE_LIBPFM[]
*4882a593Smuzhiyun--pfm-events events::
*4882a593SmuzhiyunSelect a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
*4882a593Smuzhiyunincluding support for event filters. For example '--pfm-events
*4882a593Smuzhiyuninst_retired:any_p:u:c=1:i'. More than one event can be passed to the
*4882a593Smuzhiyunoption using the comma separator. Hardware events and generic hardware
*4882a593Smuzhiyunevents cannot be mixed together. The latter must be used with the -e
*4882a593Smuzhiyunoption. The -e option and this one can be mixed and matched.  Events
*4882a593Smuzhiyuncan be grouped using the {} notation.
*4882a593Smuzhiyunendif::HAVE_LIBPFM[]
*4882a593Smuzhiyun
*4882a593Smuzhiyun-a::
*4882a593Smuzhiyun--all-cpus::
*4882a593Smuzhiyun        system-wide collection from all CPUs (default if no target is specified)
*4882a593Smuzhiyun
*4882a593Smuzhiyun--no-scale::
*4882a593Smuzhiyun	Don't scale/normalize counter values
*4882a593Smuzhiyun
*4882a593Smuzhiyun-d::
*4882a593Smuzhiyun--detailed::
*4882a593Smuzhiyun	print more detailed statistics, can be specified up to 3 times
*4882a593Smuzhiyun
           -d:     detailed events, L1 and LLC data cache
        -d -d:     more detailed events, dTLB and iTLB events
     -d -d -d:     very detailed events, adding prefetch events
*4882a593Smuzhiyun
*4882a593Smuzhiyun-r::
*4882a593Smuzhiyun--repeat=<n>::
*4882a593Smuzhiyun	repeat command and print average + stddev (max: 100). 0 means forever.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-B::
*4882a593Smuzhiyun--big-num::
*4882a593Smuzhiyun        print large numbers with thousands' separators according to locale.
*4882a593Smuzhiyun	Enabled by default. Use "--no-big-num" to disable.
*4882a593Smuzhiyun	Default setting can be changed with "perf config stat.big-num=false".
*4882a593Smuzhiyun
*4882a593Smuzhiyun-C::
*4882a593Smuzhiyun--cpu=::
*4882a593SmuzhiyunCount only on the list of CPUs provided. Multiple CPUs can be provided as a
*4882a593Smuzhiyuncomma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
*4882a593SmuzhiyunIn per-thread mode, this option is ignored. The -a option is still necessary
*4882a593Smuzhiyunto activate system-wide monitoring. Default is to count on all CPUs.
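The list syntax accepted by -C can be sketched as follows. This is only an illustration of the comma and range notation described above; `parse_cpu_list` is a hypothetical helper, and the way perf itself parses and validates the list is not shown here.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a CPU list such as "0,1" or "0-2,5" into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            # A range "first-last" is inclusive on both ends.
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpu_list("0,1"))    # [0, 1]
print(parse_cpu_list("0-2"))    # [0, 1, 2]
print(parse_cpu_list("0-2,5"))  # [0, 1, 2, 5]
```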
*4882a593Smuzhiyun
*4882a593Smuzhiyun-A::
*4882a593Smuzhiyun--no-aggr::
*4882a593SmuzhiyunDo not aggregate counts across all monitored CPUs.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-n::
*4882a593Smuzhiyun--null::
*4882a593Smuzhiyun        null run - don't start any counters
*4882a593Smuzhiyun
*4882a593Smuzhiyun-v::
*4882a593Smuzhiyun--verbose::
*4882a593Smuzhiyun        be more verbose (show counter open errors, etc)
*4882a593Smuzhiyun
*4882a593Smuzhiyun-x SEP::
*4882a593Smuzhiyun--field-separator SEP::
*4882a593Smuzhiyunprint counts using a CSV-style output to make it easy to import directly into
*4882a593Smuzhiyunspreadsheets. Columns are separated by the string specified in SEP.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--table:: Display time for each run (-r option), in a table format, e.g.:
*4882a593Smuzhiyun
*4882a593Smuzhiyun  $ perf stat --null -r 5 --table perf bench sched pipe
*4882a593Smuzhiyun
*4882a593Smuzhiyun   Performance counter stats for 'perf bench sched pipe' (5 runs):
*4882a593Smuzhiyun
*4882a593Smuzhiyun             # Table of individual measurements:
*4882a593Smuzhiyun             5.189 (-0.293) #
*4882a593Smuzhiyun             5.189 (-0.294) #
*4882a593Smuzhiyun             5.186 (-0.296) #
*4882a593Smuzhiyun             5.663 (+0.181) ##
*4882a593Smuzhiyun             6.186 (+0.703) ####
*4882a593Smuzhiyun
*4882a593Smuzhiyun             # Final result:
*4882a593Smuzhiyun             5.483 +- 0.198 seconds time elapsed  ( +-  3.62% )
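The final line of the table can be reproduced from the individual measurements. The sketch below assumes that the value after "+-" is the standard error of the mean (sample standard deviation divided by the square root of the number of runs); the numbers in the example are consistent with that reading, but it is an inference from the output, not a statement about perf's implementation.

```python
import math
import statistics

# Elapsed times (seconds) copied from the table above.
runs = [5.189, 5.189, 5.186, 5.663, 6.186]

mean = statistics.mean(runs)
# Assumption: "+-" is the standard error of the mean,
# i.e. the sample standard deviation divided by sqrt(number of runs).
sem = statistics.stdev(runs) / math.sqrt(len(runs))

# Prints: 5.483 +- 0.198 seconds ( +- 3.62% )
print(f"{mean:.3f} +- {sem:.3f} seconds ( +- {100 * sem / mean:.2f}% )")
```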
*4882a593Smuzhiyun
*4882a593Smuzhiyun-G name::
*4882a593Smuzhiyun--cgroup name::
*4882a593Smuzhiyunmonitor only in the container (cgroup) called "name". This option is available only
*4882a593Smuzhiyunin per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
*4882a593Smuzhiyuncontainer "name" are monitored when they run on the monitored CPUs. Multiple cgroups
*4882a593Smuzhiyuncan be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
*4882a593Smuzhiyunto first event, second cgroup to second event and so on. It is possible to provide
*4882a593Smuzhiyunan empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have
*4882a593Smuzhiyuncorresponding events, i.e., they always refer to events defined earlier on the command
*4882a593Smuzhiyunline. If the user wants to track multiple events for a specific cgroup, the user can
*4882a593Smuzhiyunuse '-e e1 -e e2 -G foo,foo' or just use '-e e1 -e e2 -G foo'.
*4882a593Smuzhiyun
*4882a593SmuzhiyunIf wanting to monitor, say, 'cycles' for a cgroup and also for system wide, this
*4882a593Smuzhiyuncommand line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--for-each-cgroup name::
Expand event list for each cgroup in "name" (multiple cgroups can be given,
separated by commas).  This has the same effect as repeating the -e and -G
options for each event x name.  This option cannot be used with the
-G/--cgroup option.
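The "event x name" expansion is a plain cross product, sketched below. The function name, the cgroup names "A" and "B", and the cgroup-by-cgroup ordering are assumptions made for illustration; the order perf uses internally may differ.

```python
from itertools import product


def expand_for_each_cgroup(events, cgroups):
    """Pair every event with every cgroup (the "event x name" product).

    Assumption: pairs are listed cgroup by cgroup.
    """
    return [(event, cgroup) for cgroup, event in product(cgroups, events)]


# Hypothetical cgroup names "A" and "B".
pairs = expand_for_each_cgroup(["cycles", "instructions"], ["A", "B"])
for event, cgroup in pairs:
    print(event, cgroup)
```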
*4882a593Smuzhiyun
*4882a593Smuzhiyun-o file::
*4882a593Smuzhiyun--output file::
*4882a593SmuzhiyunPrint the output into the designated file.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--append::
*4882a593SmuzhiyunAppend to the output file designated with the -o option. Ignored if -o is not specified.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--log-fd::
*4882a593Smuzhiyun
*4882a593SmuzhiyunLog output to fd, instead of stderr.  Complementary to --output, and mutually exclusive
*4882a593Smuzhiyunwith it.  --append may be used here.  Examples:
*4882a593Smuzhiyun     3>results  perf stat --log-fd 3          -- $cmd
*4882a593Smuzhiyun     3>>results perf stat --log-fd 3 --append -- $cmd
*4882a593Smuzhiyun
*4882a593Smuzhiyun--control=fifo:ctl-fifo[,ack-fifo]::
*4882a593Smuzhiyun--control=fd:ctl-fd[,ack-fd]::
*4882a593Smuzhiyunctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
*4882a593SmuzhiyunListen on ctl-fd descriptor for command to control measurement ('enable': enable events,
*4882a593Smuzhiyun'disable': disable events). Measurements can be started with events disabled using
*4882a593Smuzhiyun--delay=-1 option. Optionally send control command completion ('ack\n') to ack-fd descriptor
*4882a593Smuzhiyunto synchronize with the controlling process. Example of bash shell script to enable and
*4882a593Smuzhiyundisable events during measurements:
*4882a593Smuzhiyun
*4882a593Smuzhiyun #!/bin/bash
*4882a593Smuzhiyun
*4882a593Smuzhiyun ctl_dir=/tmp/
*4882a593Smuzhiyun
*4882a593Smuzhiyun ctl_fifo=${ctl_dir}perf_ctl.fifo
*4882a593Smuzhiyun test -p ${ctl_fifo} && unlink ${ctl_fifo}
*4882a593Smuzhiyun mkfifo ${ctl_fifo}
*4882a593Smuzhiyun exec {ctl_fd}<>${ctl_fifo}
*4882a593Smuzhiyun
*4882a593Smuzhiyun ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
*4882a593Smuzhiyun test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
*4882a593Smuzhiyun mkfifo ${ctl_ack_fifo}
*4882a593Smuzhiyun exec {ctl_fd_ack}<>${ctl_ack_fifo}
*4882a593Smuzhiyun
*4882a593Smuzhiyun perf stat -D -1 -e cpu-cycles -a -I 1000       \
*4882a593Smuzhiyun           --control fd:${ctl_fd},${ctl_fd_ack} \
*4882a593Smuzhiyun           -- sleep 30 &
*4882a593Smuzhiyun perf_pid=$!
*4882a593Smuzhiyun
*4882a593Smuzhiyun sleep 5  && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
*4882a593Smuzhiyun sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"
*4882a593Smuzhiyun
*4882a593Smuzhiyun exec {ctl_fd_ack}>&-
*4882a593Smuzhiyun unlink ${ctl_ack_fifo}
*4882a593Smuzhiyun
*4882a593Smuzhiyun exec {ctl_fd}>&-
*4882a593Smuzhiyun unlink ${ctl_fifo}
*4882a593Smuzhiyun
*4882a593Smuzhiyun wait -n ${perf_pid}
*4882a593Smuzhiyun exit $?
*4882a593Smuzhiyun
*4882a593Smuzhiyun
*4882a593Smuzhiyun--pre::
*4882a593Smuzhiyun--post::
*4882a593Smuzhiyun	Pre and post measurement hooks, e.g.:
*4882a593Smuzhiyun
*4882a593Smuzhiyunperf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage
*4882a593Smuzhiyun
*4882a593Smuzhiyun-I msecs::
*4882a593Smuzhiyun--interval-print msecs::
*4882a593SmuzhiyunPrint count deltas every N milliseconds (minimum: 1ms)
*4882a593SmuzhiyunThe overhead percentage could be high in some cases, for instance with small, sub 100ms intervals.  Use with caution.
*4882a593Smuzhiyun	example: 'perf stat -I 1000 -e cycles -a sleep 5'
*4882a593Smuzhiyun
If a metric exists, it is calculated from the counts generated in this interval and printed after the #.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--interval-count times::
Print count deltas a fixed number of times.
*4882a593SmuzhiyunThis option should be used together with "-I" option.
*4882a593Smuzhiyun	example: 'perf stat -I 1000 --interval-count 2 -e cycles -a'
*4882a593Smuzhiyun
*4882a593Smuzhiyun--interval-clear::
*4882a593SmuzhiyunClear the screen before next interval.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--timeout msecs::
*4882a593SmuzhiyunStop the 'perf stat' session and print count deltas after N milliseconds (minimum: 10 ms).
*4882a593SmuzhiyunThis option is not supported with the "-I" option.
	example: 'perf stat --timeout 2000 -e cycles -a'
*4882a593Smuzhiyun
*4882a593Smuzhiyun--metric-only::
*4882a593SmuzhiyunOnly print computed metrics. Print them in a single line.
*4882a593SmuzhiyunDon't show any raw values. Not supported with --per-thread.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-socket::
*4882a593SmuzhiyunAggregate counts per processor socket for system-wide mode measurements.  This
*4882a593Smuzhiyunis a useful mode to detect imbalance between sockets.  To enable this mode,
use --per-socket in addition to -a (system-wide).  The output includes the
*4882a593Smuzhiyunsocket number and the number of online processors on that socket. This is
*4882a593Smuzhiyunuseful to gauge the amount of aggregation.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-die::
*4882a593SmuzhiyunAggregate counts per processor die for system-wide mode measurements.  This
*4882a593Smuzhiyunis a useful mode to detect imbalance between dies.  To enable this mode,
use --per-die in addition to -a (system-wide).  The output includes the
*4882a593Smuzhiyundie number and the number of online processors on that die. This is
*4882a593Smuzhiyunuseful to gauge the amount of aggregation.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-core::
*4882a593SmuzhiyunAggregate counts per physical processor for system-wide mode measurements.  This
*4882a593Smuzhiyunis a useful mode to detect imbalance between physical cores.  To enable this mode,
use --per-core in addition to -a (system-wide).  The output includes the
*4882a593Smuzhiyuncore number and the number of online logical processors on that physical processor.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-thread::
Aggregate counts per monitored thread, when monitoring threads (-t option)
*4882a593Smuzhiyunor processes (-p option).
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-node::
Aggregate counts per NUMA node for system-wide mode measurements. This
is a useful mode to detect imbalance between NUMA nodes. To enable this
mode, use --per-node in addition to -a (system-wide).
*4882a593Smuzhiyun
*4882a593Smuzhiyun-D msecs::
*4882a593Smuzhiyun--delay msecs::
*4882a593SmuzhiyunAfter starting the program, wait msecs before measuring (-1: start with events
*4882a593Smuzhiyundisabled). This is useful to filter out the startup phase of the program,
*4882a593Smuzhiyunwhich is often very different.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-T::
*4882a593Smuzhiyun--transaction::
*4882a593Smuzhiyun
*4882a593SmuzhiyunPrint statistics of transactional execution if supported.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--metric-no-group::
*4882a593SmuzhiyunBy default, events to compute a metric are placed in weak groups. The
*4882a593Smuzhiyungroup tries to enforce scheduling all or none of the events. The
*4882a593Smuzhiyun--metric-no-group option places events outside of groups and may
increase the chance of each event being scheduled, leading to more
accuracy. However, as the events may not be scheduled together, accuracy
for metrics like instructions per cycle can be lower, as both events
may no longer be measured at the same time.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--metric-no-merge::
*4882a593SmuzhiyunBy default metric events in different weak groups can be shared if one
*4882a593Smuzhiyungroup contains all the events needed by another. In such cases one
*4882a593Smuzhiyungroup will be eliminated reducing event multiplexing and making it so
*4882a593Smuzhiyunthat certain groups of metrics sum to 100%. A downside to sharing a
*4882a593Smuzhiyungroup is that the group may require multiplexing and so accuracy for a
*4882a593Smuzhiyunsmall group that need not have multiplexing is lowered. This option
*4882a593Smuzhiyunforbids the event merging logic from sharing events between groups and
*4882a593Smuzhiyunmay be used to increase accuracy in this case.
*4882a593Smuzhiyun
*4882a593SmuzhiyunSTAT RECORD
*4882a593Smuzhiyun-----------
*4882a593SmuzhiyunStores stat data into perf data file.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-o file::
*4882a593Smuzhiyun--output file::
*4882a593SmuzhiyunOutput file name.
*4882a593Smuzhiyun
*4882a593SmuzhiyunSTAT REPORT
*4882a593Smuzhiyun-----------
*4882a593SmuzhiyunReads and reports stat data from perf data file.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-i file::
*4882a593Smuzhiyun--input file::
*4882a593SmuzhiyunInput file name.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-socket::
*4882a593SmuzhiyunAggregate counts per processor socket for system-wide mode measurements.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-die::
*4882a593SmuzhiyunAggregate counts per processor die for system-wide mode measurements.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--per-core::
*4882a593SmuzhiyunAggregate counts per physical processor for system-wide mode measurements.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-M::
*4882a593Smuzhiyun--metrics::
*4882a593SmuzhiyunPrint metrics or metricgroups specified in a comma separated list.
*4882a593SmuzhiyunFor a group all metrics from the group are added.
*4882a593SmuzhiyunThe events from the metrics are automatically measured.
See perf list output for the possible metrics and metricgroups.
*4882a593Smuzhiyun
*4882a593Smuzhiyun-A::
*4882a593Smuzhiyun--no-aggr::
*4882a593SmuzhiyunDo not aggregate counts across all monitored CPUs.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--topdown::
Print top down level 1 metrics if supported by the CPU. This makes it
possible to identify bottlenecks in the CPU pipeline for CPU bound workloads,
by breaking the cycles consumed down into frontend bound, backend bound,
bad speculation and retiring.
*4882a593Smuzhiyun
Frontend bound means that the CPU cannot fetch and decode instructions fast
enough. Backend bound means that computation or memory access is the
bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
mispredictions and similar issues. Retiring means that the CPU computed without
an apparent bottleneck. The bottleneck is only the real bottleneck
if the workload is actually bound by the CPU and not by something else.
*4882a593Smuzhiyun
*4882a593SmuzhiyunFor best results it is usually a good idea to use it with interval
*4882a593Smuzhiyunmode like -I 1000, as the bottleneck of workloads can change often.
*4882a593Smuzhiyun
*4882a593SmuzhiyunThis enables --metric-only, unless overridden with --no-metric-only.
*4882a593Smuzhiyun
The following restrictions only apply to older Intel CPUs and Atom;
on newer CPUs (IceLake and later) TopDown can be collected for any thread:

The top down metrics are collected per core instead of per
CPU thread. Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
kernel.perf_event_paranoid=-1.
*4882a593Smuzhiyun
*4882a593SmuzhiyunTopdown uses the full Performance Monitoring Unit, and needs
*4882a593Smuzhiyundisabling of the NMI watchdog (as root):
*4882a593Smuzhiyunecho 0 > /proc/sys/kernel/nmi_watchdog
*4882a593Smuzhiyunfor best results. Otherwise the bottlenecks may be inconsistent
on workloads with changing phases.

To interpret the results it is usually necessary to know which
CPUs the workload runs on. If needed the CPUs can be forced using
taskset.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--no-merge::
*4882a593SmuzhiyunDo not merge results from same PMUs.
*4882a593Smuzhiyun
*4882a593SmuzhiyunWhen multiple events are created from a single event specification,
*4882a593Smuzhiyunstat will, by default, aggregate the event counts and show the result
*4882a593Smuzhiyunin a single row. This option disables that behavior and shows
*4882a593Smuzhiyunthe individual events and counts.
*4882a593Smuzhiyun
*4882a593SmuzhiyunMultiple events are created from a single event specification when:
*4882a593Smuzhiyun1. Prefix or glob matching is used for the PMU name.
*4882a593Smuzhiyun2. Aliases, which are listed immediately after the Kernel PMU events
*4882a593Smuzhiyun   by perf list, are used.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--smi-cost::
*4882a593SmuzhiyunMeasure SMI cost if msr/aperf/ and msr/smi/ events are supported.
*4882a593Smuzhiyun
During the measurement, /sys/devices/cpu/freeze_on_smi will be set to
freeze core counters on SMI.
The aperf counter will not be affected by the setting.
The cost of SMI can be measured by (aperf - unhalted core cycles).

In practice, the percentage of SMI cycles is very useful for performance
oriented analysis. --metric-only will be applied by default.
The output is SMI cycles%, equal to (aperf - unhalted core cycles) / aperf.

Users who want to get the actual value can apply --no-metric-only.
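The SMI cycles% formula above can be written out as a small calculation. The function name and the counter values are hypothetical; only the formula itself comes from the text.

```python
def smi_cycles_percent(aperf: int, unhalted_core_cycles: int) -> float:
    """Return SMI cycles% as (aperf - unhalted core cycles) / aperf * 100."""
    if aperf <= 0:
        raise ValueError("aperf must be positive")
    return 100.0 * (aperf - unhalted_core_cycles) / aperf


# Hypothetical counter values, not taken from a real measurement.
print(f"{smi_cycles_percent(1_000_000, 990_000):.1f}%")  # 1.0%
```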
*4882a593Smuzhiyun
*4882a593Smuzhiyun--all-kernel::
*4882a593SmuzhiyunConfigure all used events to run in kernel space.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--all-user::
*4882a593SmuzhiyunConfigure all used events to run in user space.
*4882a593Smuzhiyun
*4882a593Smuzhiyun--percore-show-thread::
The event modifier "percore" sums up the event counts
for all hardware threads in a core and shows the counts per core.

This option, with the event modifier "percore" enabled, also sums up the event
counts for all hardware threads in a core but shows the summed counts per
hardware thread. This is essentially a replacement for the any bit and is
convenient for post processing.
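The difference between the two presentations can be sketched with a toy data set. The CPU-to-core mapping and the counts below are hypothetical, and the functions only model the aggregation described above, not perf's own code.

```python
from collections import defaultdict

# Hypothetical topology and counts: CPU number -> core id, CPU number -> count.
cpu_to_core = {0: 0, 1: 1, 2: 0, 3: 1}
counts = {0: 100, 1: 250, 2: 40, 3: 10}


def per_core(counts, cpu_to_core):
    """Sum the counts of all hardware threads belonging to each core."""
    totals = defaultdict(int)
    for cpu, value in counts.items():
        totals[cpu_to_core[cpu]] += value
    return dict(totals)


def per_core_show_thread(counts, cpu_to_core):
    """Report the per-core sum once for every hardware thread of that core."""
    totals = per_core(counts, cpu_to_core)
    return {cpu: totals[cpu_to_core[cpu]] for cpu in counts}


print(per_core(counts, cpu_to_core))              # {0: 140, 1: 260}
print(per_core_show_thread(counts, cpu_to_core))  # {0: 140, 1: 260, 2: 140, 3: 260}
```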
*4882a593Smuzhiyun
*4882a593Smuzhiyun--summary::
*4882a593SmuzhiyunPrint summary for interval mode (-I).
*4882a593Smuzhiyun
*4882a593SmuzhiyunEXAMPLES
*4882a593Smuzhiyun--------
*4882a593Smuzhiyun
*4882a593Smuzhiyun$ perf stat -- make
*4882a593Smuzhiyun
*4882a593Smuzhiyun   Performance counter stats for 'make':
*4882a593Smuzhiyun
*4882a593Smuzhiyun        83723.452481      task-clock:u (msec)       #    1.004 CPUs utilized
*4882a593Smuzhiyun                   0      context-switches:u        #    0.000 K/sec
*4882a593Smuzhiyun                   0      cpu-migrations:u          #    0.000 K/sec
*4882a593Smuzhiyun           3,228,188      page-faults:u             #    0.039 M/sec
*4882a593Smuzhiyun     229,570,665,834      cycles:u                  #    2.742 GHz
*4882a593Smuzhiyun     313,163,853,778      instructions:u            #    1.36  insn per cycle
*4882a593Smuzhiyun      69,704,684,856      branches:u                #  832.559 M/sec
*4882a593Smuzhiyun       2,078,861,393      branch-misses:u           #    2.98% of all branches
*4882a593Smuzhiyun
*4882a593Smuzhiyun        83.409183620 seconds time elapsed
*4882a593Smuzhiyun
*4882a593Smuzhiyun        74.684747000 seconds user
*4882a593Smuzhiyun         8.739217000 seconds sys
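The values printed after "#" in the example can be recomputed from the raw counters. The formulas below are assumptions inferred from the output (they reproduce the printed figures), not a description of perf's implementation.

```python
# Counter values copied from the example output above.
task_clock_msec = 83723.452481
cycles = 229_570_665_834
instructions = 313_163_853_778
branches = 69_704_684_856
branch_misses = 2_078_861_393
elapsed_sec = 83.409183620

task_clock_sec = task_clock_msec / 1000.0

cpus_utilized = task_clock_sec / elapsed_sec         # task-clock vs. wall clock
ghz = cycles / task_clock_sec / 1e9                  # cycles per second of task-clock
insn_per_cycle = instructions / cycles
branch_miss_pct = 100.0 * branch_misses / branches

print(f"{cpus_utilized:.3f} CPUs utilized")          # 1.004 CPUs utilized
print(f"{ghz:.3f} GHz")                              # 2.742 GHz
print(f"{insn_per_cycle:.2f} insn per cycle")        # 1.36 insn per cycle
print(f"{branch_miss_pct:.2f}% of all branches")     # 2.98% of all branches
```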
*4882a593Smuzhiyun
*4882a593SmuzhiyunTIMINGS
*4882a593Smuzhiyun-------
*4882a593SmuzhiyunAs displayed in the example above we can display 3 types of timings.
*4882a593SmuzhiyunWe always display the time the counters were enabled/alive:
*4882a593Smuzhiyun
*4882a593Smuzhiyun        83.409183620 seconds time elapsed
*4882a593Smuzhiyun
*4882a593SmuzhiyunFor workload sessions we also display time the workloads spent in
*4882a593Smuzhiyunuser/system lands:
*4882a593Smuzhiyun
*4882a593Smuzhiyun        74.684747000 seconds user
*4882a593Smuzhiyun         8.739217000 seconds sys
*4882a593Smuzhiyun
*4882a593SmuzhiyunThose times are the very same as displayed by the 'time' tool.
*4882a593Smuzhiyun
*4882a593SmuzhiyunCSV FORMAT
*4882a593Smuzhiyun----------
*4882a593Smuzhiyun
With -x, perf stat can produce output in a not-quite-CSV format.
Commas in the output are not put into "". To make it easy to parse
it is recommended to use a different character like -x \;
*4882a593Smuzhiyun
*4882a593SmuzhiyunThe fields are in this order:
*4882a593Smuzhiyun
	- optional usec time stamp in fractions of a second (with -I xxx)
*4882a593Smuzhiyun	- optional CPU, core, or socket identifier
*4882a593Smuzhiyun	- optional number of logical CPUs aggregated
*4882a593Smuzhiyun	- counter value
*4882a593Smuzhiyun	- unit of the counter value or empty
*4882a593Smuzhiyun	- event name
*4882a593Smuzhiyun	- run time of counter
*4882a593Smuzhiyun	- percentage of measurement time the counter was running
*4882a593Smuzhiyun	- optional variance if multiple values are collected with -r
*4882a593Smuzhiyun	- optional metric value
*4882a593Smuzhiyun	- optional unit of metric
*4882a593Smuzhiyun
*4882a593SmuzhiyunAdditional metrics may be printed with all earlier fields being empty.
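A minimal parser for the field order above might look like the following. It assumes the simplest case: aggregated output without -I, per-CPU or per-socket identifiers, or -r, so the optional time stamp, identifier, CPU-count and variance fields are absent. The sample line is hypothetical, assembled from the 'make' example values for illustration only, and `parse_stat_line` is not part of perf.

```python
def parse_stat_line(line: str, sep: str = ";") -> dict:
    """Split one 'perf stat -x' line into named fields.

    Assumption: no -I, no -A/--per-* aggregation and no -r, so only the
    non-optional fields plus the trailing metric fields are present.
    """
    names = ["value", "unit", "event", "run_time", "pct_running",
             "metric_value", "metric_unit"]
    fields = line.rstrip("\n").split(sep)
    # Trailing optional fields may be missing; zip() stops at the shorter list.
    return dict(zip(names, fields))


# Hypothetical line using ';' as the separator (perf stat -x \;).
sample = "229570665834;;cycles:u;83723452481;100.00;2.742;GHz"
record = parse_stat_line(sample)
print(record["event"], int(record["value"]), record["metric_value"], record["metric_unit"])
```

Values are kept as strings here because a counter field is not guaranteed to be numeric in every situation; callers should convert defensively.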
*4882a593Smuzhiyun
*4882a593SmuzhiyunSEE ALSO
*4882a593Smuzhiyun--------
*4882a593Smuzhiyunlinkperf:perf-top[1], linkperf:perf-list[1]