perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On x86, the tool is based on load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
with the thresholding feature.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses (loads that hit a modified cacheline held by another cache).

The basic workflow with this tool follows the standard record/report phases.
Use the record command to record event data and the report command to
display it.


RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf c2c record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency. (x86 only)

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in report (see NODE INFO section)

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE)

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT)

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display Source:Line column.

--show-all::
	Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Select the HITM type (rmt, lcl) to display and sort on. Total HITMs
	is the default.

--stitch-lbr::
	Show callgraph with stitched LBRs, which may have a more complete
	callgraph. The perf.data file must have been obtained using
	perf c2c record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling such as
	setjmp/longjmp, where calls and returns do not match.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on x86:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

You can pass any 'perf record' option after the '--' mark, for example to
enable callchains and system-wide monitoring:

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.
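
As a rough illustration of the defaults listed above, the following Python
sketch assembles a 'perf record' command line similar to what 'perf c2c record'
sets up on x86. The helper name `c2c_record_argv` is hypothetical, and the exact
argument order used by perf itself is an assumption; only the options and event
strings are taken from this page.

```python
def c2c_record_argv(extra_record_opts=(), command=(), ldlat=30):
    """Build an approximate 'perf record' argv for x86 (illustrative only)."""
    argv = [
        "perf", "record",
        # perf record options configured by default
        "-W", "-d", "--phys-data", "--sample-cpu",
        # default x86 events, used unless '-e' is given
        "-e", f"cpu/mem-loads,ldlat={ldlat}/P",
        "-e", "cpu/mem-stores/P",
    ]
    argv += list(extra_record_opts)  # options passed after the '--' mark
    argv += list(command)            # workload to profile, if any
    return argv


# Roughly what 'perf c2c record -- -g -a' asks perf record to do
print(" ".join(c2c_record_argv(extra_record_opts=["-g", "-a"])))
```

The sketch only shows how the default options, the default events and the
user-supplied 'perf record' options combine; it does not run perf.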

C2C REPORT
----------
The perf c2c report command displays shared data analysis.  It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data
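
The four steps above can be sketched in a few lines of Python. This is a
simplified model, not the perf implementation: the sample layout (`addr`,
`kind`), the function name `build_report` and the 64-byte cacheline size are
assumptions made for the example.

```python
from collections import defaultdict

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def build_report(samples):
    """samples: iterable of dicts with 'addr' (int) and 'kind' (str) keys."""
    lines = defaultdict(lambda: {"records": 0, "lcl_hitm": 0, "rmt_hitm": 0,
                                 "offsets": defaultdict(int)})
    for s in samples:
        # 1) bucket the sample by its cacheline address
        cl_addr = s["addr"] & ~(CACHELINE_SIZE - 1)
        cl = lines[cl_addr]
        # 2) store access details for that cacheline
        cl["records"] += 1
        cl["offsets"][s["addr"] - cl_addr] += 1
        if s["kind"] in ("lcl_hitm", "rmt_hitm"):
            cl[s["kind"]] += 1
    # 3) sort cachelines; here by total HITMs, highest first
    return sorted(lines.items(),
                  key=lambda kv: kv[1]["lcl_hitm"] + kv[1]["rmt_hitm"],
                  reverse=True)


samples = [
    {"addr": 0x1008, "kind": "rmt_hitm"},
    {"addr": 0x1030, "kind": "lcl_hitm"},
    {"addr": 0x2000, "kind": "load"},
    {"addr": 0x2004, "kind": "rmt_hitm"},
]

# 4) display data: index, cacheline address, total HITMs, records, offsets
for index, (addr, cl) in enumerate(build_report(samples)):
    total_hitm = cl["lcl_hitm"] + cl["rmt_hitm"]
    print(index, hex(addr), total_hitm, cl["records"], dict(cl["offsets"]))
```

Masking the low bits of the data address gives the cacheline address, and the
remainder is the offset within the line, which is what the two report views
are keyed on.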

In general, perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list the following data is displayed:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm
  - cacheline percentage of all Remote/Local HITM accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm
  - count of Total/Local/Remote load HITMs

  Total records
  - sum of all accesses to the cacheline

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses
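
To make the 'Rmt/Lcl Hitm' columns concrete, the following sketch computes each
cacheline's share of all remote and all local HITM accesses from per-cacheline
counts. The function name `hitm_share` and the input layout are hypothetical;
the calculation follows the field description above and is not copied from
the perf sources.

```python
def hitm_share(cachelines):
    """Return {address: (rmt_pct, lcl_pct)} from per-cacheline HITM counts."""
    total_rmt = sum(c["rmt_hitm"] for c in cachelines.values())
    total_lcl = sum(c["lcl_hitm"] for c in cachelines.values())

    def pct(part, whole):
        # avoid division by zero when no HITM of that type was sampled
        return 100.0 * part / whole if whole else 0.0

    return {addr: (pct(c["rmt_hitm"], total_rmt), pct(c["lcl_hitm"], total_lcl))
            for addr, c in cachelines.items()}


counts = {
    0x1000: {"rmt_hitm": 30, "lcl_hitm": 5},
    0x2000: {"rmt_hitm": 10, "lcl_hitm": 15},
}
for addr, (rmt, lcl) in hitm_share(counts).items():
    print(f"{addr:#x}  rmt {rmt:.2f}%  lcl {lcl:.2f}%")
```

With these made-up counts, the first cacheline accounts for three quarters of
the remote HITMs, which is the kind of line the report is meant to surface.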

For each offset in the 2) list the following data is displayed:

  HITM - Rmt, Lcl
  - % of Remote/Local HITM accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss
  - % of store accesses that hit/missed L1 for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cpu cnt
  - number of cpus that participated in the accesses

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the accesses (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with list of affected CPUs in the following format:
      Node{cpu list}

You can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
You can specify how to sort the offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.
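
Coalescing can be pictured as grouping the accesses to one cacheline offset by
the selected fields. The sketch below is a simplified model under assumed
names: the `coalesce` function, the entry layout and the per-group counters are
hypothetical and only illustrate how the choice of fields changes the number
of output rows.

```python
from collections import defaultdict


def coalesce(entries, fields=("pid", "iaddr")):
    """Merge per-access entries that share the same values for 'fields'."""
    groups = defaultdict(lambda: {"accesses": 0, "cpus": set()})
    for e in entries:
        key = tuple(e[f] for f in fields)
        groups[key]["accesses"] += 1
        groups[key]["cpus"].add(e["cpu"])  # distinct CPUs, as in 'cpu cnt'
    return dict(groups)


entries = [
    {"pid": 100, "tid": 101, "iaddr": 0x400a10, "dso": "app", "cpu": 0},
    {"pid": 100, "tid": 102, "iaddr": 0x400a10, "dso": "app", "cpu": 4},
    {"pid": 100, "tid": 102, "iaddr": 0x400b20, "dso": "app", "cpu": 4},
]

by_default = coalesce(entries)                 # default 'pid,iaddr'
by_tid = coalesce(entries, ("tid", "iaddr"))   # as if '-c tid,iaddr' was given
print(len(by_default), len(by_tid))
```

Grouping by 'pid,iaddr' folds the two threads that execute the same code
address into one row, while 'tid,iaddr' keeps them apart.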

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]