xref: /OK3568_Linux_fs/kernel/Documentation/trace/tracepoint-analysis.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun=========================================================
2*4882a593SmuzhiyunNotes on Analysing Behaviour Using Events and Tracepoints
3*4882a593Smuzhiyun=========================================================
4*4882a593Smuzhiyun:Author: Mel Gorman (PCL information heavily based on email from Ingo Molnar)
5*4882a593Smuzhiyun
6*4882a593Smuzhiyun1. Introduction
7*4882a593Smuzhiyun===============
8*4882a593Smuzhiyun
9*4882a593SmuzhiyunTracepoints (see Documentation/trace/tracepoints.rst) can be used without
10*4882a593Smuzhiyuncreating custom kernel modules to register probe functions using the event
11*4882a593Smuzhiyuntracing infrastructure.
12*4882a593Smuzhiyun
13*4882a593SmuzhiyunSimplistically, tracepoints represent important events that can be
14*4882a593Smuzhiyuntaken in conjunction with other tracepoints to build a "Big Picture" of
15*4882a593Smuzhiyunwhat is going on within the system. There are a large number of methods for
16*4882a593Smuzhiyungathering and interpreting these events. Lacking any current Best Practises,
17*4882a593Smuzhiyunthis document describes some of the methods that can be used.
18*4882a593Smuzhiyun
19*4882a593SmuzhiyunThis document assumes that debugfs is mounted on /sys/kernel/debug and that
20*4882a593Smuzhiyunthe appropriate tracing options have been configured into the kernel. It is
21*4882a593Smuzhiyunassumed that the PCL tool tools/perf has been installed and is in your path.
22*4882a593Smuzhiyun
23*4882a593Smuzhiyun2. Listing Available Events
24*4882a593Smuzhiyun===========================
25*4882a593Smuzhiyun
26*4882a593Smuzhiyun2.1 Standard Utilities
27*4882a593Smuzhiyun----------------------
28*4882a593Smuzhiyun
29*4882a593SmuzhiyunAll possible events are visible from /sys/kernel/debug/tracing/events. Simply
30*4882a593Smuzhiyuncalling::
31*4882a593Smuzhiyun
32*4882a593Smuzhiyun  $ find /sys/kernel/debug/tracing/events -type d
33*4882a593Smuzhiyun
34*4882a593Smuzhiyunwill give a fair indication of the number of events available.
35*4882a593Smuzhiyun
36*4882a593Smuzhiyun2.2 PCL (Performance Counters for Linux)
37*4882a593Smuzhiyun----------------------------------------
38*4882a593Smuzhiyun
39*4882a593SmuzhiyunDiscovery and enumeration of all counters and events, including tracepoints,
40*4882a593Smuzhiyunare available with the perf tool. Getting a list of available events is a
41*4882a593Smuzhiyunsimple case of::
42*4882a593Smuzhiyun
43*4882a593Smuzhiyun  $ perf list 2>&1 | grep Tracepoint
44*4882a593Smuzhiyun  ext4:ext4_free_inode                     [Tracepoint event]
45*4882a593Smuzhiyun  ext4:ext4_request_inode                  [Tracepoint event]
46*4882a593Smuzhiyun  ext4:ext4_allocate_inode                 [Tracepoint event]
47*4882a593Smuzhiyun  ext4:ext4_write_begin                    [Tracepoint event]
48*4882a593Smuzhiyun  ext4:ext4_ordered_write_end              [Tracepoint event]
49*4882a593Smuzhiyun  [ .... remaining output snipped .... ]
50*4882a593Smuzhiyun
51*4882a593Smuzhiyun
52*4882a593Smuzhiyun3. Enabling Events
53*4882a593Smuzhiyun==================
54*4882a593Smuzhiyun
55*4882a593Smuzhiyun3.1 System-Wide Event Enabling
56*4882a593Smuzhiyun------------------------------
57*4882a593Smuzhiyun
58*4882a593SmuzhiyunSee Documentation/trace/events.rst for a proper description on how events
59*4882a593Smuzhiyuncan be enabled system-wide. A short example of enabling all events related
60*4882a593Smuzhiyunto page allocation would look something like::
61*4882a593Smuzhiyun
62*4882a593Smuzhiyun  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
63*4882a593Smuzhiyun
64*4882a593Smuzhiyun3.2 System-Wide Event Enabling with SystemTap
65*4882a593Smuzhiyun---------------------------------------------
66*4882a593Smuzhiyun
67*4882a593SmuzhiyunIn SystemTap, tracepoints are accessible using the kernel.trace() function
68*4882a593Smuzhiyuncall. The following is an example that reports every 5 seconds what processes
69*4882a593Smuzhiyunwere allocating the pages.
70*4882a593Smuzhiyun::
71*4882a593Smuzhiyun
72*4882a593Smuzhiyun  global page_allocs
73*4882a593Smuzhiyun
74*4882a593Smuzhiyun  probe kernel.trace("mm_page_alloc") {
75*4882a593Smuzhiyun  	page_allocs[execname()]++
76*4882a593Smuzhiyun  }
77*4882a593Smuzhiyun
78*4882a593Smuzhiyun  function print_count() {
79*4882a593Smuzhiyun  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
80*4882a593Smuzhiyun  	foreach (proc in page_allocs-)
81*4882a593Smuzhiyun  		printf("%-25d %s\n", page_allocs[proc], proc)
82*4882a593Smuzhiyun  	printf ("\n")
83*4882a593Smuzhiyun  	delete page_allocs
84*4882a593Smuzhiyun  }
85*4882a593Smuzhiyun
86*4882a593Smuzhiyun  probe timer.s(5) {
87*4882a593Smuzhiyun          print_count()
88*4882a593Smuzhiyun  }
89*4882a593Smuzhiyun
90*4882a593Smuzhiyun3.3 System-Wide Event Enabling with PCL
91*4882a593Smuzhiyun---------------------------------------
92*4882a593Smuzhiyun
93*4882a593SmuzhiyunBy specifying the -a switch and analysing sleep, the system-wide events
94*4882a593Smuzhiyunfor a duration of time can be examined.
95*4882a593Smuzhiyun::
96*4882a593Smuzhiyun
97*4882a593Smuzhiyun $ perf stat -a \
98*4882a593Smuzhiyun	-e kmem:mm_page_alloc -e kmem:mm_page_free \
99*4882a593Smuzhiyun	-e kmem:mm_page_free_batched \
100*4882a593Smuzhiyun	sleep 10
101*4882a593Smuzhiyun Performance counter stats for 'sleep 10':
102*4882a593Smuzhiyun
103*4882a593Smuzhiyun           9630  kmem:mm_page_alloc
104*4882a593Smuzhiyun           2143  kmem:mm_page_free
105*4882a593Smuzhiyun           7424  kmem:mm_page_free_batched
106*4882a593Smuzhiyun
107*4882a593Smuzhiyun   10.002577764  seconds time elapsed
108*4882a593Smuzhiyun
109*4882a593SmuzhiyunSimilarly, one could execute a shell and exit it as desired to get a report
110*4882a593Smuzhiyunat that point.
111*4882a593Smuzhiyun
112*4882a593Smuzhiyun3.4 Local Event Enabling
113*4882a593Smuzhiyun------------------------
114*4882a593Smuzhiyun
115*4882a593SmuzhiyunDocumentation/trace/ftrace.rst describes how to enable events on a per-thread
116*4882a593Smuzhiyunbasis using set_ftrace_pid.
117*4882a593Smuzhiyun
118*4882a593Smuzhiyun3.5 Local Event Enablement with PCL
119*4882a593Smuzhiyun-----------------------------------
120*4882a593Smuzhiyun
121*4882a593SmuzhiyunEvents can be activated and tracked for the duration of a process on a local
122*4882a593Smuzhiyunbasis using PCL such as follows.
123*4882a593Smuzhiyun::
124*4882a593Smuzhiyun
125*4882a593Smuzhiyun  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
126*4882a593Smuzhiyun		 -e kmem:mm_page_free_batched ./hackbench 10
127*4882a593Smuzhiyun  Time: 0.909
128*4882a593Smuzhiyun
129*4882a593Smuzhiyun    Performance counter stats for './hackbench 10':
130*4882a593Smuzhiyun
131*4882a593Smuzhiyun          17803  kmem:mm_page_alloc
132*4882a593Smuzhiyun          12398  kmem:mm_page_free
133*4882a593Smuzhiyun           4827  kmem:mm_page_free_batched
134*4882a593Smuzhiyun
135*4882a593Smuzhiyun    0.973913387  seconds time elapsed
136*4882a593Smuzhiyun
137*4882a593Smuzhiyun4. Event Filtering
138*4882a593Smuzhiyun==================
139*4882a593Smuzhiyun
140*4882a593SmuzhiyunDocumentation/trace/ftrace.rst covers in-depth how to filter events in
141*4882a593Smuzhiyunftrace.  Obviously using grep and awk of trace_pipe is an option as well
142*4882a593Smuzhiyunas any script reading trace_pipe.
143*4882a593Smuzhiyun
144*4882a593Smuzhiyun5. Analysing Event Variances with PCL
145*4882a593Smuzhiyun=====================================
146*4882a593Smuzhiyun
147*4882a593SmuzhiyunAny workload can exhibit variances between runs and it can be important
148*4882a593Smuzhiyunto know what the standard deviation is. By and large, this is left to the
149*4882a593Smuzhiyunperformance analyst to do it by hand. In the event that the discrete event
150*4882a593Smuzhiyunoccurrences are useful to the performance analyst, then perf can be used.
151*4882a593Smuzhiyun::
152*4882a593Smuzhiyun
153*4882a593Smuzhiyun  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free
154*4882a593Smuzhiyun			-e kmem:mm_page_free_batched ./hackbench 10
155*4882a593Smuzhiyun  Time: 0.890
156*4882a593Smuzhiyun  Time: 0.895
157*4882a593Smuzhiyun  Time: 0.915
158*4882a593Smuzhiyun  Time: 1.001
159*4882a593Smuzhiyun  Time: 0.899
160*4882a593Smuzhiyun
161*4882a593Smuzhiyun   Performance counter stats for './hackbench 10' (5 runs):
162*4882a593Smuzhiyun
163*4882a593Smuzhiyun          16630  kmem:mm_page_alloc         ( +-   3.542% )
164*4882a593Smuzhiyun          11486  kmem:mm_page_free	    ( +-   4.771% )
165*4882a593Smuzhiyun           4730  kmem:mm_page_free_batched  ( +-   2.325% )
166*4882a593Smuzhiyun
167*4882a593Smuzhiyun    0.982653002  seconds time elapsed   ( +-   1.448% )
168*4882a593Smuzhiyun
169*4882a593SmuzhiyunIn the event that some higher-level event is required that depends on some
170*4882a593Smuzhiyunaggregation of discrete events, then a script would need to be developed.
171*4882a593Smuzhiyun
172*4882a593SmuzhiyunUsing --repeat, it is also possible to view how events are fluctuating over
173*4882a593Smuzhiyuntime on a system-wide basis using -a and sleep.
174*4882a593Smuzhiyun::
175*4882a593Smuzhiyun
176*4882a593Smuzhiyun  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
177*4882a593Smuzhiyun		-e kmem:mm_page_free_batched \
178*4882a593Smuzhiyun		-a --repeat 10 \
179*4882a593Smuzhiyun		sleep 1
180*4882a593Smuzhiyun  Performance counter stats for 'sleep 1' (10 runs):
181*4882a593Smuzhiyun
182*4882a593Smuzhiyun           1066  kmem:mm_page_alloc         ( +-  26.148% )
183*4882a593Smuzhiyun            182  kmem:mm_page_free          ( +-   5.464% )
184*4882a593Smuzhiyun            890  kmem:mm_page_free_batched  ( +-  30.079% )
185*4882a593Smuzhiyun
186*4882a593Smuzhiyun    1.002251757  seconds time elapsed   ( +-   0.005% )
187*4882a593Smuzhiyun
188*4882a593Smuzhiyun6. Higher-Level Analysis with Helper Scripts
189*4882a593Smuzhiyun============================================
190*4882a593Smuzhiyun
191*4882a593SmuzhiyunWhen events are enabled the events that are triggering can be read from
192*4882a593Smuzhiyun/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
193*4882a593Smuzhiyunoptions exist as well. By post-processing the output, further information can
194*4882a593Smuzhiyunbe gathered on-line as appropriate. Examples of post-processing might include
195*4882a593Smuzhiyun
196*4882a593Smuzhiyun  - Reading information from /proc for the PID that triggered the event
197*4882a593Smuzhiyun  - Deriving a higher-level event from a series of lower-level events.
198*4882a593Smuzhiyun  - Calculating latencies between two events
199*4882a593Smuzhiyun
200*4882a593SmuzhiyunDocumentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
201*4882a593Smuzhiyunscript that can read trace_pipe from STDIN or a copy of a trace. When used
202*4882a593Smuzhiyunon-line, it can be interrupted once to generate a report without exiting
203*4882a593Smuzhiyunand twice to exit.
204*4882a593Smuzhiyun
205*4882a593SmuzhiyunSimplistically, the script just reads STDIN and counts up events but it
206*4882a593Smuzhiyunalso can do more such as
207*4882a593Smuzhiyun
208*4882a593Smuzhiyun  - Derive high-level events from many low-level events. If a number of pages
209*4882a593Smuzhiyun    are freed to the main allocator from the per-CPU lists, it recognises
210*4882a593Smuzhiyun    that as one per-CPU drain even though there is no specific tracepoint
211*4882a593Smuzhiyun    for that event
212*4882a593Smuzhiyun  - It can aggregate based on PID or individual process number
213*4882a593Smuzhiyun  - In the event memory is getting externally fragmented, it reports
214*4882a593Smuzhiyun    on whether the fragmentation event was severe or moderate.
215*4882a593Smuzhiyun  - When receiving an event about a PID, it can record who the parent was so
216*4882a593Smuzhiyun    that if large numbers of events are coming from very short-lived
217*4882a593Smuzhiyun    processes, the parent process responsible for creating all the helpers
218*4882a593Smuzhiyun    can be identified
219*4882a593Smuzhiyun
220*4882a593Smuzhiyun7. Lower-Level Analysis with PCL
221*4882a593Smuzhiyun================================
222*4882a593Smuzhiyun
223*4882a593SmuzhiyunThere may also be a requirement to identify what functions within a program
224*4882a593Smuzhiyunwere generating events within the kernel. To begin this sort of analysis, the
225*4882a593Smuzhiyundata must be recorded. At the time of writing, this required root:
226*4882a593Smuzhiyun::
227*4882a593Smuzhiyun
228*4882a593Smuzhiyun  $ perf record -c 1 \
229*4882a593Smuzhiyun	-e kmem:mm_page_alloc -e kmem:mm_page_free \
230*4882a593Smuzhiyun	-e kmem:mm_page_free_batched \
231*4882a593Smuzhiyun	./hackbench 10
232*4882a593Smuzhiyun  Time: 0.894
233*4882a593Smuzhiyun  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
234*4882a593Smuzhiyun
235*4882a593SmuzhiyunNote the use of '-c 1' to set the event period to sample. The default sample
236*4882a593Smuzhiyunperiod is quite high to minimise overhead but the information collected can be
237*4882a593Smuzhiyunvery coarse as a result.
238*4882a593Smuzhiyun
239*4882a593SmuzhiyunThis record outputted a file called perf.data which can be analysed using
240*4882a593Smuzhiyunperf report.
241*4882a593Smuzhiyun::
242*4882a593Smuzhiyun
243*4882a593Smuzhiyun  $ perf report
244*4882a593Smuzhiyun  # Samples: 30922
245*4882a593Smuzhiyun  #
246*4882a593Smuzhiyun  # Overhead    Command                     Shared Object
247*4882a593Smuzhiyun  # ........  .........  ................................
248*4882a593Smuzhiyun  #
249*4882a593Smuzhiyun      87.27%  hackbench  [vdso]
250*4882a593Smuzhiyun       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so
251*4882a593Smuzhiyun       2.62%  hackbench  /lib/ld-2.9.so
252*4882a593Smuzhiyun       1.52%       perf  [vdso]
253*4882a593Smuzhiyun       1.22%  hackbench  ./hackbench
254*4882a593Smuzhiyun       0.48%  hackbench  [kernel]
255*4882a593Smuzhiyun       0.02%       perf  /lib/i686/cmov/libc-2.9.so
256*4882a593Smuzhiyun       0.01%       perf  /usr/bin/perf
257*4882a593Smuzhiyun       0.01%       perf  /lib/ld-2.9.so
258*4882a593Smuzhiyun       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
259*4882a593Smuzhiyun  #
260*4882a593Smuzhiyun  # (For more details, try: perf report --sort comm,dso,symbol)
261*4882a593Smuzhiyun  #
262*4882a593Smuzhiyun
263*4882a593SmuzhiyunAccording to this, the vast majority of events triggered on events
264*4882a593Smuzhiyunwithin the VDSO. With simple binaries, this will often be the case so let's
265*4882a593Smuzhiyuntake a slightly different example. In the course of writing this, it was
266*4882a593Smuzhiyunnoticed that X was generating an insane amount of page allocations so let's look
267*4882a593Smuzhiyunat it:
268*4882a593Smuzhiyun::
269*4882a593Smuzhiyun
270*4882a593Smuzhiyun  $ perf record -c 1 -f \
271*4882a593Smuzhiyun		-e kmem:mm_page_alloc -e kmem:mm_page_free \
272*4882a593Smuzhiyun		-e kmem:mm_page_free_batched \
273*4882a593Smuzhiyun		-p `pidof X`
274*4882a593Smuzhiyun
275*4882a593SmuzhiyunThis was interrupted after a few seconds and
276*4882a593Smuzhiyun::
277*4882a593Smuzhiyun
278*4882a593Smuzhiyun  $ perf report
279*4882a593Smuzhiyun  # Samples: 27666
280*4882a593Smuzhiyun  #
281*4882a593Smuzhiyun  # Overhead  Command                            Shared Object
282*4882a593Smuzhiyun  # ........  .......  .......................................
283*4882a593Smuzhiyun  #
284*4882a593Smuzhiyun      51.95%     Xorg  [vdso]
285*4882a593Smuzhiyun      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
286*4882a593Smuzhiyun       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so
287*4882a593Smuzhiyun       0.01%     Xorg  [kernel]
288*4882a593Smuzhiyun  #
289*4882a593Smuzhiyun  # (For more details, try: perf report --sort comm,dso,symbol)
290*4882a593Smuzhiyun  #
291*4882a593Smuzhiyun
292*4882a593SmuzhiyunSo, almost half of the events are occurring in a library. To get an idea which
293*4882a593Smuzhiyunsymbol:
294*4882a593Smuzhiyun::
295*4882a593Smuzhiyun
296*4882a593Smuzhiyun  $ perf report --sort comm,dso,symbol
297*4882a593Smuzhiyun  # Samples: 27666
298*4882a593Smuzhiyun  #
299*4882a593Smuzhiyun  # Overhead  Command                            Shared Object  Symbol
300*4882a593Smuzhiyun  # ........  .......  .......................................  ......
301*4882a593Smuzhiyun  #
302*4882a593Smuzhiyun      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
303*4882a593Smuzhiyun      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
304*4882a593Smuzhiyun       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
305*4882a593Smuzhiyun       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
306*4882a593Smuzhiyun       0.01%     Xorg  [kernel]                                 [k] read_hpet
307*4882a593Smuzhiyun       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
308*4882a593Smuzhiyun       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
309*4882a593Smuzhiyun
310*4882a593SmuzhiyunTo see where within the function pixmanFillsse2 things are going wrong:
311*4882a593Smuzhiyun::
312*4882a593Smuzhiyun
313*4882a593Smuzhiyun  $ perf annotate pixmanFillsse2
314*4882a593Smuzhiyun  [ ... ]
315*4882a593Smuzhiyun    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
316*4882a593Smuzhiyun         :      }
317*4882a593Smuzhiyun         :
318*4882a593Smuzhiyun         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
319*4882a593Smuzhiyun         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
320*4882a593Smuzhiyun         :        *__P = __B;
321*4882a593Smuzhiyun   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
322*4882a593Smuzhiyun    0.00 :         34ef5:       ff
323*4882a593Smuzhiyun   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
324*4882a593Smuzhiyun    0.00 :         34efd:       ff
325*4882a593Smuzhiyun   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
326*4882a593Smuzhiyun    0.00 :         34f05:       ff
327*4882a593Smuzhiyun   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
328*4882a593Smuzhiyun    0.00 :         34f0d:       ff
329*4882a593Smuzhiyun   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
330*4882a593Smuzhiyun   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
331*4882a593Smuzhiyun   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
332*4882a593Smuzhiyun   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
333*4882a593Smuzhiyun
334*4882a593SmuzhiyunAt a glance, it looks like the time is being spent copying pixmaps to
335*4882a593Smuzhiyunthe card.  Further investigation would be needed to determine why pixmaps
336*4882a593Smuzhiyunare being copied around so much but a starting point would be to take an
337*4882a593Smuzhiyunancient build of libpixmap out of the library path where it was totally
338*4882a593Smuzhiyunforgotten about from months ago!
339