1*4882a593Smuzhiyun========================================================= 2*4882a593SmuzhiyunNotes on Analysing Behaviour Using Events and Tracepoints 3*4882a593Smuzhiyun========================================================= 4*4882a593Smuzhiyun:Author: Mel Gorman (PCL information heavily based on email from Ingo Molnar) 5*4882a593Smuzhiyun 6*4882a593Smuzhiyun1. Introduction 7*4882a593Smuzhiyun=============== 8*4882a593Smuzhiyun 9*4882a593SmuzhiyunTracepoints (see Documentation/trace/tracepoints.rst) can be used without 10*4882a593Smuzhiyuncreating custom kernel modules to register probe functions using the event 11*4882a593Smuzhiyuntracing infrastructure. 12*4882a593Smuzhiyun 13*4882a593SmuzhiyunSimplistically, tracepoints represent important events that can be 14*4882a593Smuzhiyuntaken in conjunction with other tracepoints to build a "Big Picture" of 15*4882a593Smuzhiyunwhat is going on within the system. There are a large number of methods for 16*4882a593Smuzhiyungathering and interpreting these events. Lacking any current Best Practises, 17*4882a593Smuzhiyunthis document describes some of the methods that can be used. 18*4882a593Smuzhiyun 19*4882a593SmuzhiyunThis document assumes that debugfs is mounted on /sys/kernel/debug and that 20*4882a593Smuzhiyunthe appropriate tracing options have been configured into the kernel. It is 21*4882a593Smuzhiyunassumed that the PCL tool tools/perf has been installed and is in your path. 22*4882a593Smuzhiyun 23*4882a593Smuzhiyun2. Listing Available Events 24*4882a593Smuzhiyun=========================== 25*4882a593Smuzhiyun 26*4882a593Smuzhiyun2.1 Standard Utilities 27*4882a593Smuzhiyun---------------------- 28*4882a593Smuzhiyun 29*4882a593SmuzhiyunAll possible events are visible from /sys/kernel/debug/tracing/events. Simply 30*4882a593Smuzhiyuncalling:: 31*4882a593Smuzhiyun 32*4882a593Smuzhiyun $ find /sys/kernel/debug/tracing/events -type d 33*4882a593Smuzhiyun 34*4882a593Smuzhiyunwill give a fair indication of the number of events available. 35*4882a593Smuzhiyun 36*4882a593Smuzhiyun2.2 PCL (Performance Counters for Linux) 37*4882a593Smuzhiyun---------------------------------------- 38*4882a593Smuzhiyun 39*4882a593SmuzhiyunDiscovery and enumeration of all counters and events, including tracepoints, 40*4882a593Smuzhiyunare available with the perf tool. Getting a list of available events is a 41*4882a593Smuzhiyunsimple case of:: 42*4882a593Smuzhiyun 43*4882a593Smuzhiyun $ perf list 2>&1 | grep Tracepoint 44*4882a593Smuzhiyun ext4:ext4_free_inode [Tracepoint event] 45*4882a593Smuzhiyun ext4:ext4_request_inode [Tracepoint event] 46*4882a593Smuzhiyun ext4:ext4_allocate_inode [Tracepoint event] 47*4882a593Smuzhiyun ext4:ext4_write_begin [Tracepoint event] 48*4882a593Smuzhiyun ext4:ext4_ordered_write_end [Tracepoint event] 49*4882a593Smuzhiyun [ .... remaining output snipped .... ] 50*4882a593Smuzhiyun 51*4882a593Smuzhiyun 52*4882a593Smuzhiyun3. Enabling Events 53*4882a593Smuzhiyun================== 54*4882a593Smuzhiyun 55*4882a593Smuzhiyun3.1 System-Wide Event Enabling 56*4882a593Smuzhiyun------------------------------ 57*4882a593Smuzhiyun 58*4882a593SmuzhiyunSee Documentation/trace/events.rst for a proper description on how events 59*4882a593Smuzhiyuncan be enabled system-wide. A short example of enabling all events related 60*4882a593Smuzhiyunto page allocation would look something like:: 61*4882a593Smuzhiyun 62*4882a593Smuzhiyun $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done 63*4882a593Smuzhiyun 64*4882a593Smuzhiyun3.2 System-Wide Event Enabling with SystemTap 65*4882a593Smuzhiyun--------------------------------------------- 66*4882a593Smuzhiyun 67*4882a593SmuzhiyunIn SystemTap, tracepoints are accessible using the kernel.trace() function 68*4882a593Smuzhiyuncall. The following is an example that reports every 5 seconds what processes 69*4882a593Smuzhiyunwere allocating the pages. 70*4882a593Smuzhiyun:: 71*4882a593Smuzhiyun 72*4882a593Smuzhiyun global page_allocs 73*4882a593Smuzhiyun 74*4882a593Smuzhiyun probe kernel.trace("mm_page_alloc") { 75*4882a593Smuzhiyun page_allocs[execname()]++ 76*4882a593Smuzhiyun } 77*4882a593Smuzhiyun 78*4882a593Smuzhiyun function print_count() { 79*4882a593Smuzhiyun printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") 80*4882a593Smuzhiyun foreach (proc in page_allocs-) 81*4882a593Smuzhiyun printf("%-25d %s\n", page_allocs[proc], proc) 82*4882a593Smuzhiyun printf ("\n") 83*4882a593Smuzhiyun delete page_allocs 84*4882a593Smuzhiyun } 85*4882a593Smuzhiyun 86*4882a593Smuzhiyun probe timer.s(5) { 87*4882a593Smuzhiyun print_count() 88*4882a593Smuzhiyun } 89*4882a593Smuzhiyun 90*4882a593Smuzhiyun3.3 System-Wide Event Enabling with PCL 91*4882a593Smuzhiyun--------------------------------------- 92*4882a593Smuzhiyun 93*4882a593SmuzhiyunBy specifying the -a switch and analysing sleep, the system-wide events 94*4882a593Smuzhiyunfor a duration of time can be examined. 95*4882a593Smuzhiyun:: 96*4882a593Smuzhiyun 97*4882a593Smuzhiyun $ perf stat -a \ 98*4882a593Smuzhiyun -e kmem:mm_page_alloc -e kmem:mm_page_free \ 99*4882a593Smuzhiyun -e kmem:mm_page_free_batched \ 100*4882a593Smuzhiyun sleep 10 101*4882a593Smuzhiyun Performance counter stats for 'sleep 10': 102*4882a593Smuzhiyun 103*4882a593Smuzhiyun 9630 kmem:mm_page_alloc 104*4882a593Smuzhiyun 2143 kmem:mm_page_free 105*4882a593Smuzhiyun 7424 kmem:mm_page_free_batched 106*4882a593Smuzhiyun 107*4882a593Smuzhiyun 10.002577764 seconds time elapsed 108*4882a593Smuzhiyun 109*4882a593SmuzhiyunSimilarly, one could execute a shell and exit it as desired to get a report 110*4882a593Smuzhiyunat that point. 111*4882a593Smuzhiyun 112*4882a593Smuzhiyun3.4 Local Event Enabling 113*4882a593Smuzhiyun------------------------ 114*4882a593Smuzhiyun 115*4882a593SmuzhiyunDocumentation/trace/ftrace.rst describes how to enable events on a per-thread 116*4882a593Smuzhiyunbasis using set_ftrace_pid. 117*4882a593Smuzhiyun 118*4882a593Smuzhiyun3.5 Local Event Enablement with PCL 119*4882a593Smuzhiyun----------------------------------- 120*4882a593Smuzhiyun 121*4882a593SmuzhiyunEvents can be activated and tracked for the duration of a process on a local 122*4882a593Smuzhiyunbasis using PCL such as follows. 123*4882a593Smuzhiyun:: 124*4882a593Smuzhiyun 125*4882a593Smuzhiyun $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ 126*4882a593Smuzhiyun -e kmem:mm_page_free_batched ./hackbench 10 127*4882a593Smuzhiyun Time: 0.909 128*4882a593Smuzhiyun 129*4882a593Smuzhiyun Performance counter stats for './hackbench 10': 130*4882a593Smuzhiyun 131*4882a593Smuzhiyun 17803 kmem:mm_page_alloc 132*4882a593Smuzhiyun 12398 kmem:mm_page_free 133*4882a593Smuzhiyun 4827 kmem:mm_page_free_batched 134*4882a593Smuzhiyun 135*4882a593Smuzhiyun 0.973913387 seconds time elapsed 136*4882a593Smuzhiyun 137*4882a593Smuzhiyun4. Event Filtering 138*4882a593Smuzhiyun================== 139*4882a593Smuzhiyun 140*4882a593SmuzhiyunDocumentation/trace/ftrace.rst covers in-depth how to filter events in 141*4882a593Smuzhiyunftrace. Obviously using grep and awk of trace_pipe is an option as well 142*4882a593Smuzhiyunas any script reading trace_pipe. 143*4882a593Smuzhiyun 144*4882a593Smuzhiyun5. Analysing Event Variances with PCL 145*4882a593Smuzhiyun===================================== 146*4882a593Smuzhiyun 147*4882a593SmuzhiyunAny workload can exhibit variances between runs and it can be important 148*4882a593Smuzhiyunto know what the standard deviation is. By and large, this is left to the 149*4882a593Smuzhiyunperformance analyst to do it by hand. In the event that the discrete event 150*4882a593Smuzhiyunoccurrences are useful to the performance analyst, then perf can be used. 151*4882a593Smuzhiyun:: 152*4882a593Smuzhiyun 153*4882a593Smuzhiyun $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free 154*4882a593Smuzhiyun -e kmem:mm_page_free_batched ./hackbench 10 155*4882a593Smuzhiyun Time: 0.890 156*4882a593Smuzhiyun Time: 0.895 157*4882a593Smuzhiyun Time: 0.915 158*4882a593Smuzhiyun Time: 1.001 159*4882a593Smuzhiyun Time: 0.899 160*4882a593Smuzhiyun 161*4882a593Smuzhiyun Performance counter stats for './hackbench 10' (5 runs): 162*4882a593Smuzhiyun 163*4882a593Smuzhiyun 16630 kmem:mm_page_alloc ( +- 3.542% ) 164*4882a593Smuzhiyun 11486 kmem:mm_page_free ( +- 4.771% ) 165*4882a593Smuzhiyun 4730 kmem:mm_page_free_batched ( +- 2.325% ) 166*4882a593Smuzhiyun 167*4882a593Smuzhiyun 0.982653002 seconds time elapsed ( +- 1.448% ) 168*4882a593Smuzhiyun 169*4882a593SmuzhiyunIn the event that some higher-level event is required that depends on some 170*4882a593Smuzhiyunaggregation of discrete events, then a script would need to be developed. 171*4882a593Smuzhiyun 172*4882a593SmuzhiyunUsing --repeat, it is also possible to view how events are fluctuating over 173*4882a593Smuzhiyuntime on a system-wide basis using -a and sleep. 174*4882a593Smuzhiyun:: 175*4882a593Smuzhiyun 176*4882a593Smuzhiyun $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ 177*4882a593Smuzhiyun -e kmem:mm_page_free_batched \ 178*4882a593Smuzhiyun -a --repeat 10 \ 179*4882a593Smuzhiyun sleep 1 180*4882a593Smuzhiyun Performance counter stats for 'sleep 1' (10 runs): 181*4882a593Smuzhiyun 182*4882a593Smuzhiyun 1066 kmem:mm_page_alloc ( +- 26.148% ) 183*4882a593Smuzhiyun 182 kmem:mm_page_free ( +- 5.464% ) 184*4882a593Smuzhiyun 890 kmem:mm_page_free_batched ( +- 30.079% ) 185*4882a593Smuzhiyun 186*4882a593Smuzhiyun 1.002251757 seconds time elapsed ( +- 0.005% ) 187*4882a593Smuzhiyun 188*4882a593Smuzhiyun6. Higher-Level Analysis with Helper Scripts 189*4882a593Smuzhiyun============================================ 190*4882a593Smuzhiyun 191*4882a593SmuzhiyunWhen events are enabled the events that are triggering can be read from 192*4882a593Smuzhiyun/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary 193*4882a593Smuzhiyunoptions exist as well. By post-processing the output, further information can 194*4882a593Smuzhiyunbe gathered on-line as appropriate. Examples of post-processing might include 195*4882a593Smuzhiyun 196*4882a593Smuzhiyun - Reading information from /proc for the PID that triggered the event 197*4882a593Smuzhiyun - Deriving a higher-level event from a series of lower-level events. 198*4882a593Smuzhiyun - Calculating latencies between two events 199*4882a593Smuzhiyun 200*4882a593SmuzhiyunDocumentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example 201*4882a593Smuzhiyunscript that can read trace_pipe from STDIN or a copy of a trace. When used 202*4882a593Smuzhiyunon-line, it can be interrupted once to generate a report without exiting 203*4882a593Smuzhiyunand twice to exit. 204*4882a593Smuzhiyun 205*4882a593SmuzhiyunSimplistically, the script just reads STDIN and counts up events but it 206*4882a593Smuzhiyunalso can do more such as 207*4882a593Smuzhiyun 208*4882a593Smuzhiyun - Derive high-level events from many low-level events. If a number of pages 209*4882a593Smuzhiyun are freed to the main allocator from the per-CPU lists, it recognises 210*4882a593Smuzhiyun that as one per-CPU drain even though there is no specific tracepoint 211*4882a593Smuzhiyun for that event 212*4882a593Smuzhiyun - It can aggregate based on PID or individual process number 213*4882a593Smuzhiyun - In the event memory is getting externally fragmented, it reports 214*4882a593Smuzhiyun on whether the fragmentation event was severe or moderate. 215*4882a593Smuzhiyun - When receiving an event about a PID, it can record who the parent was so 216*4882a593Smuzhiyun that if large numbers of events are coming from very short-lived 217*4882a593Smuzhiyun processes, the parent process responsible for creating all the helpers 218*4882a593Smuzhiyun can be identified 219*4882a593Smuzhiyun 220*4882a593Smuzhiyun7. Lower-Level Analysis with PCL 221*4882a593Smuzhiyun================================ 222*4882a593Smuzhiyun 223*4882a593SmuzhiyunThere may also be a requirement to identify what functions within a program 224*4882a593Smuzhiyunwere generating events within the kernel. To begin this sort of analysis, the 225*4882a593Smuzhiyundata must be recorded. At the time of writing, this required root: 226*4882a593Smuzhiyun:: 227*4882a593Smuzhiyun 228*4882a593Smuzhiyun $ perf record -c 1 \ 229*4882a593Smuzhiyun -e kmem:mm_page_alloc -e kmem:mm_page_free \ 230*4882a593Smuzhiyun -e kmem:mm_page_free_batched \ 231*4882a593Smuzhiyun ./hackbench 10 232*4882a593Smuzhiyun Time: 0.894 233*4882a593Smuzhiyun [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] 234*4882a593Smuzhiyun 235*4882a593SmuzhiyunNote the use of '-c 1' to set the event period to sample. The default sample 236*4882a593Smuzhiyunperiod is quite high to minimise overhead but the information collected can be 237*4882a593Smuzhiyunvery coarse as a result. 238*4882a593Smuzhiyun 239*4882a593SmuzhiyunThis record outputted a file called perf.data which can be analysed using 240*4882a593Smuzhiyunperf report. 241*4882a593Smuzhiyun:: 242*4882a593Smuzhiyun 243*4882a593Smuzhiyun $ perf report 244*4882a593Smuzhiyun # Samples: 30922 245*4882a593Smuzhiyun # 246*4882a593Smuzhiyun # Overhead Command Shared Object 247*4882a593Smuzhiyun # ........ ......... ................................ 248*4882a593Smuzhiyun # 249*4882a593Smuzhiyun 87.27% hackbench [vdso] 250*4882a593Smuzhiyun 6.85% hackbench /lib/i686/cmov/libc-2.9.so 251*4882a593Smuzhiyun 2.62% hackbench /lib/ld-2.9.so 252*4882a593Smuzhiyun 1.52% perf [vdso] 253*4882a593Smuzhiyun 1.22% hackbench ./hackbench 254*4882a593Smuzhiyun 0.48% hackbench [kernel] 255*4882a593Smuzhiyun 0.02% perf /lib/i686/cmov/libc-2.9.so 256*4882a593Smuzhiyun 0.01% perf /usr/bin/perf 257*4882a593Smuzhiyun 0.01% perf /lib/ld-2.9.so 258*4882a593Smuzhiyun 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so 259*4882a593Smuzhiyun # 260*4882a593Smuzhiyun # (For more details, try: perf report --sort comm,dso,symbol) 261*4882a593Smuzhiyun # 262*4882a593Smuzhiyun 263*4882a593SmuzhiyunAccording to this, the vast majority of events triggered on events 264*4882a593Smuzhiyunwithin the VDSO. With simple binaries, this will often be the case so let's 265*4882a593Smuzhiyuntake a slightly different example. In the course of writing this, it was 266*4882a593Smuzhiyunnoticed that X was generating an insane amount of page allocations so let's look 267*4882a593Smuzhiyunat it: 268*4882a593Smuzhiyun:: 269*4882a593Smuzhiyun 270*4882a593Smuzhiyun $ perf record -c 1 -f \ 271*4882a593Smuzhiyun -e kmem:mm_page_alloc -e kmem:mm_page_free \ 272*4882a593Smuzhiyun -e kmem:mm_page_free_batched \ 273*4882a593Smuzhiyun -p `pidof X` 274*4882a593Smuzhiyun 275*4882a593SmuzhiyunThis was interrupted after a few seconds and 276*4882a593Smuzhiyun:: 277*4882a593Smuzhiyun 278*4882a593Smuzhiyun $ perf report 279*4882a593Smuzhiyun # Samples: 27666 280*4882a593Smuzhiyun # 281*4882a593Smuzhiyun # Overhead Command Shared Object 282*4882a593Smuzhiyun # ........ ....... ....................................... 283*4882a593Smuzhiyun # 284*4882a593Smuzhiyun 51.95% Xorg [vdso] 285*4882a593Smuzhiyun 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 286*4882a593Smuzhiyun 0.09% Xorg /lib/i686/cmov/libc-2.9.so 287*4882a593Smuzhiyun 0.01% Xorg [kernel] 288*4882a593Smuzhiyun # 289*4882a593Smuzhiyun # (For more details, try: perf report --sort comm,dso,symbol) 290*4882a593Smuzhiyun # 291*4882a593Smuzhiyun 292*4882a593SmuzhiyunSo, almost half of the events are occurring in a library. To get an idea which 293*4882a593Smuzhiyunsymbol: 294*4882a593Smuzhiyun:: 295*4882a593Smuzhiyun 296*4882a593Smuzhiyun $ perf report --sort comm,dso,symbol 297*4882a593Smuzhiyun # Samples: 27666 298*4882a593Smuzhiyun # 299*4882a593Smuzhiyun # Overhead Command Shared Object Symbol 300*4882a593Smuzhiyun # ........ ....... ....................................... ...... 301*4882a593Smuzhiyun # 302*4882a593Smuzhiyun 51.95% Xorg [vdso] [.] 0x000000ffffe424 303*4882a593Smuzhiyun 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 304*4882a593Smuzhiyun 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc 305*4882a593Smuzhiyun 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f 306*4882a593Smuzhiyun 0.01% Xorg [kernel] [k] read_hpet 307*4882a593Smuzhiyun 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path 308*4882a593Smuzhiyun 0.00% Xorg [kernel] [k] ftrace_trace_userstack 309*4882a593Smuzhiyun 310*4882a593SmuzhiyunTo see where within the function pixmanFillsse2 things are going wrong: 311*4882a593Smuzhiyun:: 312*4882a593Smuzhiyun 313*4882a593Smuzhiyun $ perf annotate pixmanFillsse2 314*4882a593Smuzhiyun [ ... ] 315*4882a593Smuzhiyun 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) 316*4882a593Smuzhiyun : } 317*4882a593Smuzhiyun : 318*4882a593Smuzhiyun : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ 319*4882a593Smuzhiyun : _mm_store_si128 (__m128i *__P, __m128i __B) : { 320*4882a593Smuzhiyun : *__P = __B; 321*4882a593Smuzhiyun 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) 322*4882a593Smuzhiyun 0.00 : 34ef5: ff 323*4882a593Smuzhiyun 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) 324*4882a593Smuzhiyun 0.00 : 34efd: ff 325*4882a593Smuzhiyun 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) 326*4882a593Smuzhiyun 0.00 : 34f05: ff 327*4882a593Smuzhiyun 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) 328*4882a593Smuzhiyun 0.00 : 34f0d: ff 329*4882a593Smuzhiyun 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) 330*4882a593Smuzhiyun 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) 331*4882a593Smuzhiyun 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) 332*4882a593Smuzhiyun 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) 333*4882a593Smuzhiyun 334*4882a593SmuzhiyunAt a glance, it looks like the time is being spent copying pixmaps to 335*4882a593Smuzhiyunthe card. Further investigation would be needed to determine why pixmaps 336*4882a593Smuzhiyunare being copied around so much but a starting point would be to take an 337*4882a593Smuzhiyunancient build of libpixmap out of the library path where it was totally 338*4882a593Smuzhiyunforgotten about from months ago! 339