========================
ftrace - Function Tracer
========================

Copyright 2008 Red Hat Inc.

:Author:   Steven Rostedt <srostedt@redhat.com>
:License:  The GNU Free Documentation License, Version 1.2
           (dual licensed under the GPL v2)
:Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
                      John Kacur, and David Teigland.

- Written for: 2.6.28-rc2
- Updated for: 3.10
- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
- Converted to rst format - Changbin Du <changbin.du@intel.com>

Introduction
------------

Ftrace is an internal tracer designed to help developers and
designers of systems find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and
performance issues that take place outside of user-space.

Although ftrace is typically considered the function tracer, it
is really a framework of several assorted tracing utilities.
There's latency tracing to examine what occurs between interrupts
disabled and enabled, as well as for preemption and from a time
a task is woken to the task is actually scheduled in.

One of the most common uses of ftrace is the event tracing.
Throughout the kernel are hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.

See events.rst for more information.


Implementation Details
----------------------

See :doc:`ftrace-design` for details for arch porters and such.


The File System
---------------

Ftrace uses the tracefs file system to hold the control files as
well as the files to display output.

When tracefs is configured into the kernel (which selecting any ftrace
option will do) the directory /sys/kernel/tracing will be created. To mount
this directory, you can add to your /etc/fstab file::

  tracefs       /sys/kernel/tracing       tracefs defaults        0       0

Or you can mount it at run time with::

  mount -t tracefs nodev /sys/kernel/tracing

For quicker access to that directory you may want to make a soft link to
it::

  ln -s /sys/kernel/tracing /tracing

.. attention::

  Before 4.1, all ftrace tracing control files were within the debugfs
  file system, which is typically located at /sys/kernel/debug/tracing.
  For backward compatibility, when mounting the debugfs file system,
  the tracefs file system will be automatically mounted at:

  /sys/kernel/debug/tracing

  All files located in the tracefs file system will be located in that
  debugfs file system directory as well.

  In order to not automount tracefs in the debugfs filesystem, enable the
  defconfig option CONFIG_TRACEFS_DISABLE_AUTOMOUNT.

.. attention::

  Any selected ftrace option will also create the tracefs file system.
  The rest of the document will assume that you are in the ftrace directory
  (cd /sys/kernel/tracing) and will only concentrate on the files within that
  directory and not distract from the content with the extended
  "/sys/kernel/tracing" path name.

That's it! (assuming that you have ftrace configured into your kernel)

After mounting tracefs you will have access to the control and output files
of ftrace. Here is a list of some of the key files:


  Note: all time values are in microseconds.

  current_tracer:

      This is used to set or display the current tracer
      that is configured. Changing the current tracer clears
      the ring buffer content as well as the "snapshot" buffer.
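As a quick sketch of how current_tracer is driven from the shell (assuming
tracefs is mounted at the default /sys/kernel/tracing path and the shell has
root privileges; the guard keeps the snippet harmless on systems without
tracefs):

```shell
# Sketch: select the function tracer, let it record briefly, then
# look at the output. Paths assume the default tracefs mount point.
TRACEFS=/sys/kernel/tracing

if [ -w "$TRACEFS/current_tracer" ]; then
    cat "$TRACEFS/available_tracers"           # see what is compiled in
    echo function > "$TRACEFS/current_tracer"  # clears the ring buffer
    sleep 1
    echo nop > "$TRACEFS/current_tracer"       # stop function tracing
    head -n 20 "$TRACEFS/trace"
fi
```

Echoing "nop" back into current_tracer is the usual way to stop a tracer,
since "nop" is always listed in available_tracers.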
  available_tracers:

      This holds the different types of tracers that
      have been compiled into the kernel. The
      tracers listed here can be configured by
      echoing their name into current_tracer.

  tracing_on:

      This sets or displays whether writing to the trace
      ring buffer is enabled. Echo 0 into this file to disable
      the tracer or 1 to enable it. Note, this only disables
      writing to the ring buffer; the tracing overhead may
      still be occurring.

      The kernel function tracing_off() can be used within the
      kernel to disable writing to the ring buffer, which will
      set this file to "0". User space can re-enable tracing by
      echoing "1" into the file.

      Note, the function and event trigger "traceoff" will also
      set this file to zero and stop tracing, which can likewise
      be re-enabled by user space using this file.

  trace:

      This file holds the output of the trace in a human
      readable format (described below). Opening this file for
      writing with the O_TRUNC flag clears the ring buffer content.
      Note, this file is not a consumer. If tracing is off
      (no tracer running, or tracing_on is zero), it will produce
      the same output each time it is read.
      When tracing is on,
      it may produce inconsistent results as it tries to read
      the entire buffer without consuming it.

  trace_pipe:

      The output is the same as the "trace" file but this
      file is meant to be streamed with live tracing.
      Reads from this file will block until new data is
      retrieved. Unlike the "trace" file, this file is a
      consumer. This means reading from this file causes
      sequential reads to display more current data. Once
      data is read from this file, it is consumed, and
      will not be read again with a sequential read. The
      "trace" file is static, and if the tracer is not
      adding more data, it will display the same
      information every time it is read.

  trace_options:

      This file lets the user control the amount of data
      that is displayed in one of the above output
      files. Options also exist to modify how a tracer
      or events work (stack traces, timestamps, etc).

  options:

      This is a directory that has a file for every available
      trace option (also in trace_options). Options may also be set
      or cleared by writing a "1" or "0" respectively into the
      corresponding file with the option name.

  tracing_max_latency:

      Some of the tracers record the max latency.
      For example, the maximum time that interrupts are disabled.
      The maximum time is saved in this file. The max trace will also be
      stored, and displayed by "trace". A new max trace will only be
      recorded if the latency is greater than the value in this file
      (in microseconds).

      By echoing a time into this file, no latency will be recorded
      unless it is greater than the time in this file.

  tracing_thresh:

      Some latency tracers will record a trace whenever the
      latency is greater than the number in this file.
      Only active when the file contains a number greater than 0.
      (in microseconds)

  buffer_size_kb:

      This sets or displays the number of kilobytes each CPU
      buffer holds. By default, the trace buffers are the same size
      for each CPU. The displayed number is the size of the
      CPU buffer and not total size of all buffers. The
      trace buffers are allocated in pages (blocks of memory
      that the kernel uses for allocation, usually 4 KB in size).
      A few extra pages may be allocated to accommodate buffer management
      meta-data. If the last page allocated has room for more bytes
      than requested, the rest of the page will be used,
      making the actual allocation bigger than requested or shown.
      ( Note, the size may not be a multiple of the page size
      due to buffer management meta-data. )
      Buffer sizes for individual CPUs may vary
      (see "per_cpu/cpu0/buffer_size_kb" below), and if they do
      this file will show "X".

  buffer_total_size_kb:

      This displays the total combined size of all the trace buffers.

  free_buffer:

      If a process is performing tracing, and the ring buffer should be
      shrunk "freed" when the process is finished, even if it were to be
      killed by a signal, this file can be used for that purpose. On close
      of this file, the ring buffer will be resized to its minimum size.
      Having a process that is tracing also open this file, when the process
      exits its file descriptor for this file will be closed, and in doing so,
      the ring buffer will be "freed".

      It may also stop tracing if the disable_on_free option is set.

  tracing_cpumask:

      This is a mask that lets the user only trace on specified CPUs.
      The format is a hex string representing the CPUs.

  set_ftrace_filter:

      When dynamic ftrace is configured in (see the
      section below "dynamic ftrace"), the code is dynamically
      modified (code text rewrite) to disable calling of the
      function profiler (mcount). This lets tracing be configured
      in with practically no overhead in performance.
      This also
      has a side effect of enabling or disabling specific functions
      to be traced. Echoing names of functions into this file
      will limit the trace to only those functions.
      This influences the tracers "function" and "function_graph"
      and thus also function profiling (see "function_profile_enabled").

      The functions listed in "available_filter_functions" are what
      can be written into this file.

      This interface also allows for commands to be used. See the
      "Filter commands" section for more details.

      As a speed up, since processing strings can be quite expensive
      and requires a check of all functions registered to tracing, instead
      an index can be written into this file. A number (starting with "1")
      written will instead select the function at the corresponding line
      position of the "available_filter_functions" file.

  set_ftrace_notrace:

      This has an effect opposite to that of
      set_ftrace_filter. Any function that is added here will not
      be traced. If a function exists in both set_ftrace_filter
      and set_ftrace_notrace, the function will _not_ be traced.

  set_ftrace_pid:

      Have the function tracer only trace the threads whose PID are
      listed in this file.
      If the "function-fork" option is set, then when a task whose
      PID is listed in this file forks, the child's PID will
      automatically be added to this file, and the child will be
      traced by the function tracer as well. This option will also
      cause PIDs of tasks that exit to be removed from the file.

  set_ftrace_notrace_pid:

      Have the function tracer ignore threads whose PID are listed in
      this file.

      If the "function-fork" option is set, then when a task whose
      PID is listed in this file forks, the child's PID will
      automatically be added to this file, and the child will not be
      traced by the function tracer either. This option will also
      cause PIDs of tasks that exit to be removed from the file.

      If a PID is in both this file and "set_ftrace_pid", then this
      file takes precedence, and the thread will not be traced.

  set_event_pid:

      Have the events only trace a task with a PID listed in this file.
      Note, sched_switch and sched_wakeup will also trace events
      listed in this file.

      To have the PIDs of children of tasks with their PID in this file
      added on fork, enable the "event-fork" option. That option will also
      cause the PIDs of tasks to be removed from this file when the task
      exits.
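A short sketch of the set_event_pid workflow described above (the tracefs
path and the use of the current shell's PID are illustrative; sudo/root is
assumed, and the guard skips the writes when tracefs is unavailable):

```shell
# Sketch: restrict event tracing to the current shell and, via the
# "event-fork" option, to any children it spawns.
TRACEFS=/sys/kernel/tracing
PID=$$   # this shell's PID, used here purely as an example target

if [ -w "$TRACEFS/set_event_pid" ]; then
    echo "$PID" > "$TRACEFS/set_event_pid"
    echo 1 > "$TRACEFS/options/event-fork"  # follow forks, drop exited PIDs
    echo 1 > "$TRACEFS/events/sched/sched_switch/enable"
fi
```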
  set_event_notrace_pid:

      Have the events not trace a task with a PID listed in this file.
      Note, sched_switch and sched_wakeup events may still be traced
      even when a thread's PID is in this file, if those events also
      involve another thread that should be traced.

      To have the PIDs of children of tasks with their PID in this file
      added on fork, enable the "event-fork" option. That option will also
      cause the PIDs of tasks to be removed from this file when the task
      exits.

  set_graph_function:

      Functions listed in this file will cause the function graph
      tracer to only trace these functions and the functions that
      they call. (See the section "dynamic ftrace" for more details).
      Note, set_ftrace_filter and set_ftrace_notrace still affect
      what functions are being traced.

  set_graph_notrace:

      Similar to set_graph_function, but will disable function graph
      tracing when the function is hit until it exits the function.
      This makes it possible to ignore tracing functions that are called
      by a specific function.

  available_filter_functions:

      This lists the functions that ftrace has processed and can trace.
      These are the function names that you can pass to
      "set_ftrace_filter", "set_ftrace_notrace",
      "set_graph_function", or "set_graph_notrace".
      (See the section "dynamic ftrace" below for more details.)

  dyn_ftrace_total_info:

      This file is for debugging purposes. It shows the number of
      functions that have been converted to nops and are available
      to be traced.

  enabled_functions:

      This file is more for debugging ftrace, but can also be useful
      in seeing if any function has a callback attached to it.
      Not only does the trace infrastructure use the ftrace function
      tracing utility, but other subsystems might too. This file
      displays all functions that have a callback attached to them
      as well as the number of callbacks that have been attached.
      Note, a callback may also call multiple functions which will
      not be listed in this count.

      If the callback registered to be traced by a function with
      the "save regs" attribute (thus even more overhead), an 'R'
      will be displayed on the same line as the function that
      is returning registers.

      If the callback registered to be traced by a function with
      the "ip modify" attribute (thus the regs->ip can be changed),
      an 'I' will be displayed on the same line as the function that
      can be overridden.
      If the architecture supports it, it will also show what callback
      is being directly called by the function. If the count is greater
      than 1 it most likely will be ftrace_ops_list_func().

      If the callback of the function jumps to a trampoline that is
      specific to the callback and not the standard trampoline,
      its address will be printed as well as the function that the
      trampoline calls.

  function_profile_enabled:

      When set it will enable all functions with either the function
      tracer, or if configured, the function graph tracer. It will
      keep a histogram of the number of functions that were called
      and if the function graph tracer was configured, it will also keep
      track of the time spent in those functions. The histogram
      content can be displayed in the files:

      trace_stat/function<cpu> ( function0, function1, etc).

  trace_stat:

      A directory that holds different tracing stats.

  kprobe_events:

      Enable dynamic trace points. See kprobetrace.rst.

  kprobe_profile:

      Dynamic trace points stats. See kprobetrace.rst.

  max_graph_depth:

      Used with the function graph tracer. This is the max depth
      it will trace into a function.
      Setting this to a value of
      one will show only the first kernel function that is called
      from user space.

  printk_formats:

      This is for tools that read the raw format files. If an event in
      the ring buffer references a string, only a pointer to the string
      is recorded into the buffer and not the string itself. This prevents
      tools from knowing what that string was. This file displays the string
      and address for the string allowing tools to map the pointers to what
      the strings were.

  saved_cmdlines:

      Only the pid of the task is recorded in a trace event unless
      the event specifically saves the task comm as well. Ftrace
      makes a cache of pid mappings to comms to try to display
      comms for events. If a pid for a comm is not listed, then
      "<...>" is displayed in the output.

      If the option "record-cmd" is set to "0", then comms of tasks
      will not be saved during recording. By default, it is enabled.

  saved_cmdlines_size:

      By default, 128 comms are saved (see "saved_cmdlines" above). To
      increase or decrease the amount of comms that are cached, echo
      the number of comms to cache into this file.
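For example, a sketch of growing the comm cache so that busy traces show
fewer "<...>" placeholders (the value 1024 is just an illustration; the
guard makes the snippet a no-op where tracefs is absent):

```shell
# Sketch: enlarge the pid-to-comm cache used by saved_cmdlines.
TRACEFS=/sys/kernel/tracing
NEW_SIZE=1024    # example value; the default is 128

if [ -w "$TRACEFS/saved_cmdlines_size" ]; then
    cat "$TRACEFS/saved_cmdlines_size"              # current cache size
    echo "$NEW_SIZE" > "$TRACEFS/saved_cmdlines_size"
fi
```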
  saved_tgids:

      If the option "record-tgid" is set, on each scheduling context switch
      the Task Group ID of a task is saved in a table mapping the PID of
      the thread to its TGID. By default, the "record-tgid" option is
      disabled.

  snapshot:

      This displays the "snapshot" buffer and also lets the user
      take a snapshot of the current running trace.
      See the "Snapshot" section below for more details.

  stack_max_size:

      When the stack tracer is activated, this will display the
      maximum stack size it has encountered.
      See the "Stack Trace" section below.

  stack_trace:

      This displays the stack back trace of the largest stack
      that was encountered when the stack tracer is activated.
      See the "Stack Trace" section below.

  stack_trace_filter:

      This is similar to "set_ftrace_filter" but it limits what
      functions the stack tracer will check.

  trace_clock:

      Whenever an event is recorded into the ring buffer, a
      "timestamp" is added. This stamp comes from a specified
      clock. By default, ftrace uses the "local" clock. This
      clock is very fast and strictly per cpu, but on some
      systems it may not be monotonic with respect to other
      CPUs.
      In other words, the local clocks may not be in sync
      with local clocks on other CPUs.

      Usual clocks for tracing::

        # cat trace_clock
        [local] global counter x86-tsc

      The clock with the square brackets around it is the one in effect.

      local:
          Default clock, but may not be in sync across CPUs

      global:
          This clock is in sync with all CPUs but may
          be a bit slower than the local clock.

      counter:
          This is not a clock at all, but literally an atomic
          counter. It counts up one by one, but is in sync
          with all CPUs. This is useful when you need to
          know exactly the order events occurred with respect to
          each other on different CPUs.

      uptime:
          This uses the jiffies counter and the time stamp
          is relative to the time since boot up.

      perf:
          This makes ftrace use the same clock that perf uses.
          Eventually perf will be able to read ftrace buffers
          and this will help out in interleaving the data.

      x86-tsc:
          Architectures may define their own clocks. For
          example, x86 uses its own TSC cycle clock here.

      ppc-tb:
          This uses the powerpc timebase register value.
          This is in sync across CPUs and can also be used
          to correlate events across hypervisor/guest if
          tb_offset is known.

      mono:
          This uses the fast monotonic clock (CLOCK_MONOTONIC)
          which is monotonic and is subject to NTP rate adjustments.

      mono_raw:
          This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
          which is monotonic but is not subject to any rate adjustments
          and ticks at the same rate as the hardware clocksource.

      boot:
          This is the boot clock (CLOCK_BOOTTIME) and is based on the
          fast monotonic clock, but also accounts for time spent in
          suspend. Since the clock access is designed for use in
          tracing in the suspend path, some side effects are possible
          if the clock is accessed after the suspend time is accounted
          for but before the fast mono clock is updated. In this case,
          the clock update appears to happen slightly sooner than it
          normally would have. Also on 32-bit systems, it's possible
          that the 64-bit boot offset sees a partial update. These
          effects are rare and post processing should be able to handle
          them. See comments in the ktime_get_boot_fast_ns() function
          for more information.

      To set a clock, simply echo the clock name into this file::

        # echo global > trace_clock

      Setting a clock clears the ring buffer content as well as the
      "snapshot" buffer.
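A script that wants to restore the clock afterwards first has to pick the
bracketed entry out of trace_clock. A sketch of that parsing step (the
sample line mirrors the format shown above; on a live system you would
read it with `cat /sys/kernel/tracing/trace_clock` instead):

```shell
# Sketch: extract the clock currently in effect -- the entry wrapped
# in square brackets -- from a trace_clock style line.
clock_line='[local] global counter x86-tsc'
current=$(printf '%s\n' "$clock_line" | sed -n 's/.*\[\([^]]*\)\].*/\1/p')
echo "$current"    # prints: local
```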
  trace_marker:

      This is a very useful file for synchronizing user space
      with events happening in the kernel. Writing strings into
      this file will be written into the ftrace buffer.

      It is useful in applications to open this file at the start
      of the application and just reference the file descriptor
      for the file::

        void trace_write(const char *fmt, ...)
        {
                va_list ap;
                char buf[256];
                int n;

                if (trace_fd < 0)
                        return;

                va_start(ap, fmt);
                n = vsnprintf(buf, 256, fmt, ap);
                va_end(ap);

                write(trace_fd, buf, n);
        }

      start::

        trace_fd = open("trace_marker", O_WRONLY);

      Note: Writing into the trace_marker file can also initiate triggers
      that are written into /sys/kernel/tracing/events/ftrace/print/trigger
      See "Event triggers" in Documentation/trace/events.rst and an
      example in Documentation/trace/histogram.rst (Section 3.)

  trace_marker_raw:

      This is similar to trace_marker above, but is meant for binary data
      to be written to it, where a tool can be used to parse the data
      from trace_pipe_raw.
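The same thing the C trace_write() example does can be sketched from the
shell; markers show up in the trace output under the tracing_mark_write
event (the guard keeps the snippet harmless without tracefs or root):

```shell
# Sketch: drop a marker into the ftrace buffer and find it again
# in the human-readable trace output.
TRACEFS=/sys/kernel/tracing

if [ -w "$TRACEFS/trace_marker" ]; then
    echo "hello from user space" > "$TRACEFS/trace_marker"
    grep tracing_mark_write "$TRACEFS/trace" | tail -n 5
fi
```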
  uprobe_events:

      Add dynamic tracepoints in programs.
      See uprobetracer.rst.

  uprobe_profile:

      Uprobe statistics. See uprobetracer.rst.

  instances:

      This is a way to make multiple trace buffers where different
      events can be recorded in different buffers.
      See the "Instances" section below.

  events:

      This is the trace event directory. It holds event tracepoints
      (also known as static tracepoints) that have been compiled
      into the kernel. It shows what event tracepoints exist
      and how they are grouped by system. There are "enable"
      files at various levels that can enable the tracepoints
      when a "1" is written to them.

      See events.rst for more information.

  set_event:

      Echoing an event name into this file will enable that event.

      See events.rst for more information.

  available_events:

      A list of events that can be enabled in tracing.

      See events.rst for more information.

  timestamp_mode:

      Certain tracers may change the timestamp mode used when
      logging trace events into the event buffer.
    Events with
    different modes can coexist within a buffer, but the mode in
    effect when an event is logged determines which timestamp mode
    is used for that event. The default timestamp mode is
    'delta'.

    Usual timestamp modes for tracing::

      # cat timestamp_mode
      [delta] absolute

    The timestamp mode with the square brackets around it is the
    one in effect.

      delta: Default timestamp mode - timestamp is a delta against
             a per-buffer timestamp.

      absolute: The timestamp is a full timestamp, not a delta
                against some other value. As such it takes up more
                space and is less efficient.

  hwlat_detector:

    Directory for the Hardware Latency Detector.
    See "Hardware Latency Detector" section below.

  per_cpu:

    This is a directory that contains the trace per_cpu information.

  per_cpu/cpu0/buffer_size_kb:

    The ftrace buffer is defined per_cpu. That is, there's a separate
    buffer for each CPU to allow writes to be done atomically,
    and free from cache bouncing. Each per-CPU buffer may be
    sized differently. This file is similar to the buffer_size_kb
    file, but it only displays or sets the buffer size for one
    specific CPU (here, cpu0).
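For example, a per-CPU size can be inspected and changed like any other tracefs file (a sketch; assumes tracefs is mounted at /sys/kernel/tracing and commands are run as root):

```shell
# show cpu0's current buffer size, in KB
cat /sys/kernel/tracing/per_cpu/cpu0/buffer_size_kb
# give cpu0 a 4 MB buffer; other CPUs keep their current size
echo 4096 > /sys/kernel/tracing/per_cpu/cpu0/buffer_size_kb
```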
  per_cpu/cpu0/trace:

    This is similar to the "trace" file, but it will only display
    the data specific to the CPU. If written to, it only clears
    the specific CPU buffer.

  per_cpu/cpu0/trace_pipe:

    This is similar to the "trace_pipe" file, and is a consuming
    read, but it will only display (and consume) the data specific
    to the CPU.

  per_cpu/cpu0/trace_pipe_raw:

    For tools that can parse the ftrace ring buffer binary format,
    the trace_pipe_raw file can be used to extract the data
    from the ring buffer directly. With the use of the splice()
    system call, the buffer data can be quickly transferred to
    a file or to the network where a server is collecting the
    data.

    Like trace_pipe, this is a consuming reader, where multiple
    reads will always produce different data.

  per_cpu/cpu0/snapshot:

    This is similar to the main "snapshot" file, but will only
    snapshot the current CPU (if supported). It only displays
    the content of the snapshot for a given CPU, and if
    written to, only clears this CPU buffer.

  per_cpu/cpu0/snapshot_raw:

    Similar to trace_pipe_raw, but will read the binary format
    from the snapshot buffer for the given CPU.
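The splice() approach mentioned for trace_pipe_raw above can be sketched as follows. This is not a tool from the kernel tree; splice() requires a pipe on one side of the transfer, so a private pipe is used as the intermediary between the two file descriptors.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move everything readable from in_fd (e.g. an open trace_pipe_raw)
 * to out_fd without copying the data through user space.
 * Returns the number of bytes moved, or -1 on error. */
long splice_raw(int in_fd, int out_fd)
{
	int p[2];
	long total = 0;
	ssize_t n;

	if (pipe(p) < 0)
		return -1;

	while ((n = splice(in_fd, NULL, p[1], NULL, 64 * 1024, 0)) > 0) {
		/* Drain the pipe fully, handling short writes. */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL, n, 0);
			if (m <= 0) {
				n = -1;
				break;
			}
			n -= m;
			total += m;
		}
		if (n < 0)
			break;
	}
	close(p[0]);
	close(p[1]);
	return n < 0 ? -1 : total;
}
```

Opening trace_pipe_raw with O_NONBLOCK keeps the first splice() from blocking when the buffer is empty, since the consuming read otherwise never reaches end of file.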
  per_cpu/cpu0/stats:

    This displays certain stats about the ring buffer:

      entries:
        The number of events that are still in the buffer.

      overrun:
        The number of lost events due to overwriting when
        the buffer was full.

      commit overrun:
        Should always be zero.
        This gets set if so many events happen within a nested
        event (the ring buffer is re-entrant) that they fill the
        buffer and it starts dropping events.

      bytes:
        Bytes actually read (not overwritten).

      oldest event ts:
        The oldest timestamp in the buffer.

      now ts:
        The current timestamp.

      dropped events:
        Events lost due to the overwrite option being off.

      read events:
        The number of events read.

The Tracers
-----------

Here is the list of current tracers that may be configured.

  "function"

    Function call tracer to trace all kernel functions.
  "function_graph"

    Similar to the function tracer except that the
    function tracer probes the functions at their entry,
    whereas the function graph tracer traces both the entry
    and exit of the functions. It then provides the ability
    to draw a graph of function calls similar to C source
    code.

  "blk"

    The block tracer. The tracer used by the blktrace user
    application.

  "hwlat"

    The Hardware Latency tracer is used to detect if the hardware
    produces any latency. See "Hardware Latency Detector" section
    below.

  "irqsoff"

    Traces the areas that disable interrupts and saves
    the trace with the longest max latency.
    See tracing_max_latency. When a new max is recorded,
    it replaces the old trace. It is best to view this
    trace with the latency-format option enabled, which
    happens automatically when the tracer is selected.

  "preemptoff"

    Similar to irqsoff but traces and records the amount of
    time for which preemption is disabled.

  "preemptirqsoff"

    Similar to irqsoff and preemptoff, but traces and
    records the largest time for which irqs and/or preemption
    is disabled.
  "wakeup"

    Traces and records the max latency that it takes for
    the highest priority task to get scheduled after
    it has been woken up.
    Traces all tasks, as an average developer would expect.

  "wakeup_rt"

    Traces and records the max latency that it takes for just
    RT tasks (whereas the current "wakeup" traces all tasks).
    This is useful for those interested in wake up timings of
    RT tasks.

  "wakeup_dl"

    Traces and records the max latency that it takes for
    a SCHED_DEADLINE task to be woken (as "wakeup" and
    "wakeup_rt" do).

  "mmiotrace"

    A special tracer used to trace binary modules.
    It traces all the calls that a module makes to the
    hardware, as well as everything it writes to and reads
    from the I/O space.

  "branch"

    This tracer can be configured when tracing likely/unlikely
    calls within the kernel. It traces when a likely or
    unlikely branch is taken and whether its prediction
    was correct.

  "nop"

    This is the "trace nothing" tracer. To remove all
    tracers from tracing simply echo "nop" into
    current_tracer.
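Selecting one of the tracers above is done through the current_tracer file. A sketch, assuming the usual available_tracers file lists the compiled-in tracers and tracefs is mounted at /sys/kernel/tracing:

```shell
# see which of the tracers this kernel was built with
cat /sys/kernel/tracing/available_tracers
# select the function tracer, look at some output, then go back to nop
echo function > /sys/kernel/tracing/current_tracer
head -20 /sys/kernel/tracing/trace
echo nop > /sys/kernel/tracing/current_tracer
```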
Error conditions
----------------

  For most ftrace commands, failure modes are obvious and communicated
  using standard return codes.

  For other more involved commands, extended error information may be
  available via the tracing/error_log file. For the commands that
  support it, reading the tracing/error_log file after an error will
  display more detailed information about what went wrong, if
  information is available. The tracing/error_log file is a circular
  error log displaying a small number (currently, 8) of ftrace errors
  for the last (8) failed commands.

  The extended error information and usage takes the form shown in
  this example::

    # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
    echo: write error: Invalid argument

    # cat /sys/kernel/debug/tracing/error_log
    [ 5348.887237] location: error: Couldn't yyy: zzz
      Command: xxx
               ^
    [ 7517.023364] location: error: Bad rrr: sss
      Command: ppp qqq
               ^

  To clear the error log, echo the empty string into it::

    # echo > /sys/kernel/debug/tracing/error_log

Examples of using the tracer
----------------------------

Here are typical examples of using the tracers when controlling
them only with the tracefs
interface (without using any
user-land utilities).

Output format:
--------------

Here is an example of the output format of the file "trace"::

  # tracer: function
  #
  # entries-in-buffer/entries-written: 140080/250280   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
            bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
            bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
            sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
            bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
            bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
            bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
            bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
            bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
            sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
            ....

A header is printed with the tracer name that is represented by
the trace. In this case the tracer is "function".
Then it shows the
number of events in the buffer as well as the total number of entries
that were written. The difference is the number of entries that were
lost due to the buffer filling up (250280 - 140080 = 110200 events
lost).

The header explains the content of the events. Task name "bash", the task
PID "1977", the CPU that it was running on "000", the latency format
(explained below), the timestamp in <secs>.<usecs> format, the
function name that was traced "sys_close" and the parent function that
called this function "system_call_fastpath". The timestamp is the time
at which the function was entered.

Latency trace format
--------------------

When the latency-format option is enabled or when one of the latency
tracers is set, the trace file gives somewhat more information to see
why a latency happened.
Here is a typical trace::

  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: __lock_task_sighand
  #  => ended at:   _raw_spin_unlock_irqrestore
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
        ps-6143    2d..1  306us : <stack trace>
   => trace_hardirqs_on_caller
   => trace_hardirqs_on
   => _raw_spin_unlock_irqrestore
   => do_task_stat
   => proc_tgid_stat
   => proc_single_show
   => seq_read
   => vfs_read
   => sys_read
   => system_call_fastpath

This shows that the current tracer is "irqsoff" tracing the time
for which interrupts were disabled.
It gives the trace version (which
never changes) and the version of the kernel upon which this was executed
(3.8). Then it displays the max latency in microseconds (259 us), the number
of trace entries displayed and the total number (both are four: #4/4).
VP, KP, SP, and HP are always zero and are reserved for later use.
#P is the number of online CPUs (#P:4).

The task is the process that was running when the latency
occurred (ps pid: 6143).

The start and stop (the functions in which the interrupts were
disabled and enabled respectively) that caused the latencies:

  - __lock_task_sighand is where the interrupts were disabled.
  - _raw_spin_unlock_irqrestore is where they were enabled again.

The next lines after the header are the trace itself. The header
explains which is which.

  cmd: The name of the process in the trace.

  pid: The PID of that process.

  CPU#: The CPU which the process was running on.

  irqs-off: 'd' interrupts are disabled. '.' otherwise.

    .. caution:: If the architecture does not support a way to
          read the irq flags variable, an 'X' will always
          be printed here.

  need-resched:
    - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
    - 'n' only TIF_NEED_RESCHED is set,
    - 'p' only PREEMPT_NEED_RESCHED is set,
    - '.' otherwise.
  hardirq/softirq:
    - 'Z' - NMI occurred inside a hardirq
    - 'z' - NMI is running
    - 'H' - hard irq occurred inside a softirq
    - 'h' - hard irq is running
    - 's' - soft irq is running
    - '.' - normal context.

  preempt-depth: The level of preempt_disabled

The above is mostly meaningful for kernel developers.

  time:
    When the latency-format option is enabled, the trace file
    output includes a timestamp relative to the start of the
    trace. This differs from the output when latency-format
    is disabled, which includes an absolute timestamp.

  delay:
    This is just to help catch your eye a bit better, and
    needs to be fixed to be only relative to the same CPU.
    The marks are determined by the difference between this
    current trace and the next trace.

      - '$' - greater than 1 second
      - '@' - greater than 100 milliseconds
      - '*' - greater than 10 milliseconds
      - '#' - greater than 1000 microseconds
      - '!' - greater than 100 microseconds
      - '+' - greater than 10 microseconds
      - ' ' - less than or equal to 10 microseconds.

  The rest is the same as the 'trace' file.

  Note, the latency tracers will usually end with a back trace
  to easily find where the latency occurred.
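The marker table above can be summarized in code. This is a sketch of the mapping for illustration, not the kernel's own implementation:

```c
/* Map the delta between two trace entries, in microseconds, to the
 * delay marker printed after the timestamp in the trace output. */
char delay_mark(unsigned long long us)
{
	if (us > 1000000)
		return '$';	/* greater than 1 second */
	if (us > 100000)
		return '@';	/* greater than 100 milliseconds */
	if (us > 10000)
		return '*';	/* greater than 10 milliseconds */
	if (us > 1000)
		return '#';	/* greater than 1000 microseconds */
	if (us > 100)
		return '!';	/* greater than 100 microseconds */
	if (us > 10)
		return '+';	/* greater than 10 microseconds */
	return ' ';		/* 10 microseconds or less */
}
```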
trace_options
-------------

The trace_options file (or the options directory) is used to control
what gets printed in the trace output, or to manipulate the tracers.
To see what is available, simply cat the file::

  cat trace_options
  print-parent
  nosym-offset
  nosym-addr
  noverbose
  noraw
  nohex
  nobin
  noblock
  trace_printk
  annotate
  nouserstacktrace
  nosym-userobj
  noprintk-msg-only
  context-info
  nolatency-format
  record-cmd
  norecord-tgid
  overwrite
  nodisable_on_free
  irq-info
  markers
  noevent-fork
  function-trace
  nofunction-fork
  nodisplay-graph
  nostacktrace
  nobranch

To disable one of the options, echo in the option prepended with
"no"::

  echo noprint-parent > trace_options

To enable an option, leave off the "no"::

  echo sym-offset > trace_options

Here are the available options:

  print-parent
    On function traces, display the calling (parent)
    function as well as the function being traced.
    ::

      print-parent:
       bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul

      noprint-parent:
       bash-4000  [01]  1477.606694: simple_strtoul


  sym-offset
    Display not only the function name, but also the
    offset in the function. For example, instead of
    seeing just "ktime_get", you will see
    "ktime_get+0xb/0x20".
    ::

      sym-offset:
       bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0

  sym-addr
    This will also display the function address as well
    as the function name.
    ::

      sym-addr:
       bash-4000  [01]  1477.606694: simple_strtoul <c0339346>

  verbose
    This deals with the trace file when the
    latency-format option is enabled.
    ::

      bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
      (+0.000ms): simple_strtoul (kstrtoul)

  raw
    This will display raw numbers. This option is best for
    use with user applications that can translate the raw
    numbers better than having it done in the kernel.

  hex
    Similar to raw, but the numbers will be in a hexadecimal format.
  bin
    This will print out the formats in raw binary.

  block
    When set, reading trace_pipe will not block when polled.

  trace_printk
    Can disable trace_printk() from writing into the buffer.

  annotate
    It is sometimes confusing when the CPU buffers are full
    and one CPU buffer had a lot of events recently, thus
    a shorter time frame, where another CPU may have only had
    a few events, which lets it have older events. When
    the trace is reported, it shows the oldest events first,
    and it may look like only one CPU ran (the one with the
    oldest events). When the annotate option is set, it will
    display when a new CPU buffer started::

      <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
      <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
      <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
      ##### CPU 2 buffer started ####
      <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
      <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
      <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock

  userstacktrace
    This option changes the trace. It records a
    stacktrace of the current user space thread after
    each trace event.
  sym-userobj
    When user stacktraces are enabled, look up which
    object the address belongs to, and print a
    relative address. This is especially useful when
    ASLR is on, otherwise you don't get a chance to
    resolve the address to object/file/line after
    the app is no longer running.

    The lookup is performed when you read
    trace or trace_pipe. Example::

      a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
      x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]


  printk-msg-only
    When set, trace_printk()s will only show the format
    and not their parameters (if trace_bprintk() or
    trace_bputs() was used to save the trace_printk()).

  context-info
    Show the context information (comm, PID, timestamp, CPU,
    and other useful data) with each event. When disabled,
    only the event data is shown.

  latency-format
    This option changes the trace output. When it is enabled,
    the trace displays additional information about the
    latency, as described in "Latency trace format".

  pause-on-trace
    When set, opening the trace file for read will pause
    writing to the ring buffer (as if tracing_on was set to zero).
    This simulates the original behavior of the trace file.
    When the file is closed, tracing will be enabled again.
  record-cmd
    When any event or tracer is enabled, a hook is enabled
    in the sched_switch trace point to fill the comm cache
    with mapped pids and comms. But this may cause some
    overhead, and if you only care about pids, and not the
    name of the task, disabling this option can lower the
    impact of tracing. See "saved_cmdlines".

  record-tgid
    When any event or tracer is enabled, a hook is enabled
    in the sched_switch trace point to fill the cache of
    Thread Group IDs (TGIDs) mapped to pids. See
    "saved_tgids".

  overwrite
    This controls what happens when the trace buffer is
    full. If "1" (default), the oldest events are
    discarded and overwritten. If "0", then the newest
    events are discarded.
    (see per_cpu/cpu0/stats for overrun and dropped)

  disable_on_free
    When the free_buffer file is closed, tracing will
    stop (tracing_on set to 0).

  irq-info
    Shows the interrupt, preempt count, and need resched data.
    When disabled, the trace looks like::

      # tracer: function
      #
      # entries-in-buffer/entries-written: 144405/9452052   #P:4
      #
      #           TASK-PID   CPU#      TIMESTAMP  FUNCTION
      #              | |       |          |         |
        <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
        <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
        <idle>-0     [002]  23636.756055: enqueue_task <-activate_task


  markers
    When set, the trace_marker is writable (only by root).
    When disabled, the trace_marker will error with EINVAL
    on write.

  event-fork
    When set, tasks with PIDs listed in set_event_pid will have
    the PIDs of their children added to set_event_pid when those
    tasks fork. Also, when tasks with PIDs in set_event_pid exit,
    their PIDs will be removed from the file.

    This affects PIDs listed in set_event_notrace_pid as well.

  function-trace
    The latency tracers will enable function tracing
    if this option is enabled (which it is by default). When
    it is disabled, the latency tracers do not trace
    functions. This keeps the overhead of the tracer down
    when performing latency tests.
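As a sketch of the event-fork option described above (run as root, with tracefs mounted at /sys/kernel/tracing): the current shell's PID is added to set_event_pid, and event-fork makes every child it spawns inherit the filter:

```shell
# trace events only for this shell...
echo $$ > /sys/kernel/tracing/set_event_pid
# ...and for everything it forks from now on
echo 1 > /sys/kernel/tracing/options/event-fork
```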
  function-fork
    When set, tasks with PIDs listed in set_ftrace_pid will
    have the PIDs of their children added to set_ftrace_pid
    when those tasks fork. Also, when tasks with PIDs in
    set_ftrace_pid exit, their PIDs will be removed from the
    file.

    This affects PIDs in set_ftrace_notrace_pid as well.

  display-graph
    When set, the latency tracers (irqsoff, wakeup, etc) will
    use function graph tracing instead of function tracing.

  stacktrace
    When set, a stack trace is recorded after any trace event
    is recorded.

  branch
    Enable branch tracing with the tracer. This enables the branch
    tracer along with the currently set tracer. Enabling this
    with the "nop" tracer is the same as just enabling the
    "branch" tracer.

.. tip:: Some tracers have their own options. They only appear in this
         file when the tracer is active. They always appear in the
         options directory.


Here are the per tracer options:

Options for function tracer:

  func_stack_trace
    When set, a stack trace is recorded after every
    function that is recorded. NOTE!
        Limit the functions that are recorded before enabling this,
        with "set_ftrace_filter", otherwise the system performance
        will be critically degraded. Remember to disable
        this option before clearing the function filter.

Options for function_graph tracer:

  Since the function_graph tracer has a slightly different output
  it has its own options to control what is displayed.

  funcgraph-overrun
        When set, the "overrun" of the graph stack is
        displayed after each function traced. The
        overrun is when the stack depth of the calls
        is greater than what is reserved for each task.
        Each task has a fixed array of functions to
        trace in the call graph. If the depth of the
        calls exceeds that, the function is not traced.
        The overrun is the number of functions missed
        due to exceeding this array.

  funcgraph-cpu
        When set, the CPU number of the CPU where the trace
        occurred is displayed.

  funcgraph-overhead
        When set, if the function takes longer than
        a certain amount, then a delay marker is
        displayed. See "delay" above, under the
        header description.
  funcgraph-proc
        Unlike other tracers, the process' command line
        is not displayed by default, but instead only
        when a task is traced in and out during a context
        switch. Enabling this option displays the command
        of each process at every line.

  funcgraph-duration
        At the end of each function (the return),
        the duration of time spent in the function
        is displayed in microseconds.

  funcgraph-abstime
        When set, the timestamp is displayed at each line.

  funcgraph-irqs
        When disabled, functions that happen inside an
        interrupt will not be traced.

  funcgraph-tail
        When set, the return event will include the function
        that it represents. By default this is off, and
        only a closing curly bracket "}" is displayed for
        the return of a function.

  sleep-time
        When running the function graph tracer, include
        the time a task is scheduled out in its function.
        When enabled, the time the task has been scheduled
        out is accounted as part of the function call.

  graph-time
        When running the function profiler with the function
        graph tracer, include the time to call nested functions.
        When this is not set, the time reported for the function
        will only include the time the function itself executed
        for, not the time for functions that it called.

Options for blk tracer:

  blk_classic
        Shows a more minimalistic output.


irqsoff
-------

When interrupts are disabled, the CPU can not react to any other
external event (besides NMIs and SMIs). This prevents the timer
interrupt from triggering or the mouse interrupt from letting
the kernel know of a new mouse event. The result is added
latency in reaction time.

The irqsoff tracer tracks the time for which interrupts are
disabled. When a new maximum latency is hit, the tracer saves
the trace leading up to that latency point so that every time a
new maximum is reached, the old saved trace is discarded and the
new trace is saved.

To reset the maximum, echo 0 into tracing_max_latency. Here is
an example::

  # echo 0 > options/function-trace
  # echo irqsoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: run_timer_softirq
  #  => ended at:   run_timer_softirq
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
    <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
    <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
    <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
    <idle>-0       0dNs3   25us : <stack trace>
   => _raw_spin_unlock_irq
   => run_timer_softirq
   => __do_softirq
   => call_softirq
   => do_softirq
   => irq_exit
   => smp_apic_timer_interrupt
   => apic_timer_interrupt
   => rcu_idle_exit
   => cpu_idle
   => rest_init
   => start_kernel
   => x86_64_start_reservations
   => x86_64_start_kernel

Here we see that we had a latency of 16 microseconds (which is
very good). The _raw_spin_lock_irq in run_timer_softirq disabled
interrupts. The difference between the 16 and the displayed
timestamp 25us occurred because the clock was incremented
between the time of recording the max latency and the time of
recording the function that had that latency.

Note the above example had function-trace not set. If we set
function-trace, we get a much larger output::

  with echo 1 > options/function-trace

  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: ata_scsi_queuecmd
  #  => ended at:   ata_scsi_queuecmd
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
      bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
      bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
      bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
      bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
      bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
      bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
      bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
      bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
      bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
  [...]
      bash-2042    3d..1   67us : delay_tsc <-__delay
      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
      bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
      bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
      bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
      bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
      bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
      bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
      bash-2042    3d..1  120us : <stack trace>
   => _raw_spin_unlock_irqrestore
   => ata_scsi_queuecmd
   => scsi_dispatch_cmd
   => scsi_request_fn
   => __blk_run_queue_uncond
   => __blk_run_queue
   => blk_queue_bio
   => submit_bio_noacct
   => submit_bio
   => submit_bh
   => __ext3_get_inode_loc
   => ext3_iget
   => ext3_lookup
   => lookup_real
   => __lookup_hash
   => walk_component
   => lookup_last
   => path_lookupat
   => filename_lookup
   => user_path_at_empty
   => user_path_at
   => vfs_fstatat
   => vfs_stat
   => sys_newstat
   => system_call_fastpath


Here we traced a 71 microsecond latency. But we also see all the
functions that were called during that time. Note that by
enabling function tracing, we incur an added overhead. This
overhead may extend the latency times. But nevertheless, this
trace has provided some very helpful debugging information.

If we prefer function graph output instead of function, we can set
the display-graph option::

  with echo 1 > options/display-graph

  # tracer: irqsoff
  #
  # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
  # --------------------------------------------------------------------
  # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: free_debug_processing
  #  => ended at:   return_to_handler
  #
  #
  #                                       _-----=> irqs-off
  #                                      / _----=> need-resched
  #                                     | / _---=> hardirq/softirq
  #                                     || / _--=> preempt-depth
  #                                     ||| /
  #   REL TIME      CPU  TASK/PID       ||||  DURATION                  FUNCTION CALLS
  #      |          |     |    |        ||||   |   |                     |   |   |   |
          0 us |   0)   bash-1507    |  d... |  0.000 us    |  _raw_spin_lock_irqsave();
          0 us |   0)   bash-1507    |  d..1 |  0.378 us    |    do_raw_spin_trylock();
          1 us |   0)   bash-1507    |  d..2 |              |    set_track() {
          2 us |   0)   bash-1507    |  d..2 |              |      save_stack_trace() {
          2 us |   0)   bash-1507    |  d..2 |              |        __save_stack_trace() {
          3 us |   0)   bash-1507    |  d..2 |              |          __unwind_start() {
          3 us |   0)   bash-1507    |  d..2 |              |            get_stack_info() {
          3 us |   0)   bash-1507    |  d..2 |  0.351 us    |              in_task_stack();
          4 us |   0)   bash-1507    |  d..2 |  1.107 us    |            }
  [...]
       3750 us |   0)   bash-1507    |  d..1 |  0.516 us    |  do_raw_spin_unlock();
       3750 us |   0)   bash-1507    |  d..1 |  0.000 us    |  _raw_spin_unlock_irqrestore();
       3764 us |   0)   bash-1507    |  d..1 |  0.000 us    |  tracer_hardirqs_on();
      bash-1507    0d..1 3792us : <stack trace>
   => free_debug_processing
   => __slab_free
   => kmem_cache_free
   => vm_area_free
   => remove_vma
   => exit_mmap
   => mmput
   => begin_new_exec
   => load_elf_binary
   => search_binary_handler
   => __do_execve_file.isra.32
   => __x64_sys_execve
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

preemptoff
----------
When preemption is disabled, we may be able to receive
interrupts but the task cannot be preempted and a higher
priority task must wait for preemption to be enabled again
before it can preempt a lower priority task.

The preemptoff tracer traces the places that disable preemption.
Like the irqsoff tracer, it records the maximum latency for
which preemption was disabled. The control of the preemptoff
tracer is much like that of the irqsoff tracer.
::

  # echo 0 > options/function-trace
  # echo preemptoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
  # tracer: preemptoff
  #
  # preemptoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: do_IRQ
  #  => ended at:   do_IRQ
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
      sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
      sshd-1991    1d..1   46us : irq_exit <-do_IRQ
      sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
      sshd-1991    1d..1   52us : <stack trace>
   => sub_preempt_count
   => irq_exit
   => do_IRQ
   => ret_from_intr


This has some more changes. Preemption was disabled when an
interrupt came in (notice the 'h'), and was enabled on exit.
But we also see that interrupts have been disabled when entering
the preempt off section and leaving it (the 'd'). We do not know if
interrupts were enabled in the meantime or shortly after this
was over.
::

  # tracer: preemptoff
  #
  # preemptoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: wake_up_new_task
  #  => ended at:   task_rq_unlock
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
      bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
      bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
      bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
      bash-1994    1d..1    1us : source_load <-select_task_rq_fair
  [...]
      bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
      bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
      bash-1994    1d..1   13us : add_preempt_count <-irq_enter
      bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
      bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
      bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
      bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
      bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
  [...]
      bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
      bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
      bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
      bash-1994    1d..2   36us : do_softirq <-irq_exit
      bash-1994    1d..2   36us : __do_softirq <-call_softirq
      bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
      bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
      bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
      bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
      bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
  [...]
      bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
      bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
      bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
      bash-1994    1dN.2   82us : idle_cpu <-irq_exit
      bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
      bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
      bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
      bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
      bash-1994    1.N.1  104us : <stack trace>
   => sub_preempt_count
   => _raw_spin_unlock_irqrestore
   => task_rq_unlock
   => wake_up_new_task
   => do_fork
   => sys_clone
   => stub_clone


The above is an example of the preemptoff trace with
function-trace set. Here we see that interrupts were not disabled
the entire time. The irq_enter code lets us know that we entered
an interrupt (the 'h'). Before that, the functions being traced still
show that it is not in an interrupt, but we can see from the
functions themselves that this is not the case.

preemptirqsoff
--------------

Knowing the locations that have interrupts disabled or
preemption disabled for the longest times is helpful. But
sometimes we would like to know when either preemption and/or
interrupts are disabled.
Consider the following code::

    local_irq_disable();
    call_function_with_irqs_off();
    preempt_disable();
    call_function_with_irqs_and_preemption_off();
    local_irq_enable();
    call_function_with_preemption_off();
    preempt_enable();

The irqsoff tracer will record the total length of
call_function_with_irqs_off() and
call_function_with_irqs_and_preemption_off().

The preemptoff tracer will record the total length of
call_function_with_irqs_and_preemption_off() and
call_function_with_preemption_off().

But neither will trace the time that interrupts and/or
preemption is disabled. This total time is the time that we can
not schedule. To record this time, use the preemptirqsoff
tracer.

Again, using this trace is much like the irqsoff and preemptoff
tracers.
::

  # echo 0 > options/function-trace
  # echo preemptirqsoff > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: ata_scsi_queuecmd
  #  => ended at:   ata_scsi_queuecmd
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
        ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
        ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
        ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
        ls-2230    3...1  111us : <stack trace>
   => sub_preempt_count
   => _raw_spin_unlock_irqrestore
   => ata_scsi_queuecmd
   => scsi_dispatch_cmd
   => scsi_request_fn
   => __blk_run_queue_uncond
   => __blk_run_queue
   => blk_queue_bio
   => submit_bio_noacct
   => submit_bio
   => submit_bh
   => ext3_bread
   => ext3_dir_bread
   => htree_dirblock_to_tree
   => ext3_htree_fill_tree
   => ext3_readdir
   => vfs_readdir
   => sys_getdents
   => system_call_fastpath


The trace_hardirqs_off_thunk is called from assembly on x86 when
interrupts are disabled in the assembly code. Without the
function tracing, we do not know if interrupts were enabled
within the preemption points. We do see that it started with
preemption enabled.
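Since the example above began by writing 0 to options/function-trace,
the more detailed variant only requires flipping that option back on
and resetting the maximum before capturing again (a sketch of the
steps, assuming the tracefs directory is the current directory)::

  # echo 1 > options/function-trace
  # echo 0 > tracing_max_latency
  [...]
  # echo 0 > tracing_on
  # cat trace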
Here is a trace with function-trace set::

  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: schedule
  #  => ended at:   mutex_unlock
  #
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
  kworker/-59      3...1    0us : __schedule <-schedule
  kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
  kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
  kworker/-59      3d..2    1us : deactivate_task <-__schedule
  kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
  kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
  kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
  kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
  kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
  kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
  kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
  kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
  kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
  kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
  kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
  kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
  kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
  kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
  kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
  kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
  kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
  kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
  kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
  kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
  kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
        ls-2269    3d..2    7us : finish_task_switch <-__schedule
        ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
        ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
        ls-2269    3d..2    8us : irq_enter <-do_IRQ
        ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
        ls-2269    3d..2    9us : add_preempt_count <-irq_enter
        ls-2269    3d.h2    9us : exit_idle <-do_IRQ
  [...]
        ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
        ls-2269    3d.h2   20us : irq_exit <-do_IRQ
        ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
        ls-2269    3d..3   21us : do_softirq <-irq_exit
        ls-2269    3d..3   21us : __do_softirq <-call_softirq
        ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
        ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
        ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
        ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
        ls-2269    3d.s5   31us : irq_enter <-do_IRQ
        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
  [...]
        ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
        ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
        ls-2269    3d.H5   32us : exit_idle <-do_IRQ
        ls-2269    3d.H5   32us : handle_irq <-do_IRQ
        ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
        ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
  [...]
        ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
        ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
        ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
        ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
        ls-2269    3d..3  159us : idle_cpu <-irq_exit
        ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
        ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
        ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
        ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
        ls-2269    3d...  186us : <stack trace>
   => __mutex_unlock_slowpath
   => mutex_unlock
   => process_output
   => n_tty_write
   => tty_write
   => vfs_write
   => sys_write
   => system_call_fastpath

This is an interesting trace. It started with kworker running and
scheduling out and ls taking over. But as soon as ls released the
rq lock and enabled interrupts (but not preemption) an interrupt
triggered. When the interrupt finished, it started running softirqs.
But while the softirq was running, another interrupt triggered.
When an interrupt is running inside a softirq, the annotation is 'H'.


wakeup
------

One common case that people are interested in tracing is the
time it takes for a task that is woken to actually wake up.
Now for non Real-Time tasks, this can be arbitrary. But tracing
it nonetheless can be interesting.

Without function tracing::

  # echo 0 > options/function-trace
  # echo wakeup > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # chrt -f 5 sleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: wakeup
  #
  # wakeup latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  # -----------------
  #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
  # -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
     <idle>-0   3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
     <idle>-0   3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
     <idle>-0   3d..3   15us : __schedule <-schedule
     <idle>-0   3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H

The tracer only traces the highest priority task in the system
to avoid tracing the normal circumstances. Here we see that the
kworker with a nice priority of -20 (not very nice) took just 15
microseconds from the time it woke up to the time it ran.

Non Real-Time tasks are not that interesting. A more interesting
trace is to concentrate only on Real-Time tasks.

wakeup_rt
---------

In a Real-Time environment it is very important to know the time
it takes from when the highest priority task is woken up to when
it executes. This is also known as "schedule latency". I stress
the point that this is about RT tasks. It is also important to
know the scheduling latency of non-RT tasks, but the average
schedule latency is better for non-RT tasks. Tools like
LatencyTop are more appropriate for such measurements.

Real-Time environments are interested in the worst case latency.
That is, the longest latency it takes for something to happen,
and not the average. We can have a very fast scheduler that may
only have a large latency once in a while, but that would not
work well with Real-Time tasks. The wakeup_rt tracer was designed
to record the worst case wakeups of RT tasks. Non-RT tasks are
not recorded because the tracer only records one worst case and
tracing non-RT tasks that are unpredictable will overwrite the
worst case latency of RT tasks (just run the normal wakeup
tracer for a while to see that effect).

Since this tracer only deals with RT tasks, we will run this
slightly differently than we did with the previous tracers.
Instead of performing an 'ls', we will run 'sleep 1' under
'chrt' which changes the priority of the task.
::

  # echo 0 > options/function-trace
  # echo wakeup_rt > current_tracer
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # chrt -f 5 sleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  # -----------------
  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
  # -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
     <idle>-0   3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
     <idle>-0   3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
     <idle>-0   3d..3    5us : __schedule <-schedule
     <idle>-0   3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep

Running this on an idle system, we see that it only took 5 microseconds
to perform the task switch. Note, since the trace point in the schedule
is before the actual "switch", we stop the tracing when the recorded task
is about to schedule in. This may change if we add a new marker at the
end of the scheduler.

Notice that the recorded task is 'sleep' with the PID of 2389
and it has an rt_prio of 5. This priority is user-space priority
and not the internal kernel priority. The policy is 1 for
SCHED_FIFO and 2 for SCHED_RR.

Note that the trace data shows the internal priority (99 - rtprio).
::

     <idle>-0   3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep

The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
and in the running state 'R'. The sleep task was scheduled in with
2389: 94:R. That is, the priority is the kernel rtprio (99 - 5 = 94)
and it too is in the running state.

Doing the same with chrt -r 5 and function-trace set.
::

  echo 1 > options/function-trace

  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  # -----------------
  #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
  # -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
     <idle>-0   3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
     <idle>-0   3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
     <idle>-0   3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
     <idle>-0   3d.h3    3us : resched_curr <-check_preempt_curr
     <idle>-0   3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
     <idle>-0   3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
     <idle>-0   3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
     <idle>-0   3dNh2    5us : ttwu_stat <-try_to_wake_up
     <idle>-0   3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
     <idle>-0   3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
     <idle>-0   3dNh1    6us : _raw_spin_lock <-__run_hrtimer
     <idle>-0   3dNh1    6us : add_preempt_count <-_raw_spin_lock
     <idle>-0   3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
     <idle>-0   3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
     <idle>-0   3dNh1    7us : tick_program_event <-hrtimer_interrupt
     <idle>-0   3dNh1    7us : clockevents_program_event <-tick_program_event
     <idle>-0   3dNh1    8us : ktime_get <-clockevents_program_event
     <idle>-0   3dNh1    8us : lapic_next_event <-clockevents_program_event
     <idle>-0   3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
     <idle>-0   3dNh1    9us : sub_preempt_count <-irq_exit
     <idle>-0   3dN.2    9us : idle_cpu <-irq_exit
     <idle>-0   3dN.2    9us : rcu_irq_exit <-irq_exit
     <idle>-0   3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
     <idle>-0   3dN.2   10us : sub_preempt_count <-irq_exit
     <idle>-0   3.N.1   11us : rcu_idle_exit <-cpu_idle
     <idle>-0   3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
     <idle>-0   3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
     <idle>-0   3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
     <idle>-0   3dN.1   12us : ktime_get <-tick_nohz_idle_exit
     <idle>-0   3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
     <idle>-0   3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
     <idle>-0   3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
     <idle>-0   3dN.1   13us : add_preempt_count <-_raw_spin_lock
     <idle>-0   3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
     <idle>-0   3dN.2   14us : sched_avg_update <-__cpu_load_update
     <idle>-0   3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
     <idle>-0   3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
     <idle>-0   3dN.1   15us : calc_load_nohz_stop <-tick_nohz_idle_exit
     <idle>-0   3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
     <idle>-0   3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
     <idle>-0   3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
     <idle>-0   3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
     <idle>-0   3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
     <idle>-0   3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
     <idle>-0   3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
     <idle>-0   3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
     <idle>-0   3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
     <idle>-0   3dN.2   18us : clockevents_program_event <-tick_program_event
     <idle>-0   3dN.2   18us : ktime_get <-clockevents_program_event
     <idle>-0   3dN.2   18us : lapic_next_event <-clockevents_program_event
     <idle>-0   3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
     <idle>-0   3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
     <idle>-0   3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
     <idle>-0   3dN.1   20us : ktime_add_safe <-hrtimer_forward
     <idle>-0   3dN.1   20us : ktime_add_safe <-hrtimer_forward
     <idle>-0   3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
     <idle>-0   3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
     <idle>-0   3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
     <idle>-0   3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
     <idle>-0   3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
     <idle>-0   3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
     <idle>-0   3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
     <idle>-0   3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
     <idle>-0   3dN.2   23us : clockevents_program_event <-tick_program_event
     <idle>-0   3dN.2   23us : ktime_get <-clockevents_program_event
     <idle>-0   3dN.2   23us : lapic_next_event <-clockevents_program_event
     <idle>-0   3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
     <idle>-0   3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
     <idle>-0   3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
     <idle>-0   3dN.1   24us : account_idle_time <-account_idle_ticks
     <idle>-0   3.N.1   25us : sub_preempt_count <-cpu_idle
     <idle>-0   3.N..   25us : schedule <-cpu_idle
     <idle>-0   3.N..   25us : __schedule <-preempt_schedule
     <idle>-0   3.N..   26us : add_preempt_count <-__schedule
     <idle>-0   3.N.1   26us : rcu_note_context_switch <-__schedule
     <idle>-0   3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
     <idle>-0   3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
     <idle>-0   3.N.1   27us : _raw_spin_lock_irq <-__schedule
     <idle>-0   3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
     <idle>-0   3dN.2   28us : put_prev_task_idle <-__schedule
     <idle>-0   3dN.2   28us : pick_next_task_stop <-pick_next_task
     <idle>-0   3dN.2   28us : pick_next_task_rt <-pick_next_task
     <idle>-0   3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
     <idle>-0   3d..3   29us : __schedule <-preempt_schedule
     <idle>-0   3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep

This isn't that big of a trace, even with function tracing enabled,
so I included the entire trace.

The interrupt went off while the system was idle. Somewhere before
task_woken_rt() was called, the NEED_RESCHED flag was set; this is
indicated by the first occurrence of the 'N' flag.

Latency tracing and events
--------------------------
Function tracing can induce a much larger latency, but without
seeing what happens within the latency it is hard to know what
caused it. There is a middle ground, and that is with enabling
events.
::

  # echo 0 > options/function-trace
  # echo wakeup_rt > current_tracer
  # echo 1 > events/enable
  # echo 1 > tracing_on
  # echo 0 > tracing_max_latency
  # chrt -f 5 sleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  # -----------------
  #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
  # -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
     <idle>-0   2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
     <idle>-0   2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
     <idle>-0   2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
     <idle>-0   2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
     <idle>-0   2.N.2    2us : power_end: cpu_id=2
     <idle>-0   2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
     <idle>-0   2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
     <idle>-0   2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
     <idle>-0   2.N.2    5us : rcu_utilization: Start context switch
     <idle>-0   2.N.2    5us : rcu_utilization: End context switch
     <idle>-0   2d..3    6us : __schedule <-schedule
     <idle>-0   2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep


Hardware Latency Detector
-------------------------

The hardware latency detector is executed by enabling the "hwlat" tracer.

Note: this tracer will affect the performance of the system as it will
periodically make a CPU constantly busy with interrupts disabled.
::

  # echo hwlat > current_tracer
  # sleep 100
  # cat trace
  # tracer: hwlat
  #
  # entries-in-buffer/entries-written: 13/13   #P:8
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
  #              | |       |   ||||      |         |
        <...>-1729  [001] d...   678.473449: #1  inner/outer(us): 11/12  ts:1581527483.343962693 count:6
        <...>-1729  [004] d...   689.556542: #2  inner/outer(us): 16/9   ts:1581527494.889008092 count:1
        <...>-1729  [005] d...   714.756290: #3  inner/outer(us): 16/16  ts:1581527519.678961629 count:5
        <...>-1729  [001] d...   718.788247: #4  inner/outer(us): 9/17   ts:1581527523.889012713 count:1
        <...>-1729  [002] d...   719.796341: #5  inner/outer(us): 13/9   ts:1581527524.912872606 count:1
        <...>-1729  [006] d...   844.787091: #6  inner/outer(us): 9/12   ts:1581527649.889048502 count:2
        <...>-1729  [003] d...   849.827033: #7  inner/outer(us): 18/9   ts:1581527654.889013793 count:1
        <...>-1729  [007] d...   853.859002: #8  inner/outer(us): 9/12   ts:1581527658.889065736 count:1
        <...>-1729  [001] d...   855.874978: #9  inner/outer(us): 9/11   ts:1581527660.861991877 count:1
        <...>-1729  [001] d...   863.938932: #10 inner/outer(us): 9/11   ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
        <...>-1729  [007] d...   878.050780: #11 inner/outer(us): 9/12   ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
        <...>-1729  [007] d...   886.114702: #12 inner/outer(us): 9/12   ts:1581527691.385001600 count:1


The header of the above output is mostly the same as for the other
tracers. All events will have interrupts disabled 'd'. Under the
FUNCTION title there is:

 #1
     This is the count of events recorded that were greater than the
     tracing_threshold (see below).

 inner/outer(us): 11/12
     This shows two numbers, the "inner latency" and the "outer latency".
     The test runs in a loop checking a timestamp twice. The latency
     detected within the two timestamps is the "inner latency" and the
     latency detected between one loop iteration's last timestamp and
     the next iteration's first timestamp is the "outer latency".
 ts:1581527483.343962693
     The absolute timestamp that the first latency was recorded in the
     window.

 count:6
     The number of times a latency was detected during the window.

 nmi-total:7 nmi-count:1
     On architectures that support it, if an NMI comes in during the
     test, the time spent in NMI is reported in "nmi-total" (in
     microseconds).

     All architectures that have NMIs will show the "nmi-count" if an
     NMI comes in during the test.

hwlat files:

  tracing_threshold
     This gets automatically set to "10" to represent 10
     microseconds. This is the threshold of latency that
     needs to be detected before the trace will be recorded.

     Note, when the hwlat tracer is finished (another tracer is
     written into "current_tracer"), the original value for
     tracing_threshold is placed back into this file.

  hwlat_detector/width
     The length of time the test runs with interrupts disabled.

  hwlat_detector/window
     The length of time of the window in which the test
     runs. That is, the test will run for "width"
     microseconds per "window" microseconds.

  tracing_cpumask
     When the test is started, a kernel thread is created that
     runs the test. This thread will alternate between CPUs
     listed in the tracing_cpumask between each period
     (one "window"). To limit the test to specific CPUs,
     set the mask in this file to only the CPUs that the test
     should run on.

function
--------

This tracer is the function tracer. Enabling the function tracer
can be done from the tracefs file system. Make sure the
ftrace_enabled sysctl is set; otherwise this tracer is a nop.
See the "ftrace_enabled" section below.
::

  # sysctl kernel.ftrace_enabled=1
  # echo function > current_tracer
  # echo 1 > tracing_on
  # usleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 24799/24799   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
  #              | |       |   ||||      |         |
         bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
         bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
         bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
         bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
         bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
         bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
         bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
         bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
  [...]


Note: the function tracer uses ring buffers to store the above
entries. The newest data may overwrite the oldest data.
Sometimes using echo to stop the trace is not sufficient because
the tracing could have overwritten the data that you wanted to
record. For this reason, it is sometimes better to disable
tracing directly from a program. This allows you to stop the
tracing at the point that you hit the part that you are
interested in. To disable the tracing directly from a C program,
something like the following code snippet can be used::

	int trace_fd;
	[...]
	int main(int argc, char *argv[]) {
		[...]
		trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
		[...]
		if (condition_hit()) {
			write(trace_fd, "0", 1);
		}
		[...]
	}


Single thread tracing
---------------------

By writing into set_ftrace_pid you can trace a
single thread.
For example:: 2317*4882a593Smuzhiyun 2318*4882a593Smuzhiyun # cat set_ftrace_pid 2319*4882a593Smuzhiyun no pid 2320*4882a593Smuzhiyun # echo 3111 > set_ftrace_pid 2321*4882a593Smuzhiyun # cat set_ftrace_pid 2322*4882a593Smuzhiyun 3111 2323*4882a593Smuzhiyun # echo function > current_tracer 2324*4882a593Smuzhiyun # cat trace | head 2325*4882a593Smuzhiyun # tracer: function 2326*4882a593Smuzhiyun # 2327*4882a593Smuzhiyun # TASK-PID CPU# TIMESTAMP FUNCTION 2328*4882a593Smuzhiyun # | | | | | 2329*4882a593Smuzhiyun yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return 2330*4882a593Smuzhiyun yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range 2331*4882a593Smuzhiyun yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel 2332*4882a593Smuzhiyun yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel 2333*4882a593Smuzhiyun yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll 2334*4882a593Smuzhiyun yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll 2335*4882a593Smuzhiyun # echo > set_ftrace_pid 2336*4882a593Smuzhiyun # cat trace |head 2337*4882a593Smuzhiyun # tracer: function 2338*4882a593Smuzhiyun # 2339*4882a593Smuzhiyun # TASK-PID CPU# TIMESTAMP FUNCTION 2340*4882a593Smuzhiyun # | | | | | 2341*4882a593Smuzhiyun ##### CPU 3 buffer started #### 2342*4882a593Smuzhiyun yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait 2343*4882a593Smuzhiyun yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry 2344*4882a593Smuzhiyun yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry 2345*4882a593Smuzhiyun yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit 2346*4882a593Smuzhiyun yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit 2347*4882a593Smuzhiyun 2348*4882a593SmuzhiyunIf you want to trace a function when executing, you could use 2349*4882a593Smuzhiyunsomething like this simple program. 
2350*4882a593Smuzhiyun:: 2351*4882a593Smuzhiyun 2352*4882a593Smuzhiyun #include <stdio.h> 2353*4882a593Smuzhiyun #include <stdlib.h> 2354*4882a593Smuzhiyun #include <sys/types.h> 2355*4882a593Smuzhiyun #include <sys/stat.h> 2356*4882a593Smuzhiyun #include <fcntl.h> 2357*4882a593Smuzhiyun #include <unistd.h> 2358*4882a593Smuzhiyun #include <string.h> 2359*4882a593Smuzhiyun 2360*4882a593Smuzhiyun #define _STR(x) #x 2361*4882a593Smuzhiyun #define STR(x) _STR(x) 2362*4882a593Smuzhiyun #define MAX_PATH 256 2363*4882a593Smuzhiyun 2364*4882a593Smuzhiyun const char *find_tracefs(void) 2365*4882a593Smuzhiyun { 2366*4882a593Smuzhiyun static char tracefs[MAX_PATH+1]; 2367*4882a593Smuzhiyun static int tracefs_found; 2368*4882a593Smuzhiyun char type[100]; 2369*4882a593Smuzhiyun FILE *fp; 2370*4882a593Smuzhiyun 2371*4882a593Smuzhiyun if (tracefs_found) 2372*4882a593Smuzhiyun return tracefs; 2373*4882a593Smuzhiyun 2374*4882a593Smuzhiyun if ((fp = fopen("/proc/mounts","r")) == NULL) { 2375*4882a593Smuzhiyun perror("/proc/mounts"); 2376*4882a593Smuzhiyun return NULL; 2377*4882a593Smuzhiyun } 2378*4882a593Smuzhiyun 2379*4882a593Smuzhiyun while (fscanf(fp, "%*s %" 2380*4882a593Smuzhiyun STR(MAX_PATH) 2381*4882a593Smuzhiyun "s %99s %*s %*d %*d\n", 2382*4882a593Smuzhiyun tracefs, type) == 2) { 2383*4882a593Smuzhiyun if (strcmp(type, "tracefs") == 0) 2384*4882a593Smuzhiyun break; 2385*4882a593Smuzhiyun } 2386*4882a593Smuzhiyun fclose(fp); 2387*4882a593Smuzhiyun 2388*4882a593Smuzhiyun if (strcmp(type, "tracefs") != 0) { 2389*4882a593Smuzhiyun fprintf(stderr, "tracefs not mounted"); 2390*4882a593Smuzhiyun return NULL; 2391*4882a593Smuzhiyun } 2392*4882a593Smuzhiyun 2393*4882a593Smuzhiyun strcat(tracefs, "/tracing/"); 2394*4882a593Smuzhiyun tracefs_found = 1; 2395*4882a593Smuzhiyun 2396*4882a593Smuzhiyun return tracefs; 2397*4882a593Smuzhiyun } 2398*4882a593Smuzhiyun 2399*4882a593Smuzhiyun const char *tracing_file(const char *file_name) 2400*4882a593Smuzhiyun { 2401*4882a593Smuzhiyun 
         static char trace_file[MAX_PATH+1];
         snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
         return trace_file;
  }

  int main (int argc, char **argv)
  {
          if (argc < 2)
                  exit(-1);

          if (fork() > 0) {
                  int fd, ffd;
                  char line[64];
                  int s;

                  ffd = open(tracing_file("current_tracer"), O_WRONLY);
                  if (ffd < 0)
                          exit(-1);
                  write(ffd, "nop", 3);

                  fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
                  s = sprintf(line, "%d\n", getpid());
                  write(fd, line, s);

                  write(ffd, "function", 8);

                  close(fd);
                  close(ffd);

                  execvp(argv[1], argv+1);
          }

          return 0;
  }

Or this simple script!
::

  #!/bin/bash

  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
  echo nop > $tracefs/current_tracer
  echo 0 > $tracefs/tracing_on
  echo $$ > $tracefs/set_ftrace_pid
  echo function > $tracefs/current_tracer
  echo 1 > $tracefs/tracing_on
  exec "$@"


function graph tracer
---------------------

This tracer is similar to the function tracer except that it
probes a function on its entry and its exit. This is done by
using a dynamically allocated stack of return addresses in each
task_struct. On function entry the tracer overwrites the return
address of each function traced to set a custom probe. Thus the
original return address is stored on the stack of return
addresses in the task_struct.

Probing on both ends of a function leads to special features
such as:

- measuring a function's execution time
- having a reliable call stack from which to draw a graph of
  function calls

This tracer is useful in several situations:

- you want to find the reason for strange kernel behavior and
  need to see in detail what happens in any area (or a specific
  one).
- you are experiencing weird latencies but it is difficult to
  find their origin.

- you want to quickly find which path is taken by a specific
  function.

- you just want to peek inside a working kernel and see what
  happens there.

::

  # tracer: function_graph
  #
  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |

   0)               |  sys_open() {
   0)               |    do_sys_open() {
   0)               |      getname() {
   0)               |        kmem_cache_alloc() {
   0)   1.382 us    |          __might_sleep();
   0)   2.478 us    |        }
   0)               |        strncpy_from_user() {
   0)               |          might_fault() {
   0)   1.389 us    |            __might_sleep();
   0)   2.553 us    |          }
   0)   3.807 us    |        }
   0)   7.876 us    |      }
   0)               |      alloc_fd() {
   0)   0.668 us    |        _spin_lock();
   0)   0.570 us    |        expand_files();
   0)   0.586 us    |        _spin_unlock();


There are several columns that can be dynamically
enabled/disabled. You can use every combination of options you
want, depending on your needs.

- The cpu number on which the function executed is enabled by
  default.
  It is sometimes better to trace only one cpu (see the
  tracing_cpu_mask file); otherwise you may see out-of-order
  function calls when the trace switches between CPUs.

  - hide: echo nofuncgraph-cpu > trace_options
  - show: echo funcgraph-cpu > trace_options

- The duration (the function's execution time) is displayed on
  the closing bracket line of a function, or on the same line as
  the function itself in the case of a leaf function. It is
  enabled by default.

  - hide: echo nofuncgraph-duration > trace_options
  - show: echo funcgraph-duration > trace_options

- The overhead field precedes the duration field when a duration
  threshold is reached.

  - hide: echo nofuncgraph-overhead > trace_options
  - show: echo funcgraph-overhead > trace_options
  - depends on: funcgraph-duration

  e.g.::

    3) # 1837.709 us |          } /* __switch_to */
    3)               |          finish_task_switch() {
    3)   0.313 us    |            _raw_spin_unlock_irq();
    3)   3.177 us    |          }
    3) # 1889.063 us |        } /* __schedule */
    3) ! 140.417 us  |      } /* __schedule */
    3) # 2034.948 us |    } /* schedule */
    3) * 33998.59 us |  } /* schedule_preempt_disabled */

  [...]
    1)   0.260 us    |        msecs_to_jiffies();
    1)   0.313 us    |        __rcu_read_unlock();
    1) + 61.770 us   |      }
    1) + 64.479 us   |    }
    1)   0.313 us    |    rcu_bh_qs();
    1)   0.313 us    |    __local_bh_enable();
    1) ! 217.240 us  |  }
    1)   0.365 us    |  idle_cpu();
    1)               |  rcu_irq_exit() {
    1)   0.417 us    |    rcu_eqs_enter_common.isra.47();
    1)   3.125 us    |  }
    1) ! 227.812 us  |  }
    1) ! 457.395 us  |  }
    1) @ 119760.2 us |  }

  [...]

    2)               |    handle_IPI() {
    1)   6.979 us    |  }
    2)   0.417 us    |      scheduler_ipi();
    1)   9.791 us    |  }
    1) + 12.917 us   |  }
    2)   3.490 us    |  }
    1) + 15.729 us   |  }
    1) + 18.542 us   |  }
    2) $ 3594274 us  |  }

Flags::

  + means that the function exceeded 10 usecs.
  ! means that the function exceeded 100 usecs.
  # means that the function exceeded 1000 usecs.
  * means that the function exceeded 10 msecs.
  @ means that the function exceeded 100 msecs.
  $ means that the function exceeded 1 sec.


- The task/pid field displays the cmdline and pid of the thread
  that executed the function. It is disabled by default.
  - hide: echo nofuncgraph-proc > trace_options
  - show: echo funcgraph-proc > trace_options

  e.g.::

    # tracer: function_graph
    #
    # CPU  TASK/PID        DURATION                  FUNCTION CALLS
    # |    |    |           |   |                     |   |   |   |
    0)    sh-4802     |               |                  d_free() {
    0)    sh-4802     |               |                    call_rcu() {
    0)    sh-4802     |               |                      __call_rcu() {
    0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
    0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
    0)    sh-4802     |   2.899 us    |                      }
    0)    sh-4802     |   4.040 us    |                    }
    0)    sh-4802     |   5.151 us    |                  }
    0)    sh-4802     | + 49.370 us   |                }


- The absolute time field is an absolute timestamp given by the
  system clock since it started.
  A snapshot of this time is
  recorded on each entry and exit of a function.

  - hide: echo nofuncgraph-abstime > trace_options
  - show: echo funcgraph-abstime > trace_options

  e.g.::

    #
    #      TIME       CPU  DURATION                  FUNCTION CALLS
    #       |         |     |   |                     |   |   |   |
    360.774522 |   1)   0.541 us    |                                          }
    360.774522 |   1)   4.663 us    |                                        }
    360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
    360.774524 |   1)   6.796 us    |                                      }
    360.774524 |   1)   7.952 us    |                                    }
    360.774525 |   1)   9.063 us    |                                  }
    360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
    360.774527 |   1)   0.578 us    |                                  __brelse();
    360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
    360.774528 |   1)               |                                    unlock_buffer() {
    360.774529 |   1)               |                                      wake_up_bit() {
    360.774529 |   1)               |                                        bit_waitqueue() {
    360.774530 |   1)   0.594 us    |                                          __phys_addr();


The function name is always displayed after the closing bracket
of a function if the start of that function is not in the
trace buffer.

Display of the function name after the closing bracket may be
enabled for functions whose start is in the trace buffer,
allowing easier searching of function durations with grep.
It is disabled by default.
  - hide: echo nofuncgraph-tail > trace_options
  - show: echo funcgraph-tail > trace_options

  Example with nofuncgraph-tail (default)::

    0)               |      putname() {
    0)               |        kmem_cache_free() {
    0)   0.518 us    |          __phys_addr();
    0)   1.757 us    |        }
    0)   2.861 us    |      }

  Example with funcgraph-tail::

    0)               |      putname() {
    0)               |        kmem_cache_free() {
    0)   0.518 us    |          __phys_addr();
    0)   1.757 us    |        } /* kmem_cache_free() */
    0)   2.861 us    |      } /* putname() */

You can add comments to specific functions by using
trace_printk(). For example, if you want to put a comment inside
the __might_sleep() function, you just have to include
<linux/ftrace.h> and call trace_printk() inside __might_sleep()::

  trace_printk("I'm a comment!\n")

will produce::

  1)               |  __might_sleep() {
  1)               |    /* I'm a comment! */
  1)   1.449 us    |  }

You might find other useful features for this tracer in the
following "dynamic ftrace" section, such as tracing only specific
functions or tasks.
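As a recap of this section, a minimal function_graph session can
be driven entirely from the tracefs directory. This is only a
sketch: it assumes tracefs is mounted at /sys/kernel/tracing and
must be run as root; max_graph_depth is optional and simply
limits how deep the traced call nesting goes (0 means no limit)::

  # cd /sys/kernel/tracing
  # echo function_graph > current_tracer
  # echo 3 > max_graph_depth
  # cat trace | head -20
  # echo nop > current_tracer
  # echo 0 > max_graph_depth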
dynamic ftrace
--------------

If CONFIG_DYNAMIC_FTRACE is set, the system will run with
virtually no overhead when function tracing is disabled. The way
this works is that the mcount function call (placed at the start
of every kernel function, produced by the -pg switch in gcc)
starts off pointing to a simple return. (Enabling FTRACE will
include the -pg switch when compiling the kernel.)

At compile time every C file object is run through the
recordmcount program (located in the scripts directory). This
program will parse the ELF headers in the C object to find all
the locations in the .text section that call mcount. Starting
with gcc version 4.6, the -mfentry switch has been added for x86,
which calls "__fentry__" instead of "mcount"; it is called before
the stack frame is created.

Note, not all functions are traced. Some may be excluded by a
notrace annotation or blocked in another way, and inline
functions are never traced. Check the
"available_filter_functions" file to see which functions can be
traced.

A section called "__mcount_loc" is created that holds
references to all the mcount/fentry call sites in the .text
section. The recordmcount program re-links this section back
into the original object. The final linking stage of the kernel
will add all these references into a single table.
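The mcount/fentry call sites that recordmcount looks for can be
seen in the compiler output for any small C file. This is only a
sketch; the exact call name and surrounding assembly differ by
compiler version and architecture, and pg.c is just a scratch
file name::

  $ echo 'void foo(void) {}' > pg.c
  $ gcc -pg -S -o - pg.c | grep -E 'mcount|fentry'

On x86 with -mfentry the match is a "call __fentry__" placed
before the stack frame is set up; older compilers instead emit a
"call mcount" after it.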
On boot up, before SMP is initialized, the dynamic ftrace code
scans this table and patches all the locations into nops. It
also records the locations, which are added to the
available_filter_functions list. Modules are processed as they
are loaded and before they are executed. When a module is
unloaded, it also removes its functions from the ftrace function
list. This is automatic in the module unload code, and the
module author does not need to worry about it.

When tracing is enabled, the process of modifying the function
tracepoints is architecture dependent. The old method is to use
kstop_machine to prevent races with the CPUs executing the code
being modified (which can cause a CPU to do undesirable things,
especially if the modified code crosses cache (or page)
boundaries), and the nops are patched back to calls. But this
time, they do not call mcount (which is just a function stub).
They now call into the ftrace infrastructure.

The new method of modifying the function tracepoints is to place
a breakpoint at the location to be modified, sync all CPUs,
modify the rest of the instruction not covered by the
breakpoint, sync all CPUs again, and then replace the breakpoint
with the finished version of the ftrace call site.
Some archs do not even need to monkey around with the
synchronization, and can just slap the new code on top of the
old without any problems with other CPUs executing it at the
same time.

One special side effect of recording the functions being traced
is that we can now selectively choose which functions we wish to
trace and which ones we want the mcount calls to remain as nops.

Two files are used, one for enabling and one for disabling the
tracing of specified functions. They are:

  set_ftrace_filter

and

  set_ftrace_notrace

A list of available functions that you can add to these files is
listed in:

  available_filter_functions

::

  # cat available_filter_functions
  put_prev_task_idle
  kmem_cache_create
  pick_next_task_rt
  get_online_cpus
  pick_next_task_fair
  mutex_lock
  [...]
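Since available_filter_functions can contain tens of thousands
of entries, ordinary text tools are handy for exploring it
before writing names into the filter files (a sketch; the names
available depend entirely on the running kernel)::

  # wc -l < available_filter_functions
  # grep '^hrtimer_' available_filter_functions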
If I am only interested in sys_nanosleep and hrtimer_interrupt::

  # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
  # echo function > current_tracer
  # echo 1 > tracing_on
  # usleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 5/5   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
            <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
            usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
            <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
            <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt

To see which functions are being traced, you can cat the file::

  # cat set_ftrace_filter
  hrtimer_interrupt
  sys_nanosleep


Perhaps this is not enough. The filters also allow glob(7) matching.
  ``<match>*``
        will match functions that begin with <match>
  ``*<match>``
        will match functions that end with <match>
  ``*<match>*``
        will match functions that have <match> in it
  ``<match1>*<match2>``
        will match functions that begin with <match1> and end with <match2>

.. note::

      It is better to use quotes to enclose the wildcards,
      otherwise the shell may expand the parameters into the names
      of files in the local directory.

::

  # echo 'hrtimer_*' > set_ftrace_filter

Produces::

  # tracer: function
  #
  # entries-in-buffer/entries-written: 897/897   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
            <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
            <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
            <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
            <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
            <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
            <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
            <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem

Notice that we lost sys_nanosleep.
::

  # cat set_ftrace_filter
  hrtimer_run_queues
  hrtimer_run_pending
  hrtimer_init
  hrtimer_cancel
  hrtimer_try_to_cancel
  hrtimer_forward
  hrtimer_start
  hrtimer_reprogram
  hrtimer_force_reprogram
  hrtimer_get_next_event
  hrtimer_interrupt
  hrtimer_nanosleep
  hrtimer_wakeup
  hrtimer_get_remaining
  hrtimer_get_res
  hrtimer_init_sleeper


This is because '>' and '>>' act just like they do in bash:
to overwrite the filter, use '>'; to append to the filter,
use '>>'.

To clear out a filter so that all functions will be recorded
again::

  # echo > set_ftrace_filter
  # cat set_ftrace_filter
  #

Again, now we want to append.
::

  # echo sys_nanosleep > set_ftrace_filter
  # cat set_ftrace_filter
  sys_nanosleep
  # echo 'hrtimer_*' >> set_ftrace_filter
  # cat set_ftrace_filter
  hrtimer_run_queues
  hrtimer_run_pending
  hrtimer_init
  hrtimer_cancel
  hrtimer_try_to_cancel
  hrtimer_forward
  hrtimer_start
  hrtimer_reprogram
  hrtimer_force_reprogram
  hrtimer_get_next_event
  hrtimer_interrupt
  sys_nanosleep
  hrtimer_nanosleep
  hrtimer_wakeup
  hrtimer_get_remaining
  hrtimer_get_res
  hrtimer_init_sleeper


The set_ftrace_notrace file prevents the listed functions from
being traced.
::

  # echo '*preempt*' '*lock*' > set_ftrace_notrace

Produces::

  # tracer: function
  #
  # entries-in-buffer/entries-written: 39608/39608   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
              bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
              bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
              bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
              bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
              bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
              bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
              bash-1994  [000] ....  4342.324899: do_truncate <-do_last
              bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
              bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
              bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
              bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
              bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time

We can see that there is no more lock or preempt tracing.

Selecting function filters via index
------------------------------------

Because processing of strings is expensive (the address of the
function needs to be looked up before comparing to the string
being passed in), an index can be used as well to enable
functions. This is useful in the case of setting thousands of
specific functions at a time. By passing in a list of numbers,
no string processing will occur. Instead, the function at the
given position in the internal array (which corresponds to the
functions in the "available_filter_functions" file) is selected.
::

  # echo 1 > set_ftrace_filter

will select the first function listed in
"available_filter_functions":

::

  # head -1 available_filter_functions
  trace_initcall_finish_cb

  # cat set_ftrace_filter
  trace_initcall_finish_cb

  # head -50 available_filter_functions | tail -1
  x86_pmu_commit_txn

  # echo 1 50 > set_ftrace_filter
  # cat set_ftrace_filter
  trace_initcall_finish_cb
  x86_pmu_commit_txn

Dynamic ftrace with the function graph tracer
---------------------------------------------

Although what has been explained above concerns both the
function tracer and the function-graph tracer, there are some
special features only available in the function-graph tracer.
If you want to trace only one function and all of its children,
you just have to echo its name into set_graph_function::

  echo __do_fault > set_graph_function

will produce the following "expanded" trace of the __do_fault()
function::

   0)               |  __do_fault() {
   0)               |    filemap_fault() {
   0)               |      find_lock_page() {
   0)   0.804 us    |        find_get_page();
   0)               |        __might_sleep() {
   0)   1.329 us    |        }
   0)   3.904 us    |      }
   0)   4.979 us    |    }
   0)   0.653 us    |    _spin_lock();
   0)   0.578 us    |    page_add_file_rmap();
   0)   0.525 us    |    native_set_pte_at();
   0)   0.585 us    |    _spin_unlock();
   0)               |    unlock_page() {
   0)   0.541 us    |      page_waitqueue();
   0)   0.639 us    |      __wake_up_bit();
   0)   2.786 us    |    }
   0) + 14.237 us   |  }
   0)               |  __do_fault() {
   0)               |    filemap_fault() {
   0)               |      find_lock_page() {
   0)   0.698 us    |        find_get_page();
   0)               |        __might_sleep() {
   0)   1.412 us    |        }
   0)   3.950 us    |      }
   0)   5.098 us    |    }
   0)   0.631 us    |    _spin_lock();
   0)   0.571 us    |    page_add_file_rmap();
   0)   0.526 us    |    native_set_pte_at();
   0)   0.586 us    |    _spin_unlock();
   0)               |    unlock_page() {
   0)   0.533 us    |      page_waitqueue();
   0)   0.638 us    |      __wake_up_bit();
   0)   2.793 us    |    }
   0) + 14.012 us   |  }

You can also expand several functions at once::

  echo sys_open > set_graph_function
  echo sys_close >> set_graph_function

Now if you want to go back to tracing all functions you can
clear this special filter via::

  echo > set_graph_function


ftrace_enabled
--------------

Note, the proc sysctl ftrace_enabled is a big on/off switch for
the function tracer. By default it is enabled (when function
tracing is enabled in the kernel). If it is disabled, all
function tracing is disabled. This includes not only the
function tracers for ftrace, but also any other uses (perf,
kprobes, stack tracing, profiling, etc). It cannot be disabled
if a callback with FTRACE_OPS_FL_PERMANENT set is registered.

Please disable this with care.

This can be disabled (and enabled) with::

  sysctl kernel.ftrace_enabled=0
  sysctl kernel.ftrace_enabled=1

or with::

  echo 0 > /proc/sys/kernel/ftrace_enabled
  echo 1 > /proc/sys/kernel/ftrace_enabled


Filter commands
---------------

A few commands are supported by the set_ftrace_filter interface.
Trace commands have the following format::

  <function>:<command>:<parameter>

The following commands are supported:

- mod:
  This command enables function filtering per module. The
  parameter defines the module. For example, if only the write*
  functions in the ext3 module are desired, run::

    echo 'write*:mod:ext3' > set_ftrace_filter

  This command interacts with the filter in the same way as
  filtering based on function names. Thus, adding more functions
  in a different module is accomplished by appending (>>) to the
  filter file. Remove specific module functions by prepending
  '!'::

    echo '!writeback*:mod:ext3' >> set_ftrace_filter

  The mod command supports module globbing.
  Disable tracing for all
  functions except a specific module::

    echo '!*:mod:!ext3' >> set_ftrace_filter

  Disable tracing for all modules, but still trace the kernel::

    echo '!*:mod:*' >> set_ftrace_filter

  Enable the filter only for the kernel::

    echo '*write*:mod:!*' >> set_ftrace_filter

  Enable the filter for module globbing::

    echo '*write*:mod:*snd*' >> set_ftrace_filter

- traceon/traceoff:
  These commands turn tracing on and off when the specified
  functions are hit. The parameter determines how many times the
  tracing system is turned on and off. If unspecified, there is
  no limit. For example, to disable tracing when a schedule bug
  is hit the first 5 times, run::

    echo '__schedule_bug:traceoff:5' > set_ftrace_filter

  To always disable tracing when __schedule_bug is hit::

    echo '__schedule_bug:traceoff' > set_ftrace_filter

  These commands are cumulative whether or not they are appended
  to set_ftrace_filter. To remove a command, prepend it by '!'
  and drop the parameter::

    echo '!__schedule_bug:traceoff:0' > set_ftrace_filter

  The above removes the traceoff command for __schedule_bug
  that has a counter.
  To remove commands without counters::

    echo '!__schedule_bug:traceoff' > set_ftrace_filter

- snapshot:
  This will cause a snapshot to be triggered when the function is hit.
  ::

    echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter

  To only snapshot once:
  ::

    echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter

  To remove the above commands::

    echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
    echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter

- enable_event/disable_event:
  These commands can enable or disable a trace event. Note, because
  function tracing callbacks are very sensitive, when these commands
  are registered, the trace point is activated, but disabled in
  a "soft" mode. That is, the tracepoint will be called, but
  just will not be traced. The event tracepoint stays in this mode
  as long as there's a command that triggers it.
  ::

    echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
          set_ftrace_filter

  The format is::

    <function>:enable_event:<system>:<event>[:count]
    <function>:disable_event:<system>:<event>[:count]

  To remove the event commands::

    echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
          set_ftrace_filter
    echo '!schedule:disable_event:sched:sched_switch' > \
          set_ftrace_filter

- dump:
  When the function is hit, it will dump the contents of the ftrace
  ring buffer to the console. This is useful if you need to debug
  something, and want to dump the trace when a certain function
  is hit. Perhaps it's a function that is called before a triple
  fault happens and does not allow you to get a regular dump.

- cpudump:
  When the function is hit, it will dump the contents of the ftrace
  ring buffer for the current CPU to the console. Unlike the "dump"
  command, it only prints out the contents of the ring buffer for the
  CPU that executed the function that triggered the dump.

- stacktrace:
  When the function is hit, a stack trace is recorded.
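Filter command strings are plain text in the ``<function>:<command>:<parameter>``
shape described above, so they are easy to build in scripts. The helper below
is an illustrative sketch, not part of ftrace: the ``make_filter_cmd`` name is
made up for this example, and actually applying the result requires root and a
tracefs mounted at /sys/kernel/tracing.

```shell
# Hypothetical helper (not an ftrace interface): build a filter command
# string from its parts, following the <function>:<command>:<parameter>
# syntax. The parameter is optional, matching the commands shown above.
make_filter_cmd() {
    func="$1"; cmd="$2"; param="$3"
    if [ -n "$param" ]; then
        printf '%s:%s:%s\n' "$func" "$cmd" "$param"
    else
        printf '%s:%s\n' "$func" "$cmd"
    fi
}

make_filter_cmd __schedule_bug traceoff 5
# To apply it on a live system (root, tracefs mounted):
#   make_filter_cmd __schedule_bug traceoff 5 > /sys/kernel/tracing/set_ftrace_filter
```

Keeping the composition in one place makes it harder to mistype the colon
syntax when several triggers are installed at once.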

trace_pipe
----------

The trace_pipe outputs the same content as the trace file, but
the effect on the tracing is different. Every read from
trace_pipe is consumed. This means that subsequent reads will be
different. The trace is live.
::

  # echo function > current_tracer
  # cat trace_pipe > /tmp/trace.out &
  [1] 4153
  # echo 1 > tracing_on
  # usleep 1
  # echo 0 > tracing_on
  # cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 0/0   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |

  #
  # cat /tmp/trace.out
             bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
             bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
             bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
             bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
             bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
             bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
             bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
             bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
             bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath


Note, reading the trace_pipe file will block until more input is
added. This is contrary to the trace file. If any process opened
the trace file for reading, it will actually disable tracing and
prevent new entries from being added. The trace_pipe file does
not have this limitation.

trace entries
-------------

Having too much or not enough data can be troublesome in
diagnosing an issue in the kernel. The file buffer_size_kb is
used to modify the size of the internal trace buffers. The
number listed is the number of kilobytes that can be recorded
per CPU. To know the full size, multiply the per-CPU size by the
number of possible CPUs.
::

  # cat buffer_size_kb
  1408 (units kilobytes)

Or simply read buffer_total_size_kb
::

  # cat buffer_total_size_kb
  5632

To modify the buffer, simply echo in a number (in 1024-byte segments).
::

  # echo 10000 > buffer_size_kb
  # cat buffer_size_kb
  10000 (units kilobytes)

It will try to allocate as much as possible. If you allocate too
much, it can trigger the Out-Of-Memory killer.
::

  # echo 1000000000000 > buffer_size_kb
  -bash: echo: write error: Cannot allocate memory
  # cat buffer_size_kb
  85

The per_cpu buffers can be changed individually as well:
::

  # echo 10000 > per_cpu/cpu0/buffer_size_kb
  # echo 100 > per_cpu/cpu1/buffer_size_kb

When the per_cpu buffers are not the same, the buffer_size_kb
at the top level will just show an X
::

  # cat buffer_size_kb
  X

This is where the buffer_total_size_kb is useful:
::

  # cat buffer_total_size_kb
  12916

Writing to the top level buffer_size_kb will reset all the buffers
to be the same again.

Snapshot
--------
CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
available to all non-latency tracers.
(Latency tracers which
record max latency, such as "irqsoff" or "wakeup", can't use
this feature, since those are already using the snapshot
mechanism internally.)

Snapshot preserves a current trace buffer at a particular point
in time without stopping tracing. Ftrace swaps the current
buffer with a spare buffer, and tracing continues in the new
current (=previous spare) buffer.

The following tracefs files in "tracing" are related to this
feature:

  snapshot:

    This is used to take a snapshot and to read the output
    of the snapshot. Echo 1 into this file to allocate a
    spare buffer and to take a snapshot (swap), then read
    the snapshot from this file in the same format as
    "trace" (described above in the section "The File
    System"). Reading the snapshot and tracing can proceed
    in parallel. When the spare buffer is allocated, echoing
    0 frees it, and echoing other (positive) values clears
    the snapshot contents.
    More details are shown in the table below.

    +--------------+------------+------------+------------+
    |status\\input |     0      |     1      |    else    |
    +==============+============+============+============+
    |not allocated |(do nothing)| alloc+swap |(do nothing)|
    +--------------+------------+------------+------------+
    |allocated     |    free    |    swap    |   clear    |
    +--------------+------------+------------+------------+

Here is an example of using the snapshot feature.
::

  # echo 1 > events/sched/enable
  # echo 1 > snapshot
  # cat snapshot
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 71/71   #P:8
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
             sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
  [...]
            <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120

  # cat trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 77/77   #P:8
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
   snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
  [...]


If you try to use this snapshot feature when the current tracer is
one of the latency tracers, you will get the following results.
::

  # echo wakeup > current_tracer
  # echo 1 > snapshot
  bash: echo: write error: Device or resource busy
  # cat snapshot
  cat: snapshot: Device or resource busy


Instances
---------
In the tracefs tracing directory is a directory called "instances".
New directories may be created inside it with mkdir, and removed
with rmdir.
The directory created
with mkdir in this directory will already contain files and other
directories after it is created.
::

  # mkdir instances/foo
  # ls instances/foo
  buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
  set_event  snapshot  trace  trace_clock  trace_marker  trace_options
  trace_pipe  tracing_on

As you can see, the new directory looks similar to the tracing directory
itself. In fact, it is very similar, except that the buffer and
events are independent of the main directory and of any other
instances that are created.

The files in the new directory work just like the files with the
same name in the tracing directory except the buffer that is used
is a separate and new buffer. The files affect that buffer but do not
affect the main buffer with the exception of trace_options. Currently,
the trace_options affect all instances and the top level buffer
the same, but this may change in future releases. That is, options
may become specific to the instance they reside in.

Notice that none of the function tracer files are there, nor are
current_tracer and available_tracers. This is because the buffers
can currently only have events enabled for them.
::

  # mkdir instances/foo
  # mkdir instances/bar
  # mkdir instances/zoot
  # echo 100000 > buffer_size_kb
  # echo 1000 > instances/foo/buffer_size_kb
  # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
  # echo function > current_tracer
  # echo 1 > instances/foo/events/sched/sched_wakeup/enable
  # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
  # echo 1 > instances/foo/events/sched/sched_switch/enable
  # echo 1 > instances/bar/events/irq/enable
  # echo 1 > instances/zoot/events/syscalls/enable
  # cat trace_pipe
  CPU:2 [LOST 11745 EVENTS]
              bash-2044  [002] ....  10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
              bash-2044  [002] d...  10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
              bash-2044  [002] d..1  10594.481032: __rmqueue <-get_page_from_freelist
              bash-2044  [002] d..1  10594.481033: _raw_spin_unlock <-get_page_from_freelist
              bash-2044  [002] d..1  10594.481033: sub_preempt_count <-_raw_spin_unlock
              bash-2044  [002] d...  10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
              bash-2044  [002] d...  10594.481034: __mod_zone_page_state <-get_page_from_freelist
              bash-2044  [002] d...  10594.481034: zone_statistics <-get_page_from_freelist
              bash-2044  [002] d...  10594.481034: __inc_zone_state <-zone_statistics
              bash-2044  [002] d...  10594.481034: __inc_zone_state <-zone_statistics
              bash-2044  [002] ....  10594.481035: arch_dup_task_struct <-copy_process
  [...]

  # cat instances/foo/trace_pipe
              bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
              bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
            <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
            <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
       rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
              bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
              bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
              bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
       kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
       kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
  [...]

  # cat instances/bar/trace_pipe
       migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
            <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
              bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
              bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
              bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
              bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
              bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
              bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
              sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
              sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
              sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
              sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
  [...]

  # cat instances/zoot/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 18996/18996   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
              bash-1998  [000] d...   140.733501: sys_write -> 0x2
              bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
              bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
              bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
              bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
              bash-1998  [000] d...   140.733510: sys_close(fd: a)
              bash-1998  [000] d...   140.733510: sys_close -> 0x0
              bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
              bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
              bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
              bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0

You can see that the trace of the topmost trace buffer shows only
the function tracing. The foo instance displays wakeups and task
switches.

To remove the instances, simply delete their directories:
::

  # rmdir instances/foo
  # rmdir instances/bar
  # rmdir instances/zoot

Note, if a process has a trace file open in one of the instance
directories, the rmdir will fail with EBUSY.


Stack trace
-----------
Since the kernel has a fixed-size stack, it is important not to
waste it in functions. A kernel developer must be conscious of
what they allocate on the stack. If they add too much, the system
can be in danger of a stack overflow, and corruption will occur,
usually leading to a system panic.

There are some tools that check this, usually with interrupts
periodically checking usage. But a check performed at every
function call is far more useful. Since ftrace provides a
function tracer, it is convenient to check the stack size at
every function call. This is enabled via the stack tracer.

CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
::

  # echo 1 > /proc/sys/kernel/stack_tracer_enabled

You can also enable it from the kernel command line to trace
the stack size of the kernel during boot up, by adding "stacktrace"
to the kernel command line parameter.

After running it for a few minutes, the output looks like:
::

  # cat stack_max_size
  2928

  # cat stack_trace
          Depth    Size   Location    (18 entries)
          -----    ----   --------
    0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
    1)     2704     160   find_busiest_group+0x31/0x1f1
    2)     2544     256   load_balance+0xd9/0x662
    3)     2288      80   idle_balance+0xbb/0x130
    4)     2208     128   __schedule+0x26e/0x5b9
    5)     2080      16   schedule+0x64/0x66
    6)     2064     128   schedule_timeout+0x34/0xe0
    7)     1936     112   wait_for_common+0x97/0xf1
    8)     1824      16   wait_for_completion+0x1d/0x1f
    9)     1808     128   flush_work+0xfe/0x119
   10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
   11)     1664      48   input_available_p+0x1d/0x5c
   12)     1616      48   n_tty_poll+0x6d/0x134
   13)     1568      64   tty_poll+0x64/0x7f
   14)     1504     880   do_select+0x31e/0x511
   15)      624     400   core_sys_select+0x177/0x216
   16)      224      96   sys_select+0x91/0xb9
   17)      128     128   system_call_fastpath+0x16/0x1b

Note, if -mfentry is being used by gcc, functions get traced before
they set up the stack frame. This means that leaf-level functions
are not tested by the stack tracer when -mfentry is used.

Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
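Output in this format is plain text and easy to post-process. The sketch
below is an illustrative example rather than an ftrace-supplied tool: the
``biggest_frame`` name is made up, and the sample input lines are abbreviated
from the listing above; on a live system you would feed it the stack_trace
file instead.

```shell
# Sketch: report the largest single stack frame in stack_trace-style
# output. Fields per data line: "N)  <depth>  <size>  <location>".
biggest_frame() {
    awk '$1 ~ /^[0-9]+\)$/ && $3 + 0 > max { max = $3; who = $4 }
         END { print max, who }'
}

printf '%s\n' \
  ' 0)     2928     224   update_sd_lb_stats+0xbc/0x4ac' \
  '14)     1504     880   do_select+0x31e/0x511' \
  '15)      624     400   core_sys_select+0x177/0x216' | biggest_frame
# prints: 880 do_select+0x31e/0x511

# On a live system (root, tracefs mounted):
#   biggest_frame < /sys/kernel/tracing/stack_trace
```

The header lines ("Depth  Size  Location") are skipped automatically
because their first field does not look like an "N)" index.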

More
----
More details can be found in the source code, in the
`kernel/trace/*.c` files.