xref: /OK3568_Linux_fs/kernel/tools/perf/design.txt (revision 4882a59341e53eb6f0b4789bf948001014eff981)

Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events, such
as instructions executed, cache misses suffered, or branches mis-predicted -
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events have passed - and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters, counter
groups, and it provides event capabilities on top of those.  It
provides "virtual" 64-bit counters, regardless of the width of the
underlying hardware counters.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the sys_perf_event_open()
system call:

   int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
			     pid_t pid, int cpu, int group_fd,
			     unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'perf_event_attr' is:

struct perf_event_attr {
        /*
         * The MSB of the config word signifies if the rest contains cpu
         * specific (raw) counter configuration data; if unset, the next
         * 7 bits are an event type and the rest of the bits are the event
         * identifier.
         */
        __u64                   config;

        __u64                   irq_period;
        __u32                   record_type;
        __u32                   read_format;

        __u64                   disabled       :  1, /* off by default        */
                                inherit        :  1, /* children inherit it   */
                                pinned         :  1, /* must always be on PMU */
                                exclusive      :  1, /* only group on PMU     */
                                exclude_user   :  1, /* don't count user      */
                                exclude_kernel :  1, /* ditto kernel          */
                                exclude_hv     :  1, /* ditto hypervisor      */
                                exclude_idle   :  1, /* don't count when idle */
                                mmap           :  1, /* include mmap data     */
                                munmap         :  1, /* include munmap data   */
                                comm           :  1, /* include comm data     */

                                __reserved_1   : 52;

        __u32                   extra_config_len;
        __u32                   wakeup_events;  /* wakeup every n events */

        __u64                   __reserved_2;
        __u64                   __reserved_3;
};

The 'config' field specifies what the counter should count.  It
is divided into 3 bit-fields:

raw_type: 1 bit   (most significant bit)	0x8000_0000_0000_0000
type:	  7 bits  (next most significant)	0x7f00_0000_0000_0000
event_id: 56 bits (least significant)		0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of 'config'.  The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_type_id {
	PERF_TYPE_HARDWARE		= 0,
	PERF_TYPE_SOFTWARE		= 1,
	PERF_TYPE_TRACEPOINT		= 2,
};

A counter of PERF_TYPE_HARDWARE will count the hardware event
specified by 'event_id':

/*
 * Generalized performance counter event types, used by the hw_event.event_id
 * parameter of the sys_perf_event_open() syscall:
 */
enum perf_hw_id {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_HW_CPU_CYCLES		= 0,
	PERF_COUNT_HW_INSTRUCTIONS		= 1,
	PERF_COUNT_HW_CACHE_REFERENCES		= 2,
	PERF_COUNT_HW_CACHE_MISSES		= 3,
	PERF_COUNT_HW_BRANCH_INSTRUCTIONS	= 4,
	PERF_COUNT_HW_BRANCH_MISSES		= 5,
	PERF_COUNT_HW_BUS_CYCLES		= 6,
};

These are standardized types of events that work relatively uniformly
on all CPUs that implement Performance Counters support under Linux,
although there may be variations (e.g., different CPUs might count
cache references and misses at different levels of the cache hierarchy).
If a CPU is not able to count the selected event, then the system call
will return -EINVAL.

More hw_event_types are supported as well, but they are CPU-specific
and accessed as raw events.  For example, to count "External bus
cycles while bus lock signal asserted" events on Intel Core CPUs, pass
in a 0x4064 event_id value and set hw_event.raw_type to 1.

A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum perf_sw_ids {
	PERF_COUNT_SW_CPU_CLOCK		= 0,
	PERF_COUNT_SW_TASK_CLOCK	= 1,
	PERF_COUNT_SW_PAGE_FAULTS	= 2,
	PERF_COUNT_SW_CONTEXT_SWITCHES	= 3,
	PERF_COUNT_SW_CPU_MIGRATIONS	= 4,
	PERF_COUNT_SW_PAGE_FAULTS_MIN	= 5,
	PERF_COUNT_SW_PAGE_FAULTS_MAJ	= 6,
	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
};

Counters of the type PERF_TYPE_TRACEPOINT are available when the ftrace event
tracer is enabled, and event_id values can be obtained from
/debug/tracing/events/*/*/id


Counters come in two flavours: counting counters and sampling
counters.  A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0.


A read() on a counter returns the current value of the counter and possibly
additional values as specified by 'read_format'; each value is a u64 (8 bytes)
in size.

/*
 * Bits that can be set in hw_event.read_format to request that
 * reads on the counter should return the indicated quantities,
 * in increasing order of bit value, after the counter value.
 */
enum perf_event_read_format {
        PERF_FORMAT_TOTAL_TIME_ENABLED  =  1,
        PERF_FORMAT_TOTAL_TIME_RUNNING  =  2,
};

Using these additional values one can establish the overcommit ratio for a
particular counter, allowing one to take the round-robin scheduling effect
into account.


A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'.  A sampling counter
has irq_period > 0. The record_type controls what data is recorded on each
interrupt:

/*
 * Bits that can be set in hw_event.record_type to request information
 * in the overflow packets.
 */
enum perf_event_record_format {
        PERF_RECORD_IP          = 1U << 0,
        PERF_RECORD_TID         = 1U << 1,
        PERF_RECORD_TIME        = 1U << 2,
        PERF_RECORD_ADDR        = 1U << 3,
        PERF_RECORD_GROUP       = 1U << 4,
        PERF_RECORD_CALLCHAIN   = 1U << 5,
};

Such (and other) events will be recorded in a ring-buffer, which is
available to user-space using mmap() (see below).

The 'disabled' bit specifies whether the counter starts out disabled
or enabled.  If it is initially disabled, it can be enabled by ioctl
or prctl (see below).

The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified.  This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible.  It only applies to hardware counters
and only to group leaders.  If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to request that counting of events be restricted to times when the
CPU is in user, kernel and/or hypervisor mode.

Furthermore, the 'exclude_host' and 'exclude_guest' bits provide a way
to request counting of events restricted to guest and host contexts when
using Linux as the hypervisor.

The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
operations.  These can be used to relate userspace IP addresses to actual
code, even after the mapping (or even the whole process) is gone.
These events are recorded in the ring-buffer (see below).

The 'comm' bit allows tracking of process comm data on process creation.
This too is recorded in the ring-buffer (see below).

The 'pid' parameter to the sys_perf_event_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
privilege.

The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up.  A
counter group has one counter which is the group "leader".  The leader
is created first, with group_fd = -1 in the sys_perf_event_open call
that creates it.  The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU.  This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.


As stated above, asynchronous events, such as counter overflow or PROT_EXEC
mmap tracking, are logged into a ring-buffer. This ring-buffer is created and
accessed through mmap().

The mmap size should be 1+2^n pages, where the first page is a meta-data page
(struct perf_event_mmap_page) that contains various bits of information such
as where the ring-buffer head is.

/*
 * Structure of the page that can be mapped via mmap
 */
struct perf_event_mmap_page {
        __u32   version;                /* version number of this structure */
        __u32   compat_version;         /* lowest version this is compat with */

        /*
         * Bits needed to read the hw counters in user-space.
         *
         *   u32 seq;
         *   s64 count;
         *
         *   do {
         *     seq = pc->lock;
         *
         *     barrier()
         *     if (pc->index) {
         *       count = pmc_read(pc->index - 1);
         *       count += pc->offset;
         *     } else
         *       goto regular_read;
         *
         *     barrier();
         *   } while (pc->lock != seq);
         *
         * NOTE: for obvious reasons this only works on self-monitoring
         *       processes.
         */
        __u32   lock;                   /* seqlock for synchronization */
        __u32   index;                  /* hardware counter identifier */
        __s64   offset;                 /* add to hardware counter value */

        /*
         * Control data for the mmap() data buffer.
         *
         * User-space reading this value should issue an rmb(), on SMP capable
         * platforms, after reading this value -- see perf_event_wakeup().
         */
        __u32   data_head;              /* head in the data section */
};

NOTE: the hw-counter userspace bits are arch specific and are currently only
      implemented on powerpc.

The following 2^n pages are the ring-buffer which contains events of the form:

#define PERF_RECORD_MISC_KERNEL          (1 << 0)
#define PERF_RECORD_MISC_USER            (1 << 1)
#define PERF_RECORD_MISC_OVERFLOW        (1 << 2)

struct perf_event_header {
        __u32   type;
        __u16   misc;
        __u16   size;
};

enum perf_event_type {

        /*
         * The MMAP events record the PROT_EXEC mappings so that we can
         * correlate userspace IPs to code. They have the following structure:
         *
         * struct {
         *      struct perf_event_header        header;
         *
         *      u32                             pid, tid;
         *      u64                             addr;
         *      u64                             len;
         *      u64                             pgoff;
         *      char                            filename[];
         * };
         */
        PERF_RECORD_MMAP                 = 1,
        PERF_RECORD_MUNMAP               = 2,

        /*
         * struct {
         *      struct perf_event_header        header;
         *
         *      u32                             pid, tid;
         *      char                            comm[];
         * };
         */
        PERF_RECORD_COMM                 = 3,

        /*
         * When header.misc & PERF_RECORD_MISC_OVERFLOW the event_type field
         * will be PERF_RECORD_*
         *
         * struct {
         *      struct perf_event_header        header;
         *
         *      { u64                   ip;       } && PERF_RECORD_IP
         *      { u32                   pid, tid; } && PERF_RECORD_TID
         *      { u64                   time;     } && PERF_RECORD_TIME
         *      { u64                   addr;     } && PERF_RECORD_ADDR
         *
         *      { u64                   nr;
         *        { u64 event, val; }   cnt[nr];  } && PERF_RECORD_GROUP
         *
         *      { u16                   nr,
         *                              hv,
         *                              kernel,
         *                              user;
         *        u64                   ips[nr];  } && PERF_RECORD_CALLCHAIN
         * };
         */
};

NOTE: PERF_RECORD_CALLCHAIN is arch specific and currently only implemented
      on x86.

Notification of new events is possible through poll()/select()/epoll() and
fcntl() managing signals.

Normally a notification is generated for every page filled; however, one can
additionally set perf_event_attr.wakeup_events to generate one every
so many counter overflow events.

Future work will include a splice() interface to the ring-buffer.


Counters can be enabled and disabled in two ways: via ioctl and via
prctl.  When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter can be enabled with

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

or disabled with

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

For a counter group, pass PERF_IOC_FLAG_GROUP as the third argument.
Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count.  Enabling or disabling a member of a
group other than the leader only affects that counter - disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.

Additionally, non-inherited overflow counters can use

	ioctl(fd, PERF_EVENT_IOC_REFRESH, nr);

to enable a counter for 'nr' events, after which it gets disabled again.

A process can enable or disable all the counter groups that are
attached to it, using prctl:

	prctl(PR_TASK_PERF_EVENTS_ENABLE);

	prctl(PR_TASK_PERF_EVENTS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes.  It only enables or
disables the group leaders, not any other members in the groups.


Arch requirements
-----------------

If your architecture does not have hardware performance metrics, you can
still use the generic software counters based on hrtimers for sampling.

So to start with, in order to add HAVE_PERF_EVENTS to your Kconfig, you
will need at least this:
	- asm/perf_event.h - a basic stub will suffice at first
	- support for atomic64 types (and associated helper functions)

If your architecture does have hardware capabilities, you can override the
weak stub hw_perf_event_init() to register hardware counters.

Architectures that have d-cache aliasing issues, such as Sparc and ARM,
should select PERF_USE_VMALLOC in order to avoid them for perf mmap().
467*4882a593Smuzhiyunshould select PERF_USE_VMALLOC in order to avoid these for perf mmap().
468