Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hardware events,
such as instructions executed, cache misses suffered, or branches
mis-predicted, without slowing down the kernel or applications. These
registers can also trigger interrupts when a threshold number of events
has passed, and can thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters, counter
groups, and it provides event capabilities on top of those. It
provides "virtual" 64-bit counters, regardless of the width of the
underlying hardware counters.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the sys_perf_event_open()
system call:

   int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.
Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'perf_event_attr' is:

struct perf_event_attr {
	/*
	 * The MSB of the config word signifies if the rest contains cpu
	 * specific (raw) counter configuration data; if unset, the next
	 * 7 bits are an event type and the rest of the bits are the event
	 * identifier.
	 */
	__u64	config;

	__u64	irq_period;
	__u32	record_type;
	__u32	read_format;

	__u64	disabled       : 1, /* off by default        */
		inherit        : 1, /* children inherit it   */
		pinned         : 1, /* must always be on PMU */
		exclusive      : 1, /* only group on PMU     */
		exclude_user   : 1, /* don't count user      */
		exclude_kernel : 1, /* ditto kernel          */
		exclude_hv     : 1, /* ditto hypervisor      */
		exclude_idle   : 1, /* don't count when idle */
		mmap           : 1, /* include mmap data     */
		munmap         : 1, /* include munmap data   */
		comm           : 1, /* include comm data     */

		__reserved_1   : 52;

	__u32	extra_config_len;
	__u32	wakeup_events;	/* wakeup every n events */

	__u64	__reserved_2;
	__u64	__reserved_3;
};

The 'config' field specifies what the counter should count. It
is divided into 3 bit-fields:

raw_type: 1 bit   (most significant bit)   0x8000_0000_0000_0000
type:     7 bits  (next most significant)  0x7f00_0000_0000_0000
event_id: 56 bits (least significant)      0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of event_config. The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_type_id {
	PERF_TYPE_HARDWARE	= 0,
	PERF_TYPE_SOFTWARE	= 1,
	PERF_TYPE_TRACEPOINT	= 2,
};

A counter of PERF_TYPE_HARDWARE will count the hardware event
specified by 'event_id':

/*
 * Generalized performance counter event types, used by the hw_event.event_id
 * parameter of the sys_perf_event_open() syscall:
 */
enum perf_hw_id {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_HW_CPU_CYCLES		= 0,
	PERF_COUNT_HW_INSTRUCTIONS		= 1,
	PERF_COUNT_HW_CACHE_REFERENCES		= 2,
	PERF_COUNT_HW_CACHE_MISSES		= 3,
	PERF_COUNT_HW_BRANCH_INSTRUCTIONS	= 4,
	PERF_COUNT_HW_BRANCH_MISSES		= 5,
	PERF_COUNT_HW_BUS_CYCLES		= 6,
};

These are standardized types of events that work relatively uniformly
on all CPUs that implement Performance Counters support under Linux,
although there may be variations (e.g., different CPUs might count
cache references and misses at different levels of the cache hierarchy).
If a CPU is not able to count the selected event, then the system call
will return -EINVAL.

More hw_event_types are supported as well, but they are CPU-specific
and accessed as raw events. For example, to count "External bus
cycles while bus lock signal asserted" events on Intel Core CPUs, pass
in a 0x4064 event_id value and set hw_event.raw_type to 1.

A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and software events of the kernel (and allow the profiling
 * of them as well):
 */
enum perf_sw_ids {
	PERF_COUNT_SW_CPU_CLOCK		= 0,
	PERF_COUNT_SW_TASK_CLOCK	= 1,
	PERF_COUNT_SW_PAGE_FAULTS	= 2,
	PERF_COUNT_SW_CONTEXT_SWITCHES	= 3,
	PERF_COUNT_SW_CPU_MIGRATIONS	= 4,
	PERF_COUNT_SW_PAGE_FAULTS_MIN	= 5,
	PERF_COUNT_SW_PAGE_FAULTS_MAJ	= 6,
	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
};

Counters of the type PERF_TYPE_TRACEPOINT are available when the ftrace event
tracer is available, and event_id values can be obtained from
/debug/tracing/events/*/*/id


Counters come in two flavours: counting counters and sampling
counters. A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0.


A read() on a counter returns the current value of the counter and possible
additional values as specified by 'read_format'; each value is a u64 (8 bytes)
in size.

/*
 * Bits that can be set in hw_event.read_format to request that
 * reads on the counter should return the indicated quantities,
 * in increasing order of bit value, after the counter value.
 */
enum perf_event_read_format {
	PERF_FORMAT_TOTAL_TIME_ENABLED	= 1,
	PERF_FORMAT_TOTAL_TIME_RUNNING	= 2,
};

Using these additional values one can establish the overcommit ratio for a
particular counter, allowing one to take the round-robin scheduling effect
into account.


A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'. A sampling counter
has irq_period > 0. The record_type controls what data is recorded on each
interrupt:

/*
 * Bits that can be set in hw_event.record_type to request information
 * in the overflow packets.
 */
enum perf_event_record_format {
	PERF_RECORD_IP		= 1U << 0,
	PERF_RECORD_TID		= 1U << 1,
	PERF_RECORD_TIME	= 1U << 2,
	PERF_RECORD_ADDR	= 1U << 3,
	PERF_RECORD_GROUP	= 1U << 4,
	PERF_RECORD_CALLCHAIN	= 1U << 5,
};

Such (and other) events will be recorded in a ring-buffer, which is
available to user-space using mmap() (see below).

The 'disabled' bit specifies whether the counter starts out disabled
or enabled. If it is initially disabled, it can be enabled by ioctl
or prctl (see below).
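For example, the overcommit scaling mentioned above can be computed from the
three u64 values that read() returns when both read_format bits are set; per
the comment above, they arrive in increasing order of bit value after the
counter value itself (a sketch of the arithmetic only, not of the read()
itself):

```c
#include <stdint.h>

/* With PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING
 * both set, a read() yields:
 *
 *	values[0] = counter value
 *	values[1] = time the counter was enabled
 *	values[2] = time the counter was actually running on the PMU
 *
 * When counters are overcommitted and round-robin scheduled,
 * running < enabled; the raw count can then be scaled up to an
 * estimate of what it would have been had the counter run the whole
 * time it was enabled. */
static uint64_t scaled_count(const uint64_t values[3])
{
	uint64_t count = values[0];
	uint64_t enabled = values[1];
	uint64_t running = values[2];

	if (running == 0)	/* counter was never scheduled on */
		return 0;

	return (uint64_t)((double)count * enabled / running);
}
```

The ratio enabled/running is exactly the overcommit factor the text refers to.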
The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified. This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible. It only applies to hardware counters
and only to group leaders. If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In the future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to request that counting of events be restricted to times when the
CPU is in user, kernel and/or hypervisor mode.
Furthermore, the 'exclude_host' and 'exclude_guest' bits provide a way
to request counting of events restricted to guest and host contexts when
using Linux as the hypervisor.

The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
operations; these can be used to relate userspace IP addresses to actual
code, even after the mapping (or even the whole process) is gone.
These events are recorded in the ring-buffer (see below).

The 'comm' bit allows tracking of process comm data on process creation.
This too is recorded in the ring-buffer (see below).

The 'pid' parameter to the sys_perf_event_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
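The pid/cpu rules above can be summarized in a small helper (a hypothetical
validity check mirroring the rules as listed here, not a kernel interface):

```c
/* Returns 1 if the (pid, cpu) combination is accepted by
 * sys_perf_event_open() per the rules above, 0 otherwise. The only
 * rejected combination is "all tasks" together with "all CPUs":
 * counting all tasks (pid < 0) requires naming a specific CPU. */
static int perf_target_valid(int pid, int cpu)
{
	if (pid < 0 && cpu == -1)
		return 0;
	return 1;
}
```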
A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
privilege.

The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up. A
counter group has one counter which is the group "leader". The leader
is created first, with group_fd = -1 in the sys_perf_event_open call
that creates it. The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU. This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.


As stated above, asynchronous events, like counter overflow or PROT_EXEC
mmap tracking, are logged into a ring-buffer. This ring-buffer is created
and accessed through mmap().

The mmap size should be 1+2^n pages, where the first page is a meta-data page
(struct perf_event_mmap_page) that contains various bits of information such
as where the ring-buffer head is.

/*
 * Structure of the page that can be mapped via mmap
 */
struct perf_event_mmap_page {
	__u32	version;	/* version number of this structure */
	__u32	compat_version;	/* lowest version this is compat with */

	/*
	 * Bits needed to read the hw counters in user-space.
	 *
	 *   u32 seq;
	 *   s64 count;
	 *
	 *   do {
	 *     seq = pc->lock;
	 *
	 *     barrier()
	 *     if (pc->index) {
	 *       count = pmc_read(pc->index - 1);
	 *       count += pc->offset;
	 *     } else
	 *       goto regular_read;
	 *
	 *     barrier();
	 *   } while (pc->lock != seq);
	 *
	 * NOTE: for obvious reasons this only works on self-monitoring
	 *       processes.
	 */
	__u32	lock;		/* seqlock for synchronization */
	__u32	index;		/* hardware counter identifier */
	__s64	offset;		/* add to hardware counter value */

	/*
	 * Control data for the mmap() data buffer.
	 *
	 * User-space reading this value should issue an rmb(), on SMP capable
	 * platforms, after reading this value -- see perf_event_wakeup().
	 */
	__u32	data_head;	/* head in the data section */
};

NOTE: the hw-counter userspace bits are arch specific and are currently only
      implemented on powerpc.

The following 2^n pages are the ring-buffer which contains events of the form:

#define PERF_RECORD_MISC_KERNEL		(1 << 0)
#define PERF_RECORD_MISC_USER		(1 << 1)
#define PERF_RECORD_MISC_OVERFLOW	(1 << 2)

struct perf_event_header {
	__u32	type;
	__u16	misc;
	__u16	size;
};

enum perf_event_type {

	/*
	 * The MMAP events record the PROT_EXEC mappings so that we can
	 * correlate userspace IPs to code. They have the following structure:
	 *
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	u32				pid, tid;
	 *	u64				addr;
	 *	u64				len;
	 *	u64				pgoff;
	 *	char				filename[];
	 * };
	 */
	PERF_RECORD_MMAP	= 1,
	PERF_RECORD_MUNMAP	= 2,

	/*
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	u32				pid, tid;
	 *	char				comm[];
	 * };
	 */
	PERF_RECORD_COMM	= 3,

	/*
	 * When header.misc & PERF_RECORD_MISC_OVERFLOW the event_type field
	 * will be PERF_RECORD_*
	 *
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	{ u64		ip;	  } && PERF_RECORD_IP
	 *	{ u32		pid, tid; } && PERF_RECORD_TID
	 *	{ u64		time;	  } && PERF_RECORD_TIME
	 *	{ u64		addr;	  } && PERF_RECORD_ADDR
	 *
	 *	{ u64		nr;
	 *	  { u64 event, val; } cnt[nr]; } && PERF_RECORD_GROUP
	 *
	 *	{ u16		nr,
	 *			hv,
	 *			kernel,
	 *			user;
	 *	  u64		ips[nr];  } && PERF_RECORD_CALLCHAIN
	 * };
	 */
};

NOTE: PERF_RECORD_CALLCHAIN is arch specific and currently only implemented
      on x86.
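Because each record begins with a perf_event_header whose 'size' field gives
the total record length, the data pages can be walked record by record. A
sketch of such a walk over an already-copied byte buffer (synthetic data,
not a live mmap area, so wrap-around and data_head synchronization are
ignored):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Local mirror of the header laid out above (__u32/__u16 fields). */
struct perf_event_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;	/* total record size, header included */
};

/* Walk 'len' bytes of ring-buffer data, counting records of the given
 * type. Each record's 'size' gives the offset of the next record. */
static int count_records(const unsigned char *buf, size_t len, uint32_t type)
{
	size_t off = 0;
	int n = 0;

	while (off + sizeof(struct perf_event_header) <= len) {
		struct perf_event_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || off + hdr.size > len)
			break;	/* malformed or truncated record */
		if (hdr.type == type)
			n++;
		off += hdr.size;
	}
	return n;
}
```

A real consumer would additionally honor data_head (with the rmb() noted
above) and handle records that wrap at the end of the 2^n data pages.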
Notification of new events is possible through poll()/select()/epoll() and
fcntl() managing signals.

Normally a notification is generated for every page filled; however, one can
additionally set perf_event_attr.wakeup_events to generate one every
so many counter overflow events.

Future work will include a splice() interface to the ring-buffer.


Counters can be enabled and disabled in two ways: via ioctl and via
prctl. When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter can be enabled with

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

or disabled with

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

For a counter group, pass PERF_IOC_FLAG_GROUP as the third argument.
Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count. Enabling or disabling a member of a
group other than the leader only affects that counter - disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.

Additionally, non-inherited overflow counters can use

	ioctl(fd, PERF_EVENT_IOC_REFRESH, nr);

to enable a counter for 'nr' events, after which it gets disabled again.

A process can enable or disable all the counter groups that are
attached to it, using prctl:

	prctl(PR_TASK_PERF_EVENTS_ENABLE);

	prctl(PR_TASK_PERF_EVENTS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes. It only enables or
disables the group leaders, not any other members in the groups.


Arch requirements
-----------------

If your architecture does not have hardware performance metrics, you can
still use the generic software counters based on hrtimers for sampling.

So to start with, in order to add HAVE_PERF_EVENTS to your Kconfig, you
will need at least this:
	- asm/perf_event.h - a basic stub will suffice at first
	- support for atomic64 types (and associated helper functions)

If your architecture does have hardware capabilities, you can override the
weak stub hw_perf_event_init() to register hardware counters.

Architectures that have d-cache aliasing issues, such as SPARC and ARM,
should select PERF_USE_VMALLOC in order to avoid these for perf mmap().