Using TopDown metrics in user space
-----------------------------------

Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
methodology to break down CPU pipeline execution into 4 bottlenecks:
frontend bound, backend bound, bad speculation, retiring.

For more details on TopDown see [1][5]

Traditionally this was implemented by events in generic counters
and specific formulas to compute the bottlenecks.

perf stat --topdown implements this.

Full TopDown includes more levels that can break down the
bottlenecks further. This is not directly implemented in perf,
but available in other tools that can run on top of perf,
such as toplev[2] or vtune[3]

New Topdown features in Ice Lake
================================

With Ice Lake CPUs the TopDown metrics are directly available as
fixed counters and do not require generic counters. This allows
TopDown to be collected at all times, in addition to other events.

% perf stat -a --topdown -I1000
#           time    retiring    bad speculation    frontend bound    backend bound
     1.001281330       23.0%              15.3%             29.6%            32.1%
     2.003009005        5.0%               6.8%             46.6%            41.6%
     3.004646182        6.7%               6.7%             46.0%            40.6%
     4.006326375        5.0%               6.4%             47.6%            41.0%
     5.007991804        5.1%               6.3%             46.3%            42.3%
     6.009626773        6.2%               7.1%             47.3%            39.3%
     7.011296356        4.7%               6.7%             46.2%            42.4%
     8.012951831        4.7%               6.7%             47.5%            41.1%
...

This also enables measuring TopDown per thread/process instead
of only per core.

Using TopDown through RDPMC in applications on Ice Lake
=======================================================

For more fine grained measurements it can be useful to
access the new counters directly from user space. This is more
complicated, but drastically lowers overhead.

On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
"pipeline SLOTS" (cycles multiplied by core issue width) and a
metric register that reports slots ratios for the different bottleneck
categories.

The metrics counter is CPU model specific and is not available on older
CPUs.

Example code
============

Library functions that implement the functionality described below
are also available in libjevents [4]

The application opens a group with fixed counter 3 (SLOTS) and any
metric event, and allows user programs to read the performance counters.

Fixed counter 3 is mapped to a pseudo event event=0x00, umask=0x04,
so the perf_event_attr structure should be initialized with
{ .config = 0x0400, .type = PERF_TYPE_RAW }
The metric events are mapped to the pseudo event event=0x00, umask=0x8X.
For example, the perf_event_attr structure can be initialized with
{ .config = 0x8000, .type = PERF_TYPE_RAW } for the Retiring metric event.
Fixed counter 3 must be the leader of the group.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Provide own perf_event_open stub because glibc doesn't */
__attribute__((weak))
int perf_event_open(struct perf_event_attr *attr, pid_t pid,
		    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open slots counter file descriptor for current task. */
struct perf_event_attr slots = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x400,
	.exclude_kernel = 1,
};

int slots_fd = perf_event_open(&slots, 0, -1, -1, 0);
if (slots_fd < 0)
	... error ...

/*
 * Open metrics event file descriptor for current task.
 * Set slots event as the leader of the group.
 */
struct perf_event_attr metrics = {
	.type = PERF_TYPE_RAW,
	.size = sizeof(struct perf_event_attr),
	.config = 0x8000,
	.exclude_kernel = 1,
};

int metrics_fd = perf_event_open(&metrics, 0, -1, slots_fd, 0);
if (metrics_fd < 0)
	... error ...


The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
to read slots and the topdown metrics at different points of the program:

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
#define RDPMC_METRIC	(1 << 29)	/* return metric counters */

#define FIXED_COUNTER_SLOTS		3
#define METRIC_COUNTER_TOPDOWN_L1	0

static inline uint64_t read_slots(void)
{
	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
}

static inline uint64_t read_metrics(void)
{
	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
}

Then the program can be instrumented to read these metrics at different
points.

It's not a good idea to do this with too short code regions,
as the parallelism and overlap in the CPU program execution will
cause too much measurement inaccuracy. For example instrumenting
individual basic blocks is definitely too fine grained.

Decoding metrics values
=======================

The value reported by read_metrics() contains four 8 bit fields,
each holding a scaled ratio for one of the Level 1 bottlenecks.
All four fields add up to 0xff (= 100%)

The binary ratios in the metric value can be converted to float ratios:

#define GET_METRIC(m, i) (((m) >> ((i)*8)) & 0xff)

#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)

and then converted to percent for printing.

The ratios in the metric accumulate for the time when the counter
is enabled. For measuring programs it is often useful to measure
specific sections. This requires taking deltas on the metrics.

This can be done by scaling the metrics with the slots counter
read at the same time.

Then it's possible to take deltas of these slots counts
measured at different points, and determine the metrics
for that time period.

	slots_a = read_slots();
	metric_a = read_metrics();

	... larger code region ...

	slots_b = read_slots();
	metric_b = read_metrics();

	/*
	 * compute scaled metrics for measurement a
	 * (each field is a fraction of 0xff, so divide by 0xff to get slots)
	 */
	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a / 0xff;
	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a / 0xff;
	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a / 0xff;
	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a / 0xff;

	/* compute delta scaled metrics between b and a */
	retiring_slots = GET_METRIC(metric_b, 0) * slots_b / 0xff - retiring_slots_a;
	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b / 0xff - bad_spec_slots_a;
	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b / 0xff - fe_bound_slots_a;
	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b / 0xff - be_bound_slots_a;

Later the individual ratios for the measurement period can be recreated
from these counts.

	slots_delta = slots_b - slots_a;
	retiring_ratio = (float)retiring_slots / slots_delta;
	bad_spec_ratio = (float)bad_spec_slots / slots_delta;
	fe_bound_ratio = (float)fe_bound_slots / slots_delta;
	be_bound_ratio = (float)be_bound_slots / slots_delta;

	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
		retiring_ratio * 100.,
		bad_spec_ratio * 100.,
		fe_bound_ratio * 100.,
		be_bound_ratio * 100.);

Resetting metrics counters
==========================

Since the individual metrics are only 8 bits wide, they lose precision
for short regions over time, because each fraction bit covers a growing
number of slots as the counters accumulate. So the counters need to be
reset regularly.

When using the kernel perf API the kernel resets on every read.
So as long as the reading is at reasonable intervals (every few
seconds) the precision is good.

When using perf stat it is recommended to always use the -I option,
with an interval no longer than a few seconds

	perf stat -I 1000 --topdown ...

For user programs using RDPMC directly the counter can
be reset explicitly using ioctl:

	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);

This "opens" a new measurement period.
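
Both samples of a delta have to come from the same measurement period,
since a reset in between makes the accumulated values incomparable.
The decoding and delta steps described under "Decoding metrics values"
can be collected into one helper that also catches this case. The
following is only a minimal sketch: the names topdown_sample,
topdown_ratios, field_slots and topdown_delta are hypothetical and not
part of perf or libjevents, and the samples are passed in as plain
integers so the arithmetic can be checked without Ice Lake hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for this sketch; not part of perf or libjevents. */
struct topdown_sample {
	uint64_t slots;		/* value returned by read_slots() */
	uint64_t metrics;	/* value returned by read_metrics() */
};

struct topdown_ratios {
	double retiring;
	double bad_spec;
	double fe_bound;
	double be_bound;
};

/* Slots attributed to 8-bit field i (0 = retiring ... 3 = backend bound). */
static double field_slots(const struct topdown_sample *s, int i)
{
	uint64_t field = (s->metrics >> (i * 8)) & 0xff;

	return (double)field * (double)s->slots / 0xff;
}

/*
 * Compute the Level 1 ratios for the region between samples a and b.
 * Returns 0 on success, or -1 if the slots count did not advance, for
 * example after a reset between the two samples.
 */
static int topdown_delta(const struct topdown_sample *a,
			 const struct topdown_sample *b,
			 struct topdown_ratios *out)
{
	double delta;

	assert(a != NULL && b != NULL && out != NULL);

	if (b->slots <= a->slots)
		return -1;

	delta = (double)(b->slots - a->slots);
	out->retiring = (field_slots(b, 0) - field_slots(a, 0)) / delta;
	out->bad_spec = (field_slots(b, 1) - field_slots(a, 1)) / delta;
	out->fe_bound = (field_slots(b, 2) - field_slots(a, 2)) / delta;
	out->be_bound = (field_slots(b, 3) - field_slots(a, 3)) / delta;
	return 0;
}
```

With made-up samples of 1000 slots / metrics 0x66333333 and 2000 slots /
metrics 0x33333366, this yields 60% retiring, 20% bad speculation,
20% frontend bound and 0% backend bound for the region in between.
In a real program the two samples would be filled from read_slots()
and read_metrics().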

A program using RDPMC for TopDown should issue a PERF_EVENT_IOC_RESET
regularly, for example every few seconds.

Limits on Ice Lake
==================

Four pseudo TopDown metric events are exposed to end users:
topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
They can be used to collect the TopDown value under the following
rules:
- All the TopDown metric events must be in a group with the SLOTS event.
- The SLOTS event must be the leader of the group.
- The PERF_FORMAT_GROUP flag must be applied for each TopDown metric
  event.

The SLOTS event and the TopDown metric events can be counting members of
a sampling read group. Since the SLOTS event must be the leader of a TopDown
group, the second event of the group is the sampling event.
For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'


[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
[5] https://sites.google.com/site/analysismethods/yasin-pubs