===================================
Supporting PMUs on RISC-V platforms
===================================

Alan Kao <alankao@andestech.com>, Mar 2018

Introduction
------------

As of this writing, the perf_event-related features mentioned in The RISC-V ISA
Privileged Version 1.10 are as follows (please check the manual for more
details):

* [m|s]counteren
* mcycle[h], cycle[h]
* minstret[h], instret[h]
* mhpeventx, mhpcounterx[h]

With only this feature set, porting perf would require a lot of work, due to
the lack of the following general architectural performance monitoring
features:

* Enabling/Disabling counters
    Counters are just free-running all the time in our case.
* Interrupt caused by counter overflow
    No such feature in the spec.
* Interrupt indicator
    It is not possible to have many interrupt ports for all counters, so an
    interrupt indicator is required for software to tell which counter has
    just overflowed.
* Writing to counters
    There will be an SBI to support this since the kernel cannot modify the
    counters [1].  Alternatively, some vendors consider implementing
    hardware extensions that let M-S-U model machines write counters directly.

This document aims to provide developers a quick guide on supporting their
PMUs in the kernel.
The following sections briefly explain perf's mechanisms and todos.

You may check previous discussions here [1][2]. Also, it might be helpful
to check the appendix for related kernel structures.


1. Initialization
-----------------

*riscv_pmu* is a global pointer of type *struct riscv_pmu*, which contains
various methods according to perf's internal convention and PMU-specific
parameters. One should declare such an instance to represent the PMU. By
default, *riscv_pmu* points to a constant structure *riscv_base_pmu*, which
has very basic support for a baseline QEMU model.

Developers can then either assign the instance's pointer to *riscv_pmu* so
that the minimal and already-implemented logic can be leveraged, or provide
their own *riscv_init_platform_pmu* implementation.

In other words, the existing sources of *riscv_base_pmu* merely provide a
reference implementation. Developers can flexibly decide how many parts they
want to leverage, and in the most extreme case, they can customize every
function according to their needs.


2. Event Initialization
-----------------------

When a user launches a perf command to monitor some events, it is first
interpreted by the userspace perf tool into multiple *perf_event_open*
system calls, and then each of them calls the body of the *event_init*
member function that was assigned in the previous step.
In *riscv_base_pmu*'s
case, it is *riscv_event_init*.

The main purpose of this function is to translate the event provided by the
user into a bitmap, so that HW-related control registers or counters can be
manipulated directly. The translation is based on the mappings and methods
provided in *riscv_pmu*.

Note that some features can be set up at this stage as well:

(1) interrupt setting, which is stated in the next section;
(2) privilege level setting (user space only, kernel space only, both);
(3) destructor setting. Normally it is sufficient to apply *riscv_destroy_event*;
(4) tweaks for non-sampling events, which will be utilized by functions such as
    *perf_adjust_period*, usually something like the following::

      if (!is_sampling_event(event)) {
              hwc->sample_period = x86_pmu.max_period;
              hwc->last_period = hwc->sample_period;
              local64_set(&hwc->period_left, hwc->sample_period);
      }

In the case of *riscv_base_pmu*, only (3) is provided for now.


3. Interrupt
------------

3.1. Interrupt Initialization

This often occurs at the beginning of the *event_init* method.
In common
practice, this should be a code segment like::

  int x86_reserve_hardware(void)
  {
          int err = 0;

          if (!atomic_inc_not_zero(&pmc_refcount)) {
                  mutex_lock(&pmc_reserve_mutex);
                  if (atomic_read(&pmc_refcount) == 0) {
                          if (!reserve_pmc_hardware())
                                  err = -EBUSY;
                          else
                                  reserve_ds_buffers();
                  }
                  if (!err)
                          atomic_inc(&pmc_refcount);
                  mutex_unlock(&pmc_reserve_mutex);
          }

          return err;
  }

The magic is in *reserve_pmc_hardware*, which usually performs atomic
operations to make the implemented IRQ accessible from some global function
pointer.  *release_pmc_hardware* serves the opposite purpose, and it is used
in the event destructors mentioned in the previous section.

(Note: in the implementations of all the architectures, the *reserve/release*
pair always deals with IRQ settings, so the name *pmc_hardware* seems somewhat
misleading.  It does NOT deal with the binding between an event and a physical
counter, which will be introduced in the next section.)
3.2. IRQ Structure

Basically, an IRQ handler runs the following pseudo code::

  for each hardware counter that triggered this overflow

      get the event of this counter

      // the following two steps are defined as *read()*,
      // check the section Reading/Writing Counters for details.
      count the delta value since the previous interrupt
      update event->count (# of event occurrences) by adding the delta, and
      event->hw.period_left by subtracting the delta

      if the event overflows
          sample data
          set the counter appropriately for the next overflow

          if the event overflows again
              too frequently, throttle this event
          fi
      fi

  end for

However, as of this writing, none of the RISC-V implementations have designed
an interrupt for perf, so the details are to be completed in the future.

4. Reading/Writing Counters
---------------------------

They seem symmetric but perf treats them quite differently. For reading, there
is a *read* interface in *struct pmu*, but it serves more than just reading.
According to the context, the *read* function not only reads the content of
the counter (event->count), but also updates the period left until the next
interrupt (event->hw.period_left).

But the core of perf does not need to write to counters directly.
Writing counters
is hidden behind two abstractions: 1) *pmu->start*, which literally starts
counting, so one has to set the counter to a good value for the next
interrupt; 2) the IRQ handler, which should set the counter to the same
reasonable value.

Reading is not a problem in RISC-V, but writing would need some effort, since
counters are not allowed to be written by S-mode.


5. add()/del()/start()/stop()
-----------------------------

Basic idea: add()/del() adds/deletes events to/from a PMU, and start()/stop()
starts/stops the counter of some event in the PMU. All of them take the same
arguments: *struct perf_event *event* and *int flag*.

Consider perf as a state machine; then you will find that these functions
serve as the state transition process between those states.
Three states (event->hw.state) are defined:

* PERF_HES_STOPPED: the counter is stopped
* PERF_HES_UPTODATE: the event->count is up-to-date
* PERF_HES_ARCH: arch-dependent usage ... we don't need this for now

A normal flow of these state transitions is as follows:

* A user launches a perf event, resulting in a call to *event_init*.
* When being context-switched in, *add* is called by the perf core, with the
  flag PERF_EF_START, which means that the event should be started after it
  is added.  At this stage, a general event is bound to a physical counter,
  if any.
  The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, because the
  event is now stopped, and the (software) event count does not need updating.

  - *start* is then called, and the counter is enabled.
    With the flag PERF_EF_RELOAD, it writes an appropriate value to the
    counter (check the previous section for details).
    Nothing is written if the flag does not contain PERF_EF_RELOAD.
    The state is now reset to none, because the event is neither stopped nor
    up-to-date (the counting has already started).

* When being context-switched out, *del* is called. It then checks out all
  the events in the PMU and calls *stop* to update their counts.

  - *stop* is called by *del* and by the perf core with the flag
    PERF_EF_UPDATE, and it often shares the same subroutine as *read*, with
    the same logic.  The state changes to PERF_HES_STOPPED and
    PERF_HES_UPTODATE again.

  - Life cycle of these two pairs: *add* and *del* are called repeatedly as
    tasks switch in and out; *start* and *stop* are also called when the perf
    core needs a quick stop-and-start, for instance, when the interrupt
    period is being adjusted.

The current implementation is sufficient for now and can be easily extended
to new features in the future.
A. Related Structures
---------------------

* struct pmu: include/linux/perf_event.h
* struct riscv_pmu: arch/riscv/include/asm/perf_event.h

  Both structures are designed to be read-only.

  *struct pmu* defines some function pointer interfaces, and most of them
  take *struct perf_event* as a main argument, dealing with perf events
  according to perf's internal state machine (check kernel/events/core.c for
  details).

  *struct riscv_pmu* defines PMU-specific parameters. The naming follows the
  convention of all other architectures.

* struct perf_event: include/linux/perf_event.h
* struct hw_perf_event

  The generic structure that represents perf events, and the hardware-related
  details.

* struct riscv_hw_events: arch/riscv/include/asm/perf_event.h

  The structure that holds the status of events; it has two fixed members:
  the number of events and the array of the events.

References
----------

[1] https://github.com/riscv/riscv-linux/pull/124

[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/sw-dev/f19TmCNP6yA