.. xref: /OK3568_Linux_fs/kernel/Documentation/riscv/pmu.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
===================================
Supporting PMUs on RISC-V platforms
===================================

Alan Kao <alankao@andestech.com>, Mar 2018

Introduction
------------

As of this writing, perf_event-related features mentioned in The RISC-V ISA
Privileged Version 1.10 are as follows:
(please check the manual for more details)

* [m|s]counteren
* mcycle[h], cycle[h]
* minstret[h], instret[h]
* mhpeventx, mhpcounterx[h]

With only this function set, porting perf would require a lot of work, due to
the lack of the following general architectural performance monitoring features:

* Enabling/Disabling counters
  Counters are just free-running all the time in our case.
* Interrupt caused by counter overflow
  No such feature in the spec.
* Interrupt indicator
  It is not possible to have many interrupt ports for all counters, so an
  interrupt indicator is required for software to tell which counter has
  just overflowed.
* Writing to counters
  There will be an SBI to support this since the kernel cannot modify the
  counters [1].  Alternatively, some vendors consider implementing
  hardware extensions for M-S-U model machines to write counters directly.

This document aims to provide developers with a quick guide on supporting their
PMUs in the kernel.  The following sections briefly explain perf's mechanism
and todos.

You may check previous discussions here [1][2].  Also, it might be helpful
to check the appendix for related kernel structures.


1. Initialization
-----------------

*riscv_pmu* is a global pointer of type *struct riscv_pmu*, which contains
various methods according to perf's internal convention and PMU-specific
parameters.  One should declare such an instance to represent the PMU.  By
default, *riscv_pmu* points to a constant structure *riscv_base_pmu*, which
provides very basic support for a baseline QEMU model.

Developers can then either assign their instance's pointer to *riscv_pmu* so
that the minimal and already-implemented logic can be leveraged, or provide
their own *riscv_init_platform_pmu* implementation.

In other words, the existing sources of *riscv_base_pmu* merely provide a
reference implementation.  Developers can flexibly decide how many parts they
want to leverage, and in the most extreme case, they can customize every
function according to their needs.

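The override pattern described above can be sketched in plain C.  The structure
layout and field names below are illustrative stand-ins, not the kernel's
actual definitions; the point is only how a vendor PMU starts from the
reference instance and replaces the parts it needs::

  /* Minimal, self-contained sketch of the override pattern.  All names
   * here (riscv_base_pmu_sketch, setup_vendor_pmu, ...) are hypothetical. */
  struct riscv_pmu {
          const char *name;
          int (*event_init)(int config);  /* simplified signature */
          int num_counters;
  };

  /* Baseline implementation, analogous to *riscv_base_pmu*. */
  static int base_event_init(int config)
  {
          return config == 0 ? 0 : -1;    /* supports one event only */
  }

  static const struct riscv_pmu riscv_base_pmu_sketch = {
          .name         = "base",
          .event_init   = base_event_init,
          .num_counters = 2,
  };

  /* A vendor PMU accepting more event types. */
  static int vendor_event_init(int config)
  {
          return 0;
  }

  static struct riscv_pmu vendor_pmu;

  static const struct riscv_pmu *setup_vendor_pmu(void)
  {
          vendor_pmu = riscv_base_pmu_sketch;     /* start from the reference */
          vendor_pmu.name         = "vendor";
          vendor_pmu.event_init   = vendor_event_init;  /* override one method */
          vendor_pmu.num_counters = 4;
          return &vendor_pmu;     /* this would be assigned to *riscv_pmu* */
  }

The value-copy from the reference instance keeps every method that is not
explicitly overridden, which matches the "leverage as many parts as you want"
idea above.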

2. Event Initialization
-----------------------

When a user launches a perf command to monitor some events, it is first
interpreted by the userspace perf tool into multiple *perf_event_open*
system calls, and then each of them reaches the body of the *event_init*
member function that was assigned in the previous step.  In *riscv_base_pmu*'s
case, it is *riscv_event_init*.

The main purpose of this function is to translate the event provided by the
user into a bitmap, so that HW-related control registers or counters can be
directly manipulated.  The translation is based on the mappings and methods
provided in *riscv_pmu*.

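A common way to implement such a translation is a lookup table from generic
perf event IDs to hardware event encodings.  The enum values and encodings
below are hypothetical, for illustration only::

  /* Sketch of event translation.  Encodings are made up; a real driver
   * would use the PMU's documented event select values. */
  enum sketch_event_id {
          EV_CPU_CYCLES,
          EV_INSTRUCTIONS,
          EV_CACHE_MISSES,
          EV_MAX,
  };

  static const int hw_event_map[EV_MAX] = {
          [EV_CPU_CYCLES]   = 0x01,  /* e.g. selects the cycle counter */
          [EV_INSTRUCTIONS] = 0x02,  /* selects the instret counter */
          [EV_CACHE_MISSES] = -1,    /* not supported on this PMU */
  };

  /* Translate a generic event ID into a hardware encoding; return a
   * negative value for unsupported events, as event_init implementations
   * usually do (the kernel would use -ENOENT). */
  static int map_hw_event(int ev)
  {
          if (ev < 0 || ev >= EV_MAX || hw_event_map[ev] < 0)
                  return -1;
          return hw_event_map[ev];
  }

Returning an error for unmapped events lets the perf core fall back or
report the event as unsupported.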
Note that some features can be set up at this stage as well:

(1) interrupt setting, which is stated in the next section;
(2) privilege level setting (user space only, kernel space only, both);
(3) destructor setting.  Normally it is sufficient to apply *riscv_destroy_event*;
(4) tweaks for non-sampling events, which will be utilized by functions such as
    *perf_adjust_period*, usually something like the following::

      if (!is_sampling_event(event)) {
              hwc->sample_period = x86_pmu.max_period;
              hwc->last_period = hwc->sample_period;
              local64_set(&hwc->period_left, hwc->sample_period);
      }

In the case of *riscv_base_pmu*, only (3) is provided for now.


3. Interrupt
------------

3.1. Interrupt Initialization

This often occurs at the beginning of the *event_init* method.  In common
practice, this should be a code segment like::

  int x86_reserve_hardware(void)
  {
        int err = 0;

        if (!atomic_inc_not_zero(&pmc_refcount)) {
                mutex_lock(&pmc_reserve_mutex);
                if (atomic_read(&pmc_refcount) == 0) {
                        if (!reserve_pmc_hardware())
                                err = -EBUSY;
                        else
                                reserve_ds_buffers();
                }
                if (!err)
                        atomic_inc(&pmc_refcount);
                mutex_unlock(&pmc_reserve_mutex);
        }

        return err;
  }

The magic is in *reserve_pmc_hardware*, which usually performs atomic
operations to make the implemented IRQ accessible through some global function
pointer.  *release_pmc_hardware* serves the opposite purpose, and it is used
in the event destructors mentioned in the previous section.

(Note: judging from the implementations in all the architectures, the
*reserve/release* pair is always about IRQ settings, so the name *pmc_hardware*
is somewhat misleading.  It does NOT deal with the binding between an event
and a physical counter, which will be introduced in the next section.)

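The essence of the reserve/release pair can be reduced to a refcount guarding
a single global handler slot.  The sketch below uses C11 atomics instead of
the kernel's atomic_t/mutex pair, and all names are illustrative::

  /* Sketch of reserve/release: the first user installs the IRQ handler,
   * the last user tears it down.  Simplified; the kernel version also
   * takes a mutex and calls request_irq()/free_irq(). */
  typedef void (*pmu_irq_handler_t)(void);

  static atomic_int pmc_refcount;
  static pmu_irq_handler_t pmu_irq_handler;  /* "the global function pointer" */

  static void demo_handler(void) { }

  static int reserve_pmc_hardware(pmu_irq_handler_t h)
  {
          if (atomic_fetch_add(&pmc_refcount, 1) == 0)
                  pmu_irq_handler = h;    /* first user installs the handler */
          return 0;
  }

  static void release_pmc_hardware(void)
  {
          if (atomic_fetch_sub(&pmc_refcount, 1) == 1)
                  pmu_irq_handler = NULL; /* last user tears it down */
  }
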
3.2. IRQ Structure

Basically, an IRQ handler runs the following pseudo code::

  for each hardware counter that triggered this overflow

      get the event of this counter

      // the following two steps are defined as *read()*;
      // check the section Reading/Writing Counters for details.
      count the delta value since the previous interrupt
      update the event->count (# of event occurrences) by adding the delta, and
                 event->hw.period_left by subtracting the delta

      if the event overflows
          sample data
          set the counter appropriately for the next overflow

          if the event overflows again
              too frequently, throttle this event
          fi
      fi

  end for

However, as of this writing, none of the RISC-V implementations have designed
an interrupt for perf, so the details are to be completed in the future.

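One loop iteration of the pseudo code above can be rendered in C as follows.
All types and helpers are stand-ins (a real handler would read CSRs and call
*perf_event_overflow*), and throttling is omitted::

  /* Self-contained sketch of the per-counter overflow work. */
  struct sketch_event {
          uint64_t count;        /* event->count */
          int64_t  period_left;  /* event->hw.period_left */
          uint64_t prev_raw;     /* last raw counter value seen */
  };

  /* The *read()* step: fold the delta since the previous interrupt into
   * the event, as described in Reading/Writing Counters. */
  static void sketch_read(struct sketch_event *ev, uint64_t raw)
  {
          uint64_t delta = raw - ev->prev_raw;

          ev->prev_raw = raw;
          ev->count += delta;
          ev->period_left -= (int64_t)delta;
  }

  /* One iteration of the overflow loop; returns 1 if a sample was taken. */
  static int sketch_handle_overflow(struct sketch_event *ev, uint64_t raw,
                                    uint64_t sample_period)
  {
          sketch_read(ev, raw);
          if (ev->period_left <= 0) {     /* the event overflowed */
                  /* sample data here, then rearm for the next overflow */
                  ev->period_left += (int64_t)sample_period;
                  return 1;
          }
          return 0;
  }

Note that the delta computation relies on unsigned wraparound, so it stays
correct even when the raw counter wraps between two interrupts.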
4. Reading/Writing Counters
---------------------------

They seem symmetric, but perf treats them quite differently.  For reading,
there is a *read* interface in *struct pmu*, but it serves more than just
reading.  Depending on the context, the *read* function not only reads the
content of the counter (event->count), but also updates the period left until
the next interrupt (event->hw.period_left).

However, the core of perf does not need direct writes to counters.  Writing
counters is hidden behind the abstraction of 1) *pmu->start*: to literally
start counting, one has to set the counter to a good value for the next
interrupt; 2) inside the IRQ, the handler should set the counter to the same
reasonable value.

Reading is not a problem in RISC-V, but writing requires some effort, since
counters are not allowed to be written from S-mode.

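A common trick for the write side is to arm the counter with the two's
complement of the remaining period, so that it wraps to zero (and raises the
overflow interrupt) after exactly *period_left* more events.  In the sketch
below, *sbi_write_counter* is a hypothetical stand-in for whatever SBI call
ends up providing this, since S-mode cannot write the counter directly::

  /* Illustrative write-side sketch; all names are made up. */
  static uint64_t fake_counter;   /* stands in for the hardware counter */

  static void sbi_write_counter(uint64_t v)
  {
          fake_counter = v;       /* a real version would trap into M-mode */
  }

  /* Arm the counter so it overflows after period_left more events. */
  static uint64_t arm_counter(int64_t period_left)
  {
          uint64_t start = (uint64_t)-period_left;  /* 2^64 - period_left */

          sbi_write_counter(start);
          return start;
  }
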

5. add()/del()/start()/stop()
-----------------------------

Basic idea: add()/del() adds/deletes events to/from a PMU, and start()/stop()
starts/stops the counter of some event in the PMU.  All of them take the same
arguments: *struct perf_event *event* and *int flag*.

Consider perf as a state machine; you will then find that these functions
serve as the state transition process between those states.
Three states (event->hw.state) are defined:

* PERF_HES_STOPPED:	the counter is stopped
* PERF_HES_UPTODATE:	the event->count is up-to-date
* PERF_HES_ARCH:	arch-dependent usage ... we don't need this for now

A normal flow of these state transitions is as follows:

* A user launches a perf event, resulting in a call to *event_init*.
* When being context-switched in, *add* is called by the perf core, with the
  flag PERF_EF_START, which means that the event should be started after it
  is added.  At this stage, a general event is bound to a physical counter,
  if any.  The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE,
  because it is now stopped, and the (software) event count does not need
  updating.

  - *start* is then called, and the counter is enabled.
    With the flag PERF_EF_RELOAD, it writes an appropriate value to the
    counter (check the previous section for details).
    Nothing is written if the flag does not contain PERF_EF_RELOAD.
    The state is now reset to none, because it is neither stopped nor updated
    (the counting has already started).

* When being context-switched out, *del* is called.  It then checks out all
  the events in the PMU and calls *stop* to update their counts.

  - *stop* is called by *del*
    and by the perf core with the flag PERF_EF_UPDATE, and it often shares
    the same subroutine as *read*, with the same logic.
    The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, again.

  - Life cycle of these two pairs: *add* and *del* are called repeatedly as
    tasks switch in and out; *start* and *stop* are also called when the perf
    core needs a quick stop-and-start, for instance, when the interrupt
    period is being adjusted.

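The state transitions above can be sketched as follows.  The flag names and
values mirror the kernel's, but the types and logic are simplified stand-ins::

  /* Sketch of the hw.state transitions driven by add/start/stop. */
  #define PERF_HES_STOPPED   0x01
  #define PERF_HES_UPTODATE  0x02

  #define PERF_EF_START      0x01
  #define PERF_EF_RELOAD     0x02
  #define PERF_EF_UPDATE     0x04

  struct sketch_hw_event { int state; };

  static void sketch_start(struct sketch_hw_event *hwc, int flags)
  {
          /* PERF_EF_RELOAD would rewrite the counter here (see section 4) */
          (void)flags;
          hwc->state = 0;  /* neither stopped nor up to date: counting */
  }

  static void sketch_stop(struct sketch_hw_event *hwc, int flags)
  {
          /* PERF_EF_UPDATE would fold the final delta in via read() */
          (void)flags;
          hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
  }

  static int sketch_add(struct sketch_hw_event *hwc, int flags)
  {
          /* bind the event to a physical counter here, if any */
          hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
          if (flags & PERF_EF_START)
                  sketch_start(hwc, PERF_EF_RELOAD);
          return 0;
  }
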
The current implementation is sufficient for now and can be easily extended
with new features in the future.

A. Related Structures
---------------------

* struct pmu: include/linux/perf_event.h
* struct riscv_pmu: arch/riscv/include/asm/perf_event.h

  Both structures are designed to be read-only.

  *struct pmu* defines some function pointer interfaces, and most of them take
  *struct perf_event* as a main argument, dealing with perf events according to
  perf's internal state machine (check kernel/events/core.c for details).

  *struct riscv_pmu* defines PMU-specific parameters.  The naming follows the
  convention of all other architectures.

* struct perf_event: include/linux/perf_event.h
* struct hw_perf_event

  The generic structure that represents perf events, and its hardware-related
  details.

* struct riscv_hw_events: arch/riscv/include/asm/perf_event.h

  The structure that holds the status of events; it has two fixed members:
  the number of events and the array of the events.

References
----------

[1] https://github.com/riscv/riscv-linux/pull/124

[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/sw-dev/f19TmCNP6yA