.. SPDX-License-Identifier: GPL-2.0
.. _imc:

===================================
IMC (In-Memory Collection Counters)
===================================

Anju T Sudhakar, 10 May 2019

.. contents::
    :depth: 3


Basic overview
==============

IMC (In-Memory Collection counters) is a hardware monitoring facility that
collects large numbers of hardware performance events at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC
(On-Chip Controller) complex. The microcode collects the counter data and moves
the nest IMC counter data to memory.

The Core and Thread IMC PMU counters are handled in the core. Core level PMU
counters give us the IMC counters' data per core and thread level PMU counters
give us the IMC counters' data per CPU thread.

OPAL obtains the IMC PMU and supported events information from the IMC Catalog
and passes it on to the kernel via the device tree. Each event's information
contains:

- Event name
- Event Offset
- Event description

and possibly also:

- Event scale
- Event unit

Some PMUs may have common scale and unit values for all their supported
events. In those cases, the scale and unit properties for those events must be
inherited from the PMU.

The event offset in memory is where the counter data gets accumulated.

The IMC catalog is available at:
	https://github.com/open-power/ima-catalog

The kernel discovers the IMC counters information in the device tree at the
`imc-counters` device node, which has a compatible field
`ibm,opal-in-memory-counters`. From the device tree, the kernel parses the PMUs
and their events' information and registers each PMU and its attributes in the
kernel.

IMC example usage
=================

.. code-block:: sh

  # perf list
  [...]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/           [Kernel PMU event]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/           [Kernel PMU event]
  [...]
  core_imc/CPM_0THRD_NON_IDLE_PCYC/                 [Kernel PMU event]
  core_imc/CPM_1THRD_NON_IDLE_INST/                 [Kernel PMU event]
  [...]
  thread_imc/CPM_0THRD_NON_IDLE_PCYC/               [Kernel PMU event]
  thread_imc/CPM_1THRD_NON_IDLE_INST/               [Kernel PMU event]

To see per chip data for nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/:

.. code-block:: sh

  # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket

To see non-idle instructions for core 0:

.. code-block:: sh

  # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000

To see non-idle cycles for a "make":

.. code-block:: sh

  # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make


IMC Trace-mode
==============

POWER9 supports two modes for IMC: Accumulation mode and Trace mode. In
Accumulation mode, event counts are accumulated in system memory. The
hypervisor then reads the posted counts periodically or when requested. In IMC
Trace mode, the 64-bit trace SCOM value is initialized with the event
information. CPMCxSEL and CPMC_LOAD in the trace SCOM specify the event to be
monitored and the sampling duration. On each overflow in the CPMCxSEL,
hardware snapshots the program counter along with event counts and writes them
into the memory pointed to by LDBAR.

LDBAR is a 64-bit special-purpose per-thread register; it has bits to indicate
whether hardware is configured for accumulation or trace mode.

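
As a rough illustration of the mode bits described above, the following shell
sketch classifies a raw LDBAR value given as a hex string. It assumes IBM
(MSB-0) bit numbering, so that bit 0 of the LDBAR register layout in this
document is the most significant bit of the 64-bit value. `ldbar_mode` is a
hypothetical helper name used only for this example; it is not a kernel or perf
interface.

```sh
# Hypothetical helper: report the mode encoded in an LDBAR value.
# Assumption: MSB-0 bit numbering, so bits 0:3 are the first hex digit
# of the 16-digit value. The string is inspected directly so that values
# with the top bit set do not depend on signed shell arithmetic.
ldbar_mode() {
    ldbar_hex=${1#0x}                         # drop an optional 0x prefix
    while [ "${#ldbar_hex}" -lt 16 ]; do      # left-pad to 16 hex digits
        ldbar_hex="0$ldbar_hex"
    done
    ldbar_top=$(printf '%s' "$ldbar_hex" | cut -c1)
    ldbar_nib=$(( 0x$ldbar_top ))             # bits 0:3 as a number 0..15
    ldbar_enabled=$(( (ldbar_nib >> 3) & 1 )) # bit 0: enable/disable
    ldbar_trace=$(( (ldbar_nib >> 2) & 1 ))   # bit 1: 0 = accumulation, 1 = trace
    if [ "$ldbar_enabled" -eq 0 ]; then
        echo "disabled"
    elif [ "$ldbar_trace" -eq 1 ]; then
        echo "trace"
    else
        echo "accumulation"
    fi
}
```

For example, `ldbar_mode 0xC000000000000000` has both bit 0 and bit 1 set and
is reported as trace mode under these assumptions.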

LDBAR Register Layout
---------------------

  +-------+----------------------+
  | 0     | Enable/Disable       |
  +-------+----------------------+
  | 1     | 0: Accumulation Mode |
  |       +----------------------+
  |       | 1: Trace Mode        |
  +-------+----------------------+
  | 2:3   | Reserved             |
  +-------+----------------------+
  | 4:6   | PB scope             |
  +-------+----------------------+
  | 7     | Reserved             |
  +-------+----------------------+
  | 8:50  | Counter Address      |
  +-------+----------------------+
  | 51:63 | Reserved             |
  +-------+----------------------+

TRACE_IMC_SCOM bit representation
---------------------------------

  +-------+------------+
  | 0:1   | SAMPSEL    |
  +-------+------------+
  | 2:33  | CPMC_LOAD  |
  +-------+------------+
  | 34:40 | CPMC1SEL   |
  +-------+------------+
  | 41:47 | CPMC2SEL   |
  +-------+------------+
  | 48:50 | BUFFERSIZE |
  +-------+------------+
  | 51:63 | RESERVED   |
  +-------+------------+

CPMC_LOAD contains the sampling duration. SAMPSEL and CPMCxSEL determine the
event to count. BUFFERSIZE indicates the memory range. On each overflow,
hardware snapshots the program counter along with event counts, updates the
memory, and reloads the CPMC_LOAD value for the next sampling duration. IMC
hardware does not support exceptions, so it quietly wraps around if the memory
buffer reaches the end.

*Currently the event monitored for trace-mode is fixed as cycle.*

Trace IMC example usage
=======================

.. code-block:: sh

  # perf list
  [....]
  trace_imc/trace_cycles/                            [Kernel PMU event]

To record an application/process with the trace-imc event:

.. code-block:: sh

  # perf record -e trace_imc/trace_cycles/ yes > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ]

The generated `perf.data` can be read using `perf report`.

Benefits of using IMC trace-mode
================================

PMI (Performance Monitoring Interrupt) handling is avoided, since IMC trace
mode snapshots the program counter and writes it to memory. This also provides
a way for the operating system to do instruction sampling in real time without
PMI processing overhead.

Performance data using `perf top` with and without the trace-imc event is
shown below.

PMI interrupt counts before running `perf top`, after running it without the
trace-imc event, and after running it with the trace-imc event:

.. code-block:: sh

  # grep PMI /proc/interrupts
  PMI:          0          0          0          0   Performance monitoring interrupts
  # ./perf top
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts
  # ./perf top -e trace_imc/trace_cycles/
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts


That is, the PMI interrupt counts do not increment when using the `trace_imc` event.
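
The comparison above can be made less error-prone by summing the per-CPU PMI
counts before and after a run. The following is a minimal sketch, assuming the
`PMI:` line format shown in the output above; `pmi_total` is a hypothetical
helper name, and the file argument exists only so the function can be tried on
saved output.

```sh
# Hypothetical helper: sum the per-CPU PMI counts from a file in
# /proc/interrupts format (defaults to /proc/interrupts itself).
# Assumption: the PMI line looks like the one shown in this document,
# i.e. "PMI:" followed by one numeric column per CPU and a description.
pmi_total() {
    awk '$1 == "PMI:" {
        total = 0
        # Add up numeric columns; stop at the first non-numeric field,
        # which is the start of the description text.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            total += $i
        print total
    }' "${1:-/proc/interrupts}"
}

# Example use (commented out so that loading the function has no side effects):
#   before=$(pmi_total)
#   ... run the perf command under test ...
#   after=$(pmi_total)
#   echo "PMI delta: $((after - before))"
```

A delta of zero across a run with the `trace_imc` event would match the
behaviour shown in the output above.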