perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On x86, the tool is based on the load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
with the thresholding feature.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses (loads that hit a modified cacheline held in another cache).

The basic workflow with this tool follows the standard record/report phases.
The user runs the record command to record event data and the report command
to display it.


RECORD OPTIONS
--------------
-e::
--event=::
        Select the PMU event.
        Use 'perf c2c record -e list'
        to list available events.

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-l::
--ldlat::
        Configure the mem-loads latency threshold. (x86 only)

-k::
--all-kernel::
        Configure all used events to run in kernel space.

-u::
--all-user::
        Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
        vmlinux pathname

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-i::
--input::
        Specify the input file to process.

-N::
--node-info::
        Show extra node info in the report (see NODE INFO section)

-c::
--coalesce::
        Specify sorting fields for single cacheline display.
        The following fields are available: tid,pid,iaddr,dso
        (see COALESCE)

-g::
--call-graph::
        Set up callchain parameters.
        Please refer to the perf-report man page for details.

--stdio::
        Force the stdio output (see STDIO OUTPUT)

--stats::
        Display only the statistics tables and force stdio mode.

--full-symbols::
        Display full length of symbols.

--no-source::
        Do not display the Source:Line column.

--show-all::
        Show all captured HITM lines, regardless of the 0.0005 HITM % limit.

-f::
--force::
        Don't do ownership validation.

-d::
--display::
        Select the HITM type (rmt, lcl) to display and sort on. Total HITMs is
        the default.

--stitch-lbr::
        Show the callgraph with stitched LBRs, which may give a more complete
        callgraph. The perf.data file must have been obtained using
        perf c2c record --call-graph lbr.
        Disabled by default. In common cases with call stack overflows,
        it can recreate better call stacks than the default lbr call stack
        output. But this approach is not foolproof. There can be cases
        where it creates incorrect call stacks from incorrect matches.
        The known limitations include exception handling such as
        setjmp/longjmp, where calls and returns do not match.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on x86:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, for example
(to enable callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).
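
The report groups samples by the cacheline that contains each sampled data
address, and the per-offset view shows where inside that line the access
landed. A minimal sketch of that mapping in shell arithmetic, assuming a
64-byte cacheline (an assumption, this page does not state the size and it
depends on the CPU) and a made-up sample address:

```shell
# Assumption: 64-byte cachelines. The real size depends on the CPU.
CACHELINE_SIZE=64

# Hypothetical sampled data address, chosen only for illustration.
addr=0x7f3a12345678

# Clearing the low bits gives the cacheline base address.
cacheline=$(( addr & ~(CACHELINE_SIZE - 1) ))

# The low bits that were cleared are the offset within the cacheline.
offset=$(( addr & (CACHELINE_SIZE - 1) ))

printf 'cacheline=0x%x offset=0x%x\n' "$cacheline" "$offset"
```

With these assumed values the access falls at offset 0x38 of cacheline
0x7f3a12345640.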

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general, the report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm
  - cacheline percentage of all Remote/Local HITM accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm
  - count of Total/Local/Remote load HITMs

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl
  - % of Remote/Local HITM accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss
  - % of store accesses that hit/missed L1 for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the process responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with a list of affected CPUs, in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, the following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]