.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL kernel config option
and the x86 /proc/cpuinfo flag bits:

============================================= ================================
RDT (Resource Director Technology) Allocation "rdt_a"
CAT (Cache Allocation Technology)             "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)            "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                    "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)             "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)             "mba"
============================================= ================================

To use the feature mount the file system::

  # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp":
        Enable code/data prioritization in L3 cache allocations.
"cdpl2":
        Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
        Enable the MBA Software Controller (mba_sc) to specify MBA
        bandwidth in MBps.

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
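
For example, on a system with L3 allocation, memory bandwidth allocation and
L3 monitoring enabled, the directory might look like this (the exact entries
depend on which features the hardware and kernel support)::

  # ls /sys/fs/resctrl/info
  L3  L3_MON  MB  last_cmd_status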

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":
        The number of CLOSIDs which are valid for this
        resource. The kernel uses the smallest number of
        CLOSIDs of all enabled resources as limit.
"cbm_mask":
        The bitmask which is valid for this resource.
        This mask is equivalent to 100%.
"min_cbm_bits":
        The minimum number of consecutive bits which
        must be set when writing a mask.

"shareable_bits":
        Bitmask of shareable resource with other executing
        entities (e.g. I/O). User can use this when
        setting up exclusive cache partitions. Note that
        some platforms support devices that have their
        own settings for cache use which can over-ride
        these bits.
"bit_usage":
        Annotated capacity bitmasks showing how all
        instances of the resource are used. The legend is:

        "0":
              Corresponding region is unused. When the system's
              resources have been allocated and a "0" is found
              in "bit_usage" it is a sign that resources are
              wasted.

        "H":
              Corresponding region is used by hardware only
              but available for software use. If a resource
              has bits set in "shareable_bits" but not all
              of these bits appear in the resource groups'
              schematas then the bits appearing in
              "shareable_bits" but in no resource group will
              be marked as "H".
        "X":
              Corresponding region is available for sharing and
              used by hardware and software. These are the
              bits that appear in "shareable_bits" as
              well as a resource group's allocation.
        "S":
              Corresponding region is used by software
              and available for sharing.
        "E":
              Corresponding region is used exclusively by
              one resource group. No sharing allowed.
        "P":
              Corresponding region is pseudo-locked. No
              sharing allowed.
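
For example, on an L3 CAT capable system the cache allocation files might
read as follows (the values shown are illustrative and vary by CPU model)::

  # cat /sys/fs/resctrl/info/L3/cbm_mask
  fffff
  # cat /sys/fs/resctrl/info/L3/num_closids
  16
  # cat /sys/fs/resctrl/info/L3/min_cbm_bits
  1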

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
        The minimum memory bandwidth percentage which
        user can request.

"bandwidth_gran":
        The granularity in which the memory bandwidth
        percentage is allocated. The allocated
        b/w percentage is rounded off to the next
        control step available on the hardware. The
        available bandwidth control steps are:
        min_bandwidth + N * bandwidth_gran.

"delay_linear":
        Indicates if the delay scale is linear or
        non-linear. This field is purely informational.

"thread_throttle_mode":
        Indicator on Intel systems of how tasks running on threads
        of a physical core are throttled in cases where they
        request different memory bandwidth percentages:

        "max":
                the smallest percentage is applied
                to all threads
        "per-thread":
                bandwidth percentages are directly applied to
                the threads running on the core

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
        The number of RMIDs available. This is the
        upper bound for how many "CTRL_MON" + "MON"
        groups can be created.

"mon_features":
        Lists the monitoring events if
        monitoring is enabled for the resource.

"max_threshold_occupancy":
        Read/write file provides the largest value (in
        bytes) at which a previously used LLC_occupancy
        counter can be considered for re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

  # echo L3:0=f7 > schemata
  bash: echo: write error: Invalid argument
  # cat info/last_cmd_status
  mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.
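
For example, a new CTRL_MON group (the name "p0" is arbitrary) is created
with a plain mkdir in the root of the mounted file system::

  # mkdir /sys/fs/resctrl/p0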

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

"size":
        Mirrors the display of the "schemata" file to display the size in
        bytes of each allocation instead of the bits representing the
        allocation.

"mode":
        The "mode" of the resource group dictates the sharing of its
        allocations. A "shareable" resource group allows sharing of its
        allocations while an "exclusive" resource group does not. A
        cache pseudo-locked region is created by first writing
        "pseudo-locksetup" to the "mode" file before writing the cache
        pseudo-locked region's schemata to the resource group's "schemata"
        file. On successful pseudo-locked region creation the mode will
        automatically change to "pseudo-locked".
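
For example, the current mode of a group can be inspected directly (a newly
created group such as "p0" above starts out as "shareable")::

  # cat /sys/fs/resctrl/p0/mode
  shareable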

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see the example section for more details on usage.

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.
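
For example, the LLC occupancy counted at the root level for L3 domain 0 can
be read from the root group's "mon_data" directory (the value shown is
illustrative)::

  # cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
  31234000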


Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSid (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based on
the kind of group. The number of CLOSids and RMIDs is limited by the
hardware and hence the creation of a "CTRL_MON" directory may fail if we
run out of either CLOSID or RMID and creation of a "MON" group may fail
if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of the previous user of the
RMID. Hence such RMIDs are placed on a limbo list and checked periodically
to see if the cache occupancy has gone down. If the system has a lot of
limbo RMIDs which are not yet ready to be used, the user may see an -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
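
For example, on a system where cache index3 corresponds to the L3 cache
(the index number of a given cache level can differ between platforms),
the L3 cache ID of CPU 0 can be read like this::

  # cat /sys/devices/system/cpu/cpu0/cache/index3/id
  0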

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
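
As a small sketch (assuming a 20-bit mask and an already existing group
"p0"), the lowest quarter of the cache on cache ID 0 could be selected
like this::

  # cat info/L3/cbm_mask
  fffff
  # echo "L3:0=1f" > p0/schemata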

Memory bandwidth Allocation and monitoring
==========================================

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see an increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

        "actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
-------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or

        L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.
::

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.
::

        MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
“locked” data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
     writing "1" to the pseudo_lock_measure file will trigger the latency
     measurement captured in the pseudo_lock_mem_latency tracepoint. See
     example below.
2:
     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l2 tracepoint. See example below.
3:
     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.
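
The available tracepoints can be listed from the tracing events directory
(the listing below is illustrative)::

  # ls /sys/kernel/debug/tracing/events/resctrl/
  enable  filter  pseudo_lock_l2  pseudo_lock_l3  pseudo_lock_mem_latency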

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/debug/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
      Dropped: 0

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/debug/tracing/trace
  # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/debug/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
   pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specify the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can
enter the max b/w in MB rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max
b/w of 1024MB whereas on socket 1 they would use 500MB.
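
Tasks can then be placed under the control of these groups by writing their
PIDs to the respective "tasks" files (the PIDs below are illustrative)::

  # echo 1234 > /sys/fs/resctrl/p0/tasks
  # echo 5678 > /sys/fs/resctrl/p1/tasks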

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assuming min_bandwidth is 10 and bandwidth_gran
is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

  # echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

  # echo F0 > p0/cpus

4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

  # cat schemata
  L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # cat p0/mode
  shareable
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

  # echo 'L2:0=0xfc;1=0xfc' > schemata
  # echo exclusive > p0/mode
  # grep . p0/*
  p0/cpus:0
  p0/mode:exclusive
  p0/schemata:L2:0=03;1=03
  p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group::

  # mkdir p1
  # grep . p1/*
  p1/cpus:0
  p1/mode:shareable
  p1/schemata:L2:0=fc;1=fc
  p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used::

  # cat info/L2/bit_usage
  0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
region is exposed at /dev/pseudo_lock/newlock that can be provided to
application for argument to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::

  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSSS
  # echo 'L2:1=0xfc' > schemata
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSPP
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

::

  /*
   * Example code to access one page of pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * It is required that the application runs with affinity to only
   * cores associated with the pseudo-locked region. Here the cpu
   * is hardcoded for convenience of example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
          cpu_set_t cpuset;
          long page_size;
          void *mapping;
          int dev_fd;
          int ret;

          page_size = sysconf(_SC_PAGESIZE);

          CPU_ZERO(&cpuset);
          CPU_SET(cpuid, &cpuset);
          ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
          if (ret < 0) {
                  perror("sched_setaffinity");
                  exit(EXIT_FAILURE);
          }

          dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
          if (dev_fd < 0) {
                  perror("open");
                  exit(EXIT_FAILURE);
          }

          mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                         dev_fd, 0);
          if (mapping == MAP_FAILED) {
                  perror("mmap");
                  close(dev_fd);
                  exit(EXIT_FAILURE);
          }

          /* Application interacts with pseudo-locked memory @mapping */

          ret = munmap(mapping, page_size);
          if (ret < 0) {
                  perror("munmap");
                  close(dev_fd);
                  exit(EXIT_FAILURE);
          }

          close(dev_fd);
          exit(EXIT_SUCCESS);
  }

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

 1. Read the cbmmasks from each directory or the per-resource "bit_usage"
 2. Find a contiguous set of bits in the global CBM bitmask that is clear
    in any of the directory cbmmasks
 3. Create a new directory
Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is not
     set in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock: flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) On success, read the directory structure.
 C) Release the lock: flock(LOCK_UN)

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask=$(function-of output.txt)  # "function-of" is a placeholder for the mask computation
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/file.h>

  void resctrl_take_shared_lock(int fd)
  {
          int ret;

          /* take shared lock on resctrl filesystem */
          ret = flock(fd, LOCK_SH);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
          int ret;

          /* take exclusive lock on resctrl filesystem */
          ret = flock(fd, LOCK_EX);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_release_lock(int fd)
  {
          int ret;

          /* release lock on resctrl filesystem */
          ret = flock(fd, LOCK_UN);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  int main(void)
  {
          int fd;

          fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
          if (fd == -1) {
                  perror("open");
                  exit(-1);
          }
          resctrl_take_shared_lock(fd);
          /* code to read directory contents */
          resctrl_release_lock(fd);

          resctrl_take_exclusive_lock(fd);
          /* code to read and write directory contents */
          resctrl_release_lock(fd);

          close(fd);
          return 0;
  }
Examples for RDT Monitoring along with allocation usage
========================================================

Reading monitored data
----------------------

Reading an event file (for example mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON group or
CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
--------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% of cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

Fetch the data (shown in bytes)
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

  # echo $$ > /sys/fs/resctrl/p1/tasks
  # <cmd>

Fetch the data::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
-----------------------------------------------------------------------

Assume a system like HSW (Haswell) has only CQM and no CAT support. In this
case resctrl will still mount but cannot create CTRL_MON directories. The
user can, however, create different MON groups within the root group and
thereby monitor all tasks, including kernel threads.

This can also be used to profile a job's cache size footprint before
assigning it to an allocation group (see the sketch after this example).
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01
  # mkdir mon_groups/m02

  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per-domain data. From the output
below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
::

  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789
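As a sketch of the follow-up step mentioned above: on a system that does
support CAT, the profiled task could afterwards be placed into an allocation
group sized according to the measured footprint. The group name, mask and
task id below are illustrative only::

  # mkdir /sys/fs/resctrl/p0
  # echo "L3:0=0x3;1=0x3" > /sys/fs/resctrl/p0/schemata
  # echo 3478 > /sys/fs/resctrl/p0/tasks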

Example 4 (Monitor real time tasks)
-----------------------------------

Consider a single socket system with real-time tasks running on cores 4-7
and non-real-time tasks on the other CPUs. We want to monitor the cache
occupancy of the real-time threads on these cores.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p1

Move cpus 4-7 over to p1::

  # echo f0 > p1/cpus

View the LLC occupancy snapshot::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  11234000
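The snapshot above captures the occupancy at a single point in time. To
follow the occupancy of the real-time group over time, the event file can
simply be read periodically, for example (a minimal sketch; the one second
interval is arbitrary)::

  # while true; do cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy; sleep 1; done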