.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL kernel option and is
indicated by the following x86 /proc/cpuinfo flag bits:

=============================================	================================
RDT (Resource Director Technology) Allocation	"rdt_a"
CAT (Cache Allocation Technology)		"cat_l3", "cat_l2"
CDP (Code and Data Prioritization)		"cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)			"cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)		"cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)		"mba"
=============================================	================================
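
A quick way to see which of these capabilities the platform exposes is to
look for the flags directly; the output below is illustrative and will
differ per system::

  # grep -o 'rdt_a\|cat_l3\|cdp_l3\|cqm_llc\|cqm_mbm_total\|mba' /proc/cpuinfo | sort -u
  cat_l3
  cqm_llc
  cqm_mbm_total
  mba
  rdt_a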

To use the feature mount the file system::

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp":
	Enable code/data prioritization in L3 cache allocations.
"cdpl2":
	Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
	Enable the MBA Software Controller (mba_sc) to specify MBA
	bandwidth in MBps.

L2 and L3 CDP are controlled separately.
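
For example, on a platform that supports both L3 CDP and MBA, the file
system could be mounted with code/data prioritization and the software
controller enabled::

 # mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl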

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.  Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":
		The number of CLOSIDs which are valid for this
		resource. The kernel uses the smallest number of
		CLOSIDs of all enabled resources as limit.
"cbm_mask":
		The bitmask which is valid for this resource.
		This mask is equivalent to 100%.
"min_cbm_bits":
		The minimum number of consecutive bits which
		must be set when writing a mask.

"shareable_bits":
		Bitmask of shareable resource with other executing
		entities (e.g. I/O). The user can use this when
		setting up exclusive cache partitions. Note that
		some platforms support devices that have their
		own settings for cache use which can over-ride
		these bits.
"bit_usage":
		Annotated capacity bitmasks showing how all
		instances of the resource are used. The legend is:

			"0":
			      Corresponding region is unused. When the system's
			      resources have been allocated and a "0" is found
			      in "bit_usage" it is a sign that resources are
			      wasted.

			"H":
			      Corresponding region is used by hardware only
			      but available for software use. If a resource
			      has bits set in "shareable_bits" but not all
			      of these bits appear in the resource groups'
			      schematas, then the bits that appear in
			      "shareable_bits" but in no resource group will
			      be marked as "H".
			"X":
			      Corresponding region is available for sharing and
			      used by hardware and software. These are the
			      bits that appear in "shareable_bits" as
			      well as a resource group's allocation.
			"S":
			      Corresponding region is used by software
			      and available for sharing.
			"E":
			      Corresponding region is used exclusively by
			      one resource group. No sharing allowed.
			"P":
			      Corresponding region is pseudo-locked. No
			      sharing allowed.

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
		The minimum memory bandwidth percentage which
		the user can request.

"bandwidth_gran":
		The granularity in which the memory bandwidth
		percentage is allocated. The allocated
		b/w percentage is rounded off to the next
		control step available on the hardware. The
		available bandwidth control steps are:
		min_bandwidth + N * bandwidth_gran.

"delay_linear":
		Indicates if the delay scale is linear or
		non-linear. This field is purely informational.

"thread_throttle_mode":
		Indicator on Intel systems of how tasks running on threads
		of a physical core are throttled in cases where they
		request different memory bandwidth percentages:

		"max":
			the smallest percentage is applied
			to all threads
		"per-thread":
			bandwidth percentages are directly applied to
			the threads running on the core

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
		The number of RMIDs available. This is the
		upper bound for how many "CTRL_MON" + "MON"
		groups can be created.

"mon_features":
		Lists the monitoring events if
		monitoring is enabled for the resource.

"max_threshold_occupancy":
		Read/write file provides the largest value (in
		bytes) at which a previously used LLC_occupancy
		counter can be considered for re-use.
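
These info files can simply be read back to discover the platform's
parameters; the values shown here are illustrative::

  # cat /sys/fs/resctrl/info/L3/cbm_mask
  fffff
  # cat /sys/fs/resctrl/info/L3/num_closids
  16
  # cat /sys/fs/resctrl/info/L3_MON/mon_features
  llc_occupancy
  mbm_total_bytes
  mbm_local_bytes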

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

	# echo L3:0=f7 > schemata
	bash: echo: write error: Invalid argument
	# cat info/last_cmd_status
	mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system.  The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
	Reading this file shows the list of all tasks that belong to
	this group. Writing a task id to the file will add a task to the
	group. If the group is a CTRL_MON group the task is removed from
	whichever previous CTRL_MON group owned the task and also from
	any MON group that owned the task. If the group is a MON group,
	then the task must already belong to the CTRL_MON parent of this
	group. The task is removed from any previous MON group.


"cpus":
	Reading this file shows a bitmask of the logical CPUs owned by
	this group. Writing a mask to this file will add and remove
	CPUs to/from this group. As with the tasks file a hierarchy is
	maintained where MON groups may only include CPUs owned by the
	parent CTRL_MON group.
	When the resource group is in pseudo-locked mode this file will
	only be readable, reflecting the CPUs associated with the
	pseudo-locked region.


"cpus_list":
	Just like "cpus", only using ranges of CPUs instead of bitmasks.
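
A minimal sketch of populating a group, run from the resctrl mount point
and assuming a group "p0" already exists, that task 1234 is running, and
that CPUs 4-7 are present on the system::

  # echo 1234 > p0/tasks
  # echo f0 > p0/cpus
  # cat p0/cpus_list
  4-7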


When control is enabled all CTRL_MON groups will also contain:

"schemata":
	A list of all the resources available to this group.
	Each resource has its own line and format - see below for details.

"size":
	Mirrors the display of the "schemata" file to display the size in
	bytes of each allocation instead of the bits representing the
	allocation.

"mode":
	The "mode" of the resource group dictates the sharing of its
	allocations. A "shareable" resource group allows sharing of its
	allocations while an "exclusive" resource group does not. A
	cache pseudo-locked region is created by first writing
	"pseudo-locksetup" to the "mode" file before writing the cache
	pseudo-locked region's schemata to the resource group's "schemata"
	file. On successful pseudo-locked region creation the mode will
	automatically change to "pseudo-locked".

When monitoring is enabled all MON groups will also contain:

"mon_data":
	This contains a set of files organized by L3 domain and by
	RDT event. E.g. on a system with two L3 domains there will
	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
	directories has one file per event (e.g. "llc_occupancy",
	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
	files provide a read out of the current value of the event for
	all tasks in the group. In CTRL_MON groups these files provide
	the sum for all tasks in the CTRL_MON group and all tasks in
	MON groups. Please see example section for more details on usage.
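
For example, the current LLC occupancy (in bytes) of a group "p0" could be
read like this; the path layout follows the description above and the value
is illustrative::

  # cat p0/mon_data/mon_L3_00/llc_occupancy
  1234000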

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group,
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.
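
Putting these rules together, a MON group can be created under an existing
CTRL_MON group to monitor a subset of its tasks. A sketch, assuming a group
"p0" exists and task 5678 already belongs to it (the byte count shown is
illustrative)::

  # mkdir p0/mon_groups/m01
  # echo 5678 > p0/mon_groups/m01/tasks
  # cat p0/mon_groups/m01/mon_data/mon_L3_00/mbm_local_bytes
  1052672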


Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based on
the kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware and hence the creation of a "CTRL_MON" directory may fail if we
run out of either CLOSID or RMID, and creation of a "MON" group may fail
if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of its previous user.
Hence such RMIDs are placed on a limbo list and only recycled once their
cache occupancy has gone down. If the system has many limbo RMIDs that
are not yet ready to be used, the user may see an -EBUSY during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
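
The threshold can be inspected and, if desired, raised so that
lightly-occupied RMIDs are recycled sooner. A sketch, run from the resctrl
mount point; the values are illustrative and are specified in bytes::

  # cat info/L3_MON/max_threshold_occupancy
  65536
  # echo 131072 > info/L3_MON/max_threshold_occupancy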

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id.
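
For example, the L3 cache ID that CPU 0 belongs to can be read directly
(index3 is typically the L3 cache; the ID shown is illustrative)::

  # cat /sys/devices/system/cpu/cpu0/cache/index3/id
  0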

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
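
A short sketch of carving out one of those quarters, assuming a 20-bit
mask, a single L3 domain (cache id 0) and an already created group "p0";
the read-back output is trimmed to the L3 line::

  # cat info/L3/cbm_mask
  fffff
  # echo "L3:0=f8000" > p0/schemata
  # cat p0/schemata
  L3:0=f8000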

Memory bandwidth Allocation and monitoring
==========================================

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
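
As an illustrative sketch, assuming min_bandwidth is 10, bandwidth_gran is
10 and a group "p0" exists, a request that is not a multiple of the
granularity is rounded to the next control step (read-back output trimmed
to the MB line)::

  # echo "MB:0=35" > p0/schemata
  # cat p0/schemata
  MB:0=40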

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external
bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a workload
with '20 threads, having 50% bandwidth, each consuming 5GBps' consumes the
max L3 bandwidth of 100GBps although the percentage value specified is only
50% << 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well.  The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

	"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or::

	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.
::

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.
::

	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change.  E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
“locked” data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.
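
For example, if the pseudo-locked region resides in a cache shared by cores
2 and 3, a (hypothetical) application could be pinned to those cores before
it maps the region; the core numbers and program name are illustrative::

  # taskset -c 2,3 ./pseudo_lock_app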

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
     writing "1" to the pseudo_lock_measure file will trigger the latency
     measurement captured in the pseudo_lock_mem_latency tracepoint. See
     example below.
2:
     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l2 tracepoint. See example below.
3:
     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/debug/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
    Dropped: 0

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/debug/tracing/trace
  # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/debug/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0

Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can
enter the max b/w in MBps rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MBps whereas on socket 1 they would use 500MBps.

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket, dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

  # echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

  # echo F0 > p0/cpus

4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

  # cat schemata
  L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # cat p0/mode
  shareable
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

  # echo 'L2:0=0xfc;1=0xfc' > schemata
  # echo exclusive > p0/mode
  # grep . p0/*
  p0/cpus:0
  p0/mode:exclusive
  p0/schemata:L2:0=03;1=03
  p0/size:L2:0=262144;1=262144

A new resource group will, on creation, not overlap with an exclusive
resource group::

  # mkdir p1
  # grep . p1/*
  p1/cpus:0
  p1/mode:shareable
  p1/schemata:L2:0=fc;1=fc
  p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used::

  # cat info/L2/bit_usage
  0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache of cache id 1 using CBM 0x3. The pseudo-locked
region is exposed at /dev/pseudo_lock/newlock and can be provided to an
application as the argument to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::

  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSSS
  # echo 'L2:1=0xfc' > schemata
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSPP
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

903*4882a593Smuzhiyun::
904*4882a593Smuzhiyun
905*4882a593Smuzhiyun  /*
906*4882a593Smuzhiyun  * Example code to access one page of pseudo-locked cache region
907*4882a593Smuzhiyun  * from user space.
908*4882a593Smuzhiyun  */
909*4882a593Smuzhiyun  #define _GNU_SOURCE
910*4882a593Smuzhiyun  #include <fcntl.h>
911*4882a593Smuzhiyun  #include <sched.h>
912*4882a593Smuzhiyun  #include <stdio.h>
913*4882a593Smuzhiyun  #include <stdlib.h>
914*4882a593Smuzhiyun  #include <unistd.h>
915*4882a593Smuzhiyun  #include <sys/mman.h>
916*4882a593Smuzhiyun
917*4882a593Smuzhiyun  /*
918*4882a593Smuzhiyun  * It is required that the application runs with affinity to only
919*4882a593Smuzhiyun  * cores associated with the pseudo-locked region. Here the cpu
920*4882a593Smuzhiyun  * is hardcoded for convenience of example.
921*4882a593Smuzhiyun  */
922*4882a593Smuzhiyun  static int cpuid = 2;
923*4882a593Smuzhiyun
924*4882a593Smuzhiyun  int main(int argc, char *argv[])
925*4882a593Smuzhiyun  {
926*4882a593Smuzhiyun    cpu_set_t cpuset;
927*4882a593Smuzhiyun    long page_size;
928*4882a593Smuzhiyun    void *mapping;
929*4882a593Smuzhiyun    int dev_fd;
930*4882a593Smuzhiyun    int ret;
931*4882a593Smuzhiyun
932*4882a593Smuzhiyun    page_size = sysconf(_SC_PAGESIZE);
933*4882a593Smuzhiyun
934*4882a593Smuzhiyun    CPU_ZERO(&cpuset);
935*4882a593Smuzhiyun    CPU_SET(cpuid, &cpuset);
936*4882a593Smuzhiyun    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
937*4882a593Smuzhiyun    if (ret < 0) {
938*4882a593Smuzhiyun      perror("sched_setaffinity");
939*4882a593Smuzhiyun      exit(EXIT_FAILURE);
940*4882a593Smuzhiyun    }
941*4882a593Smuzhiyun
942*4882a593Smuzhiyun    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
943*4882a593Smuzhiyun    if (dev_fd < 0) {
944*4882a593Smuzhiyun      perror("open");
945*4882a593Smuzhiyun      exit(EXIT_FAILURE);
946*4882a593Smuzhiyun    }
947*4882a593Smuzhiyun
948*4882a593Smuzhiyun    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
949*4882a593Smuzhiyun            dev_fd, 0);
950*4882a593Smuzhiyun    if (mapping == MAP_FAILED) {
951*4882a593Smuzhiyun      perror("mmap");
952*4882a593Smuzhiyun      close(dev_fd);
953*4882a593Smuzhiyun      exit(EXIT_FAILURE);
954*4882a593Smuzhiyun    }
955*4882a593Smuzhiyun
956*4882a593Smuzhiyun    /* Application interacts with pseudo-locked memory @mapping */
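    /*
    * Hypothetical illustration (an assumption, not part of the original
    * example): touch every byte of the page once so that later accesses
    * are served from the pseudo-locked cache region.
    */
    for (long i = 0; i < page_size; i++)
      ((volatile char *)mapping)[i] = 0;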
957*4882a593Smuzhiyun
958*4882a593Smuzhiyun    ret = munmap(mapping, page_size);
959*4882a593Smuzhiyun    if (ret < 0) {
960*4882a593Smuzhiyun      perror("munmap");
961*4882a593Smuzhiyun      close(dev_fd);
962*4882a593Smuzhiyun      exit(EXIT_FAILURE);
963*4882a593Smuzhiyun    }
964*4882a593Smuzhiyun
965*4882a593Smuzhiyun    close(dev_fd);
966*4882a593Smuzhiyun    exit(EXIT_SUCCESS);
967*4882a593Smuzhiyun  }
968*4882a593Smuzhiyun
969*4882a593SmuzhiyunLocking between applications
970*4882a593Smuzhiyun----------------------------
971*4882a593Smuzhiyun
972*4882a593SmuzhiyunCertain operations on the resctrl filesystem, composed of read/writes
973*4882a593Smuzhiyunto/from multiple files, must be atomic.
974*4882a593Smuzhiyun
975*4882a593SmuzhiyunAs an example, the allocation of an exclusive reservation of L3 cache
976*4882a593Smuzhiyuninvolves:
977*4882a593Smuzhiyun
  1. Read the CBMs (capacity bitmasks) from each directory or from the
     per-resource "bit_usage" file
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory CBMs (a sketch of this step follows the list)
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file
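
The search in step 2 is plain bit manipulation. A minimal C sketch of that
step (an illustration only, not part of resctrl; it assumes the CBMs read in
step 1 have already been OR-ed together into "used" and that the CBM length
was taken from the per-resource "cbm_mask" file)::

  /*
  * Illustrative helper: given the bitwise OR of all CBMs currently in
  * use, the CBM length and the number of contiguous bits wanted,
  * return a free contiguous mask, or 0 if none is available.
  */
  unsigned long find_free_cbm(unsigned long used, int cbm_len, int width)
  {
    unsigned long candidate = (1UL << width) - 1;
    int shift;

    for (shift = 0; shift + width <= cbm_len; shift++) {
      if (!(used & (candidate << shift)))
        return candidate << shift;
    }
    return 0;
  }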
983*4882a593Smuzhiyun
984*4882a593SmuzhiyunIf two applications attempt to allocate space concurrently then they can
985*4882a593Smuzhiyunend up allocating the same bits so the reservations are shared instead of
986*4882a593Smuzhiyunexclusive.
987*4882a593Smuzhiyun
988*4882a593SmuzhiyunTo coordinate atomic operations on the resctrlfs and to avoid the problem
989*4882a593Smuzhiyunabove, the following locking procedure is recommended:
990*4882a593Smuzhiyun
Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)
1005*4882a593Smuzhiyun
1006*4882a593SmuzhiyunExample with bash::
1007*4882a593Smuzhiyun
1008*4882a593Smuzhiyun  # Atomically read directory structure
1009*4882a593Smuzhiyun  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
1010*4882a593Smuzhiyun
1011*4882a593Smuzhiyun  # Read directory contents and create new subdirectory
1012*4882a593Smuzhiyun
1013*4882a593Smuzhiyun  $ cat create-dir.sh
1014*4882a593Smuzhiyun  find /sys/fs/resctrl/ > output.txt
  mask=$(function-of output.txt)   # compute a free schemata line (placeholder)
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata
1018*4882a593Smuzhiyun
1019*4882a593Smuzhiyun  $ flock /sys/fs/resctrl/ ./create-dir.sh
1020*4882a593Smuzhiyun
1021*4882a593SmuzhiyunExample with C::
1022*4882a593Smuzhiyun
  /*
  * Example code to take advisory locks
  * before accessing resctrl filesystem
  */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>
1029*4882a593Smuzhiyun
1030*4882a593Smuzhiyun  void resctrl_take_shared_lock(int fd)
1031*4882a593Smuzhiyun  {
1032*4882a593Smuzhiyun    int ret;
1033*4882a593Smuzhiyun
1034*4882a593Smuzhiyun    /* take shared lock on resctrl filesystem */
1035*4882a593Smuzhiyun    ret = flock(fd, LOCK_SH);
1036*4882a593Smuzhiyun    if (ret) {
1037*4882a593Smuzhiyun      perror("flock");
1038*4882a593Smuzhiyun      exit(-1);
1039*4882a593Smuzhiyun    }
1040*4882a593Smuzhiyun  }
1041*4882a593Smuzhiyun
1042*4882a593Smuzhiyun  void resctrl_take_exclusive_lock(int fd)
1043*4882a593Smuzhiyun  {
1044*4882a593Smuzhiyun    int ret;
1045*4882a593Smuzhiyun
    /* take exclusive lock on resctrl filesystem */
1047*4882a593Smuzhiyun    ret = flock(fd, LOCK_EX);
1048*4882a593Smuzhiyun    if (ret) {
1049*4882a593Smuzhiyun      perror("flock");
1050*4882a593Smuzhiyun      exit(-1);
1051*4882a593Smuzhiyun    }
1052*4882a593Smuzhiyun  }
1053*4882a593Smuzhiyun
1054*4882a593Smuzhiyun  void resctrl_release_lock(int fd)
1055*4882a593Smuzhiyun  {
1056*4882a593Smuzhiyun    int ret;
1057*4882a593Smuzhiyun
    /* release lock on resctrl filesystem */
1059*4882a593Smuzhiyun    ret = flock(fd, LOCK_UN);
1060*4882a593Smuzhiyun    if (ret) {
1061*4882a593Smuzhiyun      perror("flock");
1062*4882a593Smuzhiyun      exit(-1);
1063*4882a593Smuzhiyun    }
1064*4882a593Smuzhiyun  }
1065*4882a593Smuzhiyun
  int main(void)
  {
    int fd;
1069*4882a593Smuzhiyun
1070*4882a593Smuzhiyun    fd = open("/sys/fs/resctrl", O_DIRECTORY);
1071*4882a593Smuzhiyun    if (fd == -1) {
1072*4882a593Smuzhiyun      perror("open");
1073*4882a593Smuzhiyun      exit(-1);
1074*4882a593Smuzhiyun    }
1075*4882a593Smuzhiyun    resctrl_take_shared_lock(fd);
1076*4882a593Smuzhiyun    /* code to read directory contents */
1077*4882a593Smuzhiyun    resctrl_release_lock(fd);
1078*4882a593Smuzhiyun
1079*4882a593Smuzhiyun    resctrl_take_exclusive_lock(fd);
1080*4882a593Smuzhiyun    /* code to read and write directory contents */
1081*4882a593Smuzhiyun    resctrl_release_lock(fd);

    close(fd);
    return 0;
  }
1083*4882a593Smuzhiyun
1084*4882a593SmuzhiyunExamples for RDT Monitoring along with allocation usage
1085*4882a593Smuzhiyun=======================================================
1086*4882a593SmuzhiyunReading monitored data
1087*4882a593Smuzhiyun----------------------
Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.
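
Event files can also be read programmatically. Below is a minimal C sketch
(an illustration only, not from the kernel sources; the root group's
mon_data path is assumed to exist on the running system)::

  /*
  * Minimal sketch of reading one monitoring event from a C program.
  */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void)
  {
    char buf[64];
    ssize_t len;
    int fd;

    fd = open("/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy", O_RDONLY);
    if (fd < 0) {
      perror("open");
      exit(EXIT_FAILURE);
    }
    len = read(fd, buf, sizeof(buf) - 1);
    if (len < 0) {
      perror("read");
      close(fd);
      exit(EXIT_FAILURE);
    }
    buf[len] = '\0';
    printf("llc_occupancy: %llu bytes\n", strtoull(buf, NULL, 10));
    close(fd);
    return 0;
  }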
1091*4882a593Smuzhiyun
1092*4882a593Smuzhiyun
1093*4882a593SmuzhiyunExample 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
1094*4882a593Smuzhiyun------------------------------------------------------------------------
1095*4882a593SmuzhiyunOn a two socket machine (one L3 cache per socket) with just four bits
1096*4882a593Smuzhiyunfor cache bit masks::
1097*4882a593Smuzhiyun
1098*4882a593Smuzhiyun  # mount -t resctrl resctrl /sys/fs/resctrl
1099*4882a593Smuzhiyun  # cd /sys/fs/resctrl
1100*4882a593Smuzhiyun  # mkdir p0 p1
1101*4882a593Smuzhiyun  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
1102*4882a593Smuzhiyun  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
1103*4882a593Smuzhiyun  # echo 5678 > p1/tasks
1104*4882a593Smuzhiyun  # echo 5679 > p1/tasks
1105*4882a593Smuzhiyun
1106*4882a593SmuzhiyunThe default resource group is unmodified, so we have access to all parts
1107*4882a593Smuzhiyunof all caches (its schemata file reads "L3:0=f;1=f").
1108*4882a593Smuzhiyun
1109*4882a593SmuzhiyunTasks that are under the control of group "p0" may only allocate from the
1110*4882a593Smuzhiyun"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
1111*4882a593SmuzhiyunTasks in group "p1" use the "lower" 50% of cache on both sockets.
1112*4882a593Smuzhiyun
1113*4882a593SmuzhiyunCreate monitor groups and assign a subset of tasks to each monitor group.
1114*4882a593Smuzhiyun::
1115*4882a593Smuzhiyun
1116*4882a593Smuzhiyun  # cd /sys/fs/resctrl/p1/mon_groups
1117*4882a593Smuzhiyun  # mkdir m11 m12
1118*4882a593Smuzhiyun  # echo 5678 > m11/tasks
1119*4882a593Smuzhiyun  # echo 5679 > m12/tasks
1120*4882a593Smuzhiyun
Fetch the data (data shown in bytes)
1122*4882a593Smuzhiyun::
1123*4882a593Smuzhiyun
1124*4882a593Smuzhiyun  # cat m11/mon_data/mon_L3_00/llc_occupancy
1125*4882a593Smuzhiyun  16234000
1126*4882a593Smuzhiyun  # cat m11/mon_data/mon_L3_01/llc_occupancy
1127*4882a593Smuzhiyun  14789000
1128*4882a593Smuzhiyun  # cat m12/mon_data/mon_L3_00/llc_occupancy
1129*4882a593Smuzhiyun  16789000
1130*4882a593Smuzhiyun
The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1135*4882a593Smuzhiyun  31234000
1136*4882a593Smuzhiyun
1137*4882a593SmuzhiyunExample 2 (Monitor a task from its creation)
1138*4882a593Smuzhiyun--------------------------------------------
1139*4882a593SmuzhiyunOn a two socket machine (one L3 cache per socket)::
1140*4882a593Smuzhiyun
1141*4882a593Smuzhiyun  # mount -t resctrl resctrl /sys/fs/resctrl
1142*4882a593Smuzhiyun  # cd /sys/fs/resctrl
1143*4882a593Smuzhiyun  # mkdir p0 p1
1144*4882a593Smuzhiyun
An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
1147*4882a593Smuzhiyun::
1148*4882a593Smuzhiyun
1149*4882a593Smuzhiyun  # echo $$ > /sys/fs/resctrl/p1/tasks
1150*4882a593Smuzhiyun  # <cmd>
1151*4882a593Smuzhiyun
1152*4882a593SmuzhiyunFetch the data::
1153*4882a593Smuzhiyun
  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1155*4882a593Smuzhiyun  31789000
1156*4882a593Smuzhiyun
1157*4882a593SmuzhiyunExample 3 (Monitor without CAT support or before creating CAT groups)
1158*4882a593Smuzhiyun---------------------------------------------------------------------
1159*4882a593Smuzhiyun
Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but CTRL_MON directories cannot be created.
However, the user can create different MON groups within the root group
and thereby monitor all tasks, including kernel threads.
1164*4882a593Smuzhiyun
This can also be used to profile jobs' cache footprints before they are
assigned to different allocation groups.
1167*4882a593Smuzhiyun::
1168*4882a593Smuzhiyun
1169*4882a593Smuzhiyun  # mount -t resctrl resctrl /sys/fs/resctrl
1170*4882a593Smuzhiyun  # cd /sys/fs/resctrl
1171*4882a593Smuzhiyun  # mkdir mon_groups/m01
1172*4882a593Smuzhiyun  # mkdir mon_groups/m02
1173*4882a593Smuzhiyun
1174*4882a593Smuzhiyun  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
1175*4882a593Smuzhiyun  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
1176*4882a593Smuzhiyun
Monitor the groups separately and also get per-domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
1180*4882a593Smuzhiyun::
1181*4882a593Smuzhiyun
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789
1190*4882a593Smuzhiyun
1191*4882a593Smuzhiyun
1192*4882a593SmuzhiyunExample 4 (Monitor real time tasks)
1193*4882a593Smuzhiyun-----------------------------------
1194*4882a593Smuzhiyun
Consider a single socket system which has real-time tasks running on
cores 4-7 and non-real-time tasks on the other cpus. We want to monitor
the cache occupancy of the real-time threads on these cores.
1198*4882a593Smuzhiyun::
1199*4882a593Smuzhiyun
1200*4882a593Smuzhiyun  # mount -t resctrl resctrl /sys/fs/resctrl
1201*4882a593Smuzhiyun  # cd /sys/fs/resctrl
1202*4882a593Smuzhiyun  # mkdir p1
1203*4882a593Smuzhiyun
Move cpus 4-7 over to p1 (the "cpus" file takes a hexadecimal cpumask,
so "f0" selects cpus 4-7)::
1205*4882a593Smuzhiyun
1206*4882a593Smuzhiyun  # echo f0 > p1/cpus
1207*4882a593Smuzhiyun
1208*4882a593SmuzhiyunView the llc occupancy snapshot::
1209*4882a593Smuzhiyun
1210*4882a593Smuzhiyun  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1211*4882a593Smuzhiyun  11234000
1212