.. _perf_security:

Perf events and tool security
=============================

Overview
--------

Usage of Performance Counters for Linux (perf_events) [1]_ , [2]_ , [3]_
can pose a considerable risk of leaking sensitive data accessed by
monitored processes. Data leakage is possible both through direct use of
the perf_events system call API [2]_ and through data files generated by
the Perf tool user mode utility (Perf) [3]_ , [4]_ . The risk depends on
the nature of the data that perf_events performance monitoring units
(PMU) [2]_ and Perf collect and expose for performance analysis.
Collected system and performance data may be split into several
categories:

1. System hardware and software configuration data, for example: the CPU
   model and its cache configuration, the amount of available memory and
   its topology, the kernel and Perf versions in use, performance
   monitoring setup including experiment time, events configuration,
   Perf command line parameters, etc.

2. User and kernel module paths and their load addresses with sizes,
   process and thread names with their PIDs and TIDs, timestamps for
   captured hardware and software events.

3. Content of kernel software counters (e.g., for context switches, page
   faults, CPU migrations), architectural hardware performance counters
   (PMC) [8]_ and model specific registers (MSR) [9]_ that provide
   execution metrics for various monitored parts of the system (e.g.,
   memory controller (IMC), interconnect (QPI/UPI) or peripheral (PCIe)
   uncore counters) without direct attribution to any execution context
   state.

4. Content of architectural execution context registers (e.g., RIP, RSP,
   RBP on x86_64), process user and kernel space memory addresses and
   data, content of various architectural MSRs that capture data from
   this category.

Data that belong to the fourth category can potentially contain
sensitive process data. If PMUs in some monitoring modes capture values
of execution context registers or data from process memory, then access
to such monitoring modes must be regulated and secured properly.
Therefore, perf_events performance monitoring and observability
operations are subject to security access control management [5]_ .

perf_events access control
--------------------------

To perform security checks, the Linux implementation splits processes
into two categories [6]_ : a) privileged processes (whose effective user
ID is 0, referred to as superuser or root), and b) unprivileged
processes (whose effective UID is nonzero). Privileged processes bypass
all kernel security permission checks, so perf_events performance
monitoring is fully available to privileged processes without access,
scope and resource restrictions.

Unprivileged processes are subject to a full security permission check
based on the process's credentials [5]_ (usually: effective UID,
effective GID, and supplementary group list).

Linux divides the privileges traditionally associated with superuser
into distinct units, known as capabilities [6]_ , which can be
independently enabled and disabled on a per-thread basis for processes
and files of unprivileged users.

Unprivileged processes with the CAP_PERFMON capability enabled are
treated as privileged processes with respect to perf_events performance
monitoring and observability operations and thus bypass *scope*
permission checks in the kernel. CAP_PERFMON implements the principle of
least privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance
monitoring and observability operations in the kernel and provides a
secure approach to performance monitoring and observability in the
system.

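As a minimal sketch, one way to see whether a process currently holds
CAP_PERFMON is to test bit 38 of the effective capability mask that the
kernel reports in the CapEff field of /proc/<pid>/status. The value 38 is
the numeric form of cap_perfmon used in the setcap example in this
document; the helper name has_cap_bit is hypothetical and the script
assumes bash:

::

   #!/bin/bash
   # has_cap_bit MASK BIT: succeed if bit BIT is set in the hexadecimal
   # capability mask MASK (hypothetical helper, for illustration only).
   has_cap_bit() {
       local mask=$1 bit=$2
       (( (0x$mask >> bit) & 1 ))
   }

   # Effective capability mask of the current shell, if /proc is readable.
   capeff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)
   if [ -n "$capeff" ] && has_cap_bit "$capeff" 38; then
       echo "CAP_PERFMON is effective"
   else
       echo "CAP_PERFMON is not effective"
   fi
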
For backward compatibility reasons, access to perf_events monitoring and
observability operations is also open to CAP_SYS_ADMIN privileged
processes, but CAP_SYS_ADMIN usage for secure monitoring and
observability use cases is discouraged in favor of the CAP_PERFMON
capability. If system audit records [14]_ for a process using the
perf_events system call API contain denial records for acquiring both
the CAP_PERFMON and CAP_SYS_ADMIN capabilities, then providing the
process with the CAP_PERFMON capability alone is recommended as the
preferred secure approach to resolve the double access denial logging
related to usage of performance monitoring and observability.

Unprivileged processes using the perf_events system call are also
subject to a PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ ,
whose outcome determines whether monitoring is permitted. So
unprivileged processes provided with the CAP_SYS_PTRACE capability are
effectively permitted to pass the check.

Other capabilities granted to unprivileged processes can enable
capturing of additional data required for later performance analysis of
monitored processes or a system. For example, the CAP_SYSLOG capability
permits reading kernel space memory addresses from the /proc/kallsyms
file.

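A rough way to observe this effect is to check whether /proc/kallsyms
shows any non-zero address to the current process. The sketch below
assumes that hidden addresses are printed as zeros, which depends on
kernel settings that this document does not cover; the helper name
kallsyms_has_addresses is hypothetical:

::

   #!/bin/bash
   # kallsyms_has_addresses: read kallsyms-style lines on standard input
   # and succeed if at least one address field is not all zeros
   # (hypothetical helper, for illustration only).
   kallsyms_has_addresses() {
       awk '$1 !~ /^0+$/ { found = 1; exit } END { exit !found }'
   }

   if [ -r /proc/kallsyms ]; then
       if kallsyms_has_addresses < /proc/kallsyms; then
           echo "kernel addresses are visible to this process"
       else
           echo "kernel addresses are hidden from this process"
       fi
   fi
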
Privileged Perf users groups
----------------------------

Mechanisms of capabilities, privileged capability-dumb files [6]_ and
file system ACLs [10]_ can be used to create dedicated groups of
privileged Perf users who are permitted to execute performance monitoring
and observability without scope limits. The following steps can be
taken to create such groups of privileged Perf users.

1. Create perf_users group of privileged Perf users, assign perf_users
   group to Perf tool executable and limit access to the executable for
   other users in the system who are not in the perf_users group:

::

   # groupadd perf_users
   # ls -alhF
   -rwxr-xr-x  2 root root  11M Oct 19 15:12 perf
   # chgrp perf_users perf
   # ls -alhF
   -rwxr-xr-x  2 root perf_users  11M Oct 19 15:12 perf
   # chmod o-rwx perf
   # ls -alhF
   -rwxr-x---  2 root perf_users  11M Oct 19 15:12 perf

2. Assign the required capabilities to the Perf tool executable file and
   enable members of perf_users group with monitoring and observability
   privileges [6]_ :

::

   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
   perf: OK
   # getcap perf
   perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep

If the installed libcap doesn't yet support "cap_perfmon", use "38"
instead, i.e.:

::

   # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf

Note that you may need to have 'cap_ipc_lock' in the mix for tools such
as 'perf top'. Alternatively, use 'perf top -m N' to reduce the memory
that it uses for the perf ring buffer; see the memory allocation section
below.

Using a libcap without support for CAP_PERFMON makes cap_get_flag(caps, 38,
CAP_EFFECTIVE, &val) fail, which causes the default event to be 'cycles:u'.
As a workaround, explicitly ask for the 'cycles' event to get kernel and
user samples with a perf binary that has just CAP_PERFMON, i.e.:

::

   # perf top -e cycles

As a result, members of perf_users group are capable of conducting
performance monitoring and observability by using functionality of the
configured Perf tool executable which, when executed, passes perf_events
subsystem scope checks.

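The steps above create the perf_users group but do not show how a user
joins it. A minimal sketch, assuming the standard usermod and id
utilities; the account name alice and the helper name in_group are
hypothetical:

::

   #!/bin/bash
   # in_group GROUP NAME...: succeed if GROUP appears among the given
   # group names (hypothetical helper, for illustration only).
   in_group() {
       local wanted=$1 g
       shift
       for g in "$@"; do
           [ "$g" = "$wanted" ] && return 0
       done
       return 1
   }

   # As root, add the hypothetical user "alice" to perf_users:
   #     usermod -a -G perf_users alice
   # Then, as that user after logging in again, verify the membership:
   if in_group perf_users $(id -nG); then
       echo "current user is a member of perf_users"
   else
       echo "current user is not a member of perf_users"
   fi
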
This specific access control management is only available to the
superuser (root) or to processes running with the CAP_SETPCAP,
CAP_SETFCAP [6]_ capabilities.

Unprivileged users
------------------

perf_events *scope* and *access* control for unprivileged processes
is governed by the perf_event_paranoid [2]_ setting:

-1:
     Impose no *scope* and *access* restrictions on using perf_events
     performance monitoring. The per-user per-cpu perf_event_mlock_kb
     [2]_ locking limit is ignored when allocating memory buffers for
     storing performance data. This is the least secure mode since the
     allowed monitored *scope* is maximized and no perf_events specific
     limits are imposed on *resources* allocated for performance
     monitoring.

>=0:
     *scope* includes per-process and system wide performance monitoring
     but excludes raw tracepoints and ftrace function tracepoints
     monitoring. CPU and system events that occur when executing either
     in user or in kernel space can be monitored and captured for later
     analysis. The per-user per-cpu perf_event_mlock_kb locking limit is
     imposed but ignored for unprivileged processes with the
     CAP_IPC_LOCK [6]_ capability.

>=1:
     *scope* includes per-process performance monitoring only and
     excludes system wide performance monitoring. CPU and system events
     that occur when executing either in user or in kernel space can be
     monitored and captured for later analysis. The per-user per-cpu
     perf_event_mlock_kb locking limit is imposed but ignored for
     unprivileged processes with the CAP_IPC_LOCK capability.

>=2:
     *scope* includes per-process performance monitoring only. CPU and
     system events that occur when executing in user space only can be
     monitored and captured for later analysis. The per-user per-cpu
     perf_event_mlock_kb locking limit is imposed but ignored for
     unprivileged processes with the CAP_IPC_LOCK capability.

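The levels above can be summarized with a small script that reads the
current setting from /proc/sys/kernel/perf_event_paranoid and prints the
matching scope. This is only a sketch that assumes bash; the helper name
describe_paranoid and its summary strings are hypothetical:

::

   #!/bin/bash
   # describe_paranoid LEVEL: print the monitoring scope that the list
   # above associates with a perf_event_paranoid value
   # (hypothetical helper, for illustration only).
   describe_paranoid() {
       local level=$1
       if (( level >= 2 )); then
           echo "per-process, user space only"
       elif (( level >= 1 )); then
           echo "per-process, user and kernel space"
       elif (( level >= 0 )); then
           echo "per-process and system wide, no raw tracepoints"
       else
           echo "no scope or access restrictions"
       fi
   }

   f=/proc/sys/kernel/perf_event_paranoid
   if [ -r "$f" ]; then
       describe_paranoid "$(cat "$f")"
   fi
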
Resource control
----------------

Open file descriptors
+++++++++++++++++++++

The perf_events system call API [2]_ allocates file descriptors for
every configured PMU event. Open file descriptors are a per-process
accountable resource governed by the RLIMIT_NOFILE [11]_ limit
(ulimit -n), which is usually derived from the login shell process. When
configuring Perf collection for a long list of events on a large server
system, this limit can easily be hit, preventing the required monitoring
configuration. The RLIMIT_NOFILE limit can be increased on a per-user
basis by modifying the content of the limits.conf file [12]_ .
Ordinarily, a Perf sampling session (perf record) requires a number of
open perf_event file descriptors that is not less than the number of
monitored events multiplied by the number of monitored CPUs.

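The lower bound described above can be estimated before starting a
session and compared with the current soft limit. A minimal sketch,
assuming bash and a session that monitors all online CPUs; the helper
name perf_fds_needed and the event count of 12 are hypothetical:

::

   #!/bin/bash
   # perf_fds_needed EVENTS CPUS: lower bound on the number of perf_event
   # file descriptors for a sampling session (hypothetical helper).
   perf_fds_needed() {
       echo $(( $1 * $2 ))
   }

   events=12                            # illustrative number of events
   cpus=$(getconf _NPROCESSORS_ONLN)    # online CPUs on this machine
   needed=$(perf_fds_needed "$events" "$cpus")
   echo "need at least $needed descriptors, soft limit is $(ulimit -n)"
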
Memory allocation
+++++++++++++++++

The amount of memory available to user processes for capturing
performance monitoring data is governed by the perf_event_mlock_kb [2]_
setting. This perf_event specific resource setting defines the overall
per-cpu limit of memory that user processes are allowed to map in order
to execute performance monitoring. The setting essentially extends the
RLIMIT_MEMLOCK [11]_ limit, but only for memory regions mapped
specifically for capturing monitored performance events and related data.

For example, if a machine has eight cores and the perf_event_mlock_kb
limit is set to 516 KiB, then a user process is provided with 516 KiB * 8
= 4128 KiB of memory above the RLIMIT_MEMLOCK limit (ulimit -l) for
perf_event mmap buffers. In particular, this means that, if the user
wants to start two or more performance monitoring processes, the user is
required to manually distribute the available 4128 KiB between the
monitoring processes, for example, using the --mmap-pages Perf record
mode option. Otherwise, the first started performance monitoring process
allocates all available 4128 KiB and the other processes will fail to
proceed due to the lack of memory.

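The arithmetic of this example can be sketched as follows. The even
split between monitoring processes is only one possible way to
distribute the budget and is an assumption of this sketch, as are the
helper names; translating a share into a --mmap-pages value is left out:

::

   #!/bin/bash
   # perf_mlock_budget_kib LIMIT_KIB CPUS: memory in KiB available for
   # perf_event mmap buffers above RLIMIT_MEMLOCK (hypothetical helper).
   perf_mlock_budget_kib() {
       echo $(( $1 * $2 ))
   }

   # perf_mlock_share_kib LIMIT_KIB CPUS PROCS: even share of that budget
   # for each of PROCS monitoring processes, rounded down
   # (hypothetical helper).
   perf_mlock_share_kib() {
       echo $(( $1 * $2 / $3 ))
   }

   perf_mlock_budget_kib 516 8     # prints 4128
   perf_mlock_share_kib 516 8 2    # prints 2064
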
RLIMIT_MEMLOCK and perf_event_mlock_kb resource constraints are ignored
for processes with the CAP_IPC_LOCK capability. Thus, perf_events/Perf
privileged users can be provided with memory above the constraints for
perf_events/Perf performance monitoring purposes by providing the Perf
executable with the CAP_IPC_LOCK capability.

Bibliography
------------

.. [1] `<https://lwn.net/Articles/337493/>`_
.. [2] `<http://man7.org/linux/man-pages/man2/perf_event_open.2.html>`_
.. [3] `<http://web.eece.maine.edu/~vweaver/projects/perf_events/>`_
.. [4] `<https://perf.wiki.kernel.org/index.php/Main_Page>`_
.. [5] `<https://www.kernel.org/doc/html/latest/security/credentials.html>`_
.. [6] `<http://man7.org/linux/man-pages/man7/capabilities.7.html>`_
.. [7] `<http://man7.org/linux/man-pages/man2/ptrace.2.html>`_
.. [8] `<https://en.wikipedia.org/wiki/Hardware_performance_counter>`_
.. [9] `<https://en.wikipedia.org/wiki/Model-specific_register>`_
.. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_
.. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_
.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
.. [13] `<https://sites.google.com/site/fullycapable>`_
.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_