.. _perf_security:

Perf events and tool security
=============================

Overview
--------

Usage of Performance Counters for Linux (perf_events) [1]_ , [2]_ , [3]_
can impose a considerable risk of leaking sensitive data accessed by
monitored processes. The data leakage is possible both in scenarios of
direct usage of the perf_events system call API [2]_ and over data files
generated by the Perf tool user mode utility (Perf) [3]_ , [4]_ . The risk
depends on the nature of the data that perf_events performance monitoring
units (PMU) [2]_ and Perf collect and expose for performance analysis.
Collected system and performance data may be split into several
categories:

1. System hardware and software configuration data, for example: a CPU
   model and its cache configuration, an amount of available memory and
   its topology, used kernel and Perf versions, performance monitoring
   setup including experiment time, events configuration, Perf command
   line parameters, etc.

2. User and kernel module paths and their load addresses with sizes,
   process and thread names with their PIDs and TIDs, timestamps for
   captured hardware and software events.

3. Content of kernel software counters (e.g., for context switches, page
   faults, CPU migrations), architectural hardware performance counters
   (PMC) [8]_ and machine specific registers (MSR) [9]_ that provide
   execution metrics for various monitored parts of the system (e.g.,
   memory controller (IMC), interconnect (QPI/UPI) or peripheral (PCIe)
   uncore counters) without direct attribution to any execution context
   state.

4. Content of architectural execution context registers (e.g., RIP, RSP,
   RBP on x86_64), process user and kernel space memory addresses and
   data, content of various architectural MSRs that capture data from
   this category.

Data that belong to the fourth category can potentially contain
sensitive process data. If PMUs in some monitoring modes capture values
of execution context registers or data from process memory, then access
to such monitoring modes must be controlled and secured properly.
So, perf_events performance monitoring and observability operations are
subject to security access control management [5]_ .

perf_events access control
--------------------------

To perform security checks, the Linux implementation splits processes
into two categories [6]_ : a) privileged processes (whose effective user
ID is 0, referred to as superuser or root), and b) unprivileged
processes (whose effective UID is nonzero). Privileged processes bypass
all kernel security permission checks, so perf_events performance
monitoring is fully available to privileged processes without access,
scope and resource restrictions.

Unprivileged processes are subject to a full security permission check
based on the process's credentials [5]_ (usually: effective UID,
effective GID, and supplementary group list).

Linux divides the privileges traditionally associated with superuser
into distinct units, known as capabilities [6]_ , which can be
independently enabled and disabled on a per-thread basis for processes and
files of unprivileged users.

Unprivileged processes with the CAP_PERFMON capability enabled are treated
as privileged processes with respect to perf_events performance
monitoring and observability operations, and thus bypass *scope* permission
checks in the kernel. CAP_PERFMON implements the principle of least
privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance monitoring and
observability operations in the kernel and provides a secure approach to
performance monitoring and observability in the system.

For backward compatibility reasons, access to perf_events monitoring and
observability operations is also open to CAP_SYS_ADMIN privileged
processes, but CAP_SYS_ADMIN usage for secure monitoring and observability
use cases is discouraged in favor of the CAP_PERFMON capability.
If system audit records [14]_ for a process using the perf_events system call
API contain denial records of acquiring both CAP_PERFMON and CAP_SYS_ADMIN
capabilities, then providing the process with the CAP_PERFMON capability alone
is recommended as the preferred secure approach to resolve double access
denial logging related to usage of performance monitoring and observability.

Unprivileged processes using the perf_events system call are also subject
to a PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose
outcome determines whether monitoring is permitted. So unprivileged
processes provided with the CAP_SYS_PTRACE capability are effectively
permitted to pass the check.

Other capabilities granted to unprivileged processes can
effectively enable capturing of additional data required for later
performance analysis of monitored processes or a system. For example,
the CAP_SYSLOG capability permits reading kernel space memory addresses from
the /proc/kallsyms file.

Privileged Perf users groups
----------------------------

Mechanisms of capabilities, privileged capability-dumb files [6]_ and
file system ACLs [10]_ can be used to create dedicated groups of
privileged Perf users who are permitted to execute performance monitoring
and observability without scope limits. The following steps can be
taken to create such groups of privileged Perf users.

1. Create perf_users group of privileged Perf users, assign perf_users
   group to Perf tool executable and limit access to the executable for
   other users in the system who are not in the perf_users group:

::

   # groupadd perf_users
   # ls -alhF
   -rwxr-xr-x 2 root root 11M Oct 19 15:12 perf
   # chgrp perf_users perf
   # ls -alhF
   -rwxr-xr-x 2 root perf_users 11M Oct 19 15:12 perf
   # chmod o-rwx perf
   # ls -alhF
   -rwxr-x--- 2 root perf_users 11M Oct 19 15:12 perf

2. Assign the required capabilities to the Perf tool executable file and
   enable members of perf_users group with monitoring and observability
   privileges [6]_ :

::

   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
   perf: OK
   # getcap perf
   perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep

If the installed libcap doesn't yet support "cap_perfmon", use "38" instead,
i.e.:

::

   # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf

Note that you may need to have 'cap_ipc_lock' in the mix for tools such as
'perf top'. Alternatively, use 'perf top -m N' to reduce the memory that
it uses for the perf ring buffer; see the memory allocation section below.
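
Whether a given capability, such as cap_ipc_lock, actually ended up in the
file capability set can be checked against the ``getcap`` output shown
above. The following is a minimal sketch, not part of Perf or libcap: the
helper name ``getcap_line_has`` is hypothetical, and it assumes the
capability list on the line is separated by spaces, ``=``, ``+`` or ``,``
as in the sample output:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a line of getcap output names a capability.
#   $1  one line of getcap output, e.g. "perf = cap_syslog,cap_perfmon+ep"
#   $2  capability name to look for, e.g. "cap_perfmon"
getcap_line_has() {
    # Turn every separator (space, '=', '+', ',') into a newline, then
    # require an exact whole-line match of the capability name.
    printf '%s\n' "$1" | tr ' =+,' '\n\n\n\n' | grep -qx "$2"
}

# Example with the getcap output from step 2 above.
line='perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep'
if getcap_line_has "$line" cap_ipc_lock; then
    echo "cap_ipc_lock is set"
else
    echo "cap_ipc_lock is missing; consider 'perf top -m N'"
fi
```

On a live system the first argument would typically be ``"$(getcap perf)"``.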

Using a libcap without support for CAP_PERFMON will make cap_get_flag(caps, 38,
CAP_EFFECTIVE, &val) fail, which will lead the default event to be 'cycles:u',
so as a workaround explicitly ask for the 'cycles' event, i.e.:

::

   # perf top -e cycles

This collects both kernel and user samples with a perf binary that has just
CAP_PERFMON.

As a result, members of perf_users group are capable of conducting
performance monitoring and observability by using functionality of the
configured Perf tool executable that, when executed, passes perf_events
subsystem scope checks.

This specific access control management is only available to superuser
or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_
capabilities.

Unprivileged users
------------------

perf_events *scope* and *access* control for unprivileged processes
is governed by the perf_event_paranoid [2]_ setting:

-1:
     Impose no *scope* and *access* restrictions on using perf_events
     performance monitoring. Per-user per-cpu perf_event_mlock_kb [2]_
     locking limit is ignored when allocating memory buffers for storing
     performance data. This is the least secure mode since allowed
     monitored *scope* is maximized and no perf_events specific limits
     are imposed on *resources* allocated for performance monitoring.

>=0:
     *scope* includes per-process and system wide performance monitoring
     but excludes raw tracepoints and ftrace function tracepoints
     monitoring. CPU and system events that happen while executing either in
     user or in kernel space can be monitored and captured for later
     analysis. Per-user per-cpu perf_event_mlock_kb locking limit is
     imposed but ignored for unprivileged processes with CAP_IPC_LOCK
     [6]_ capability.

>=1:
     *scope* includes per-process performance monitoring only and
     excludes system wide performance monitoring. CPU and system events
     that happen while executing either in user or in kernel space can be
     monitored and captured for later analysis. Per-user per-cpu
     perf_event_mlock_kb locking limit is imposed but ignored for
     unprivileged processes with CAP_IPC_LOCK capability.

>=2:
     *scope* includes per-process performance monitoring only. CPU and
     system events that happen while executing in user space only can be
     monitored and captured for later analysis. Per-user per-cpu
     perf_event_mlock_kb locking limit is imposed but ignored for
     unprivileged processes with CAP_IPC_LOCK capability.
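
The levels listed above can be summarized by a small helper. This is a
minimal sketch rather than anything shipped with Perf: the function name
``describe_perf_paranoid`` and its wording are hypothetical, values below -1
are assumed to behave like -1, and every value of 2 or more is treated as
the ``>=2`` case described above:

```shell
#!/bin/sh
# Hypothetical helper: print the monitoring scope that a perf_event_paranoid
# value allows for unprivileged processes.
#   $1  paranoid level (default: the value currently set in /proc)
describe_perf_paranoid() {
    level=${1:-$(cat /proc/sys/kernel/perf_event_paranoid)}
    if [ "$level" -le -1 ]; then
        echo "no scope or access restrictions"
    elif [ "$level" -eq 0 ]; then
        echo "per-process and system wide; no raw or ftrace function tracepoints"
    elif [ "$level" -eq 1 ]; then
        echo "per-process only; user and kernel space events"
    else
        echo "per-process only; user space events"
    fi
}

# Example: describe the level configured on the running system.
# describe_perf_paranoid
```

Passing the level as an argument makes it possible to check what a planned
setting would allow before changing the live value.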

Resource control
----------------

Open file descriptors
+++++++++++++++++++++

The perf_events system call API [2]_ allocates file descriptors for
every configured PMU event. Open file descriptors are a per-process
accountable resource governed by the RLIMIT_NOFILE [11]_ limit
(ulimit -n), which is usually derived from the login shell process. When
configuring Perf collection for a long list of events on a large server
system, this limit can easily be hit, preventing the required monitoring
configuration. The RLIMIT_NOFILE limit can be increased on a per-user basis
by modifying the content of the limits.conf file [12]_ . Ordinarily, a Perf
sampling session (perf record) requires an amount of open perf_event
file descriptors that is not less than the number of monitored events
multiplied by the number of monitored CPUs.

Memory allocation
+++++++++++++++++

The amount of memory available to user processes for capturing
performance monitoring data is governed by the perf_event_mlock_kb [2]_
setting. This perf_event specific resource setting defines overall
per-cpu limits of memory allowed for mapping by the user processes to
execute performance monitoring. The setting essentially extends the
RLIMIT_MEMLOCK [11]_ limit, but only for memory regions mapped
specifically for capturing monitored performance events and related data.
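
Because the limit is per-cpu, the total memory granted on top of
RLIMIT_MEMLOCK scales with the number of CPUs. The following is a minimal
sketch of that arithmetic, not a Perf facility: the helper name
``perf_mlock_budget_kb`` is hypothetical, and using the online CPU count
from ``getconf`` as the multiplier is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: estimate the memory (in KiB) a user may lock for
# perf_event mmap buffers on top of RLIMIT_MEMLOCK.
#   $1  per-cpu limit in KiB (default: current perf_event_mlock_kb value)
#   $2  number of CPUs       (default: online CPUs reported by getconf)
perf_mlock_budget_kb() {
    mlock_kb=${1:-$(cat /proc/sys/kernel/perf_event_mlock_kb)}
    ncpus=${2:-$(getconf _NPROCESSORS_ONLN)}
    # The limit applies per cpu, so the total is a simple product.
    echo $((mlock_kb * ncpus))
}

# Example: budget for the running system, using both defaults.
# perf_mlock_budget_kb
```

Called with explicit arguments, for instance ``perf_mlock_budget_kb 516 8``,
the helper prints 4128.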

For example, if a machine has eight cores and perf_event_mlock_kb limit
is set to 516 KiB, then a user process is provided with 516 KiB * 8 =
4128 KiB of memory above the RLIMIT_MEMLOCK limit (ulimit -l) for
perf_event mmap buffers. In particular, this means that, if the user
wants to start two or more performance monitoring processes, the user is
required to manually distribute the available 4128 KiB between the
monitoring processes, for example, using the --mmap-pages Perf record
mode option. Otherwise, the first started performance monitoring process
allocates all available 4128 KiB and the other processes will fail to
proceed due to the lack of memory.

RLIMIT_MEMLOCK and perf_event_mlock_kb resource constraints are ignored
for processes with the CAP_IPC_LOCK capability. Thus, perf_events/Perf
privileged users can be provided with memory above the constraints for
perf_events/Perf performance monitoring purpose by providing the Perf
executable with CAP_IPC_LOCK capability.

Bibliography
------------

.. [1] `<https://lwn.net/Articles/337493/>`_
.. [2] `<http://man7.org/linux/man-pages/man2/perf_event_open.2.html>`_
.. [3] `<http://web.eece.maine.edu/~vweaver/projects/perf_events/>`_
.. [4] `<https://perf.wiki.kernel.org/index.php/Main_Page>`_
.. [5] `<https://www.kernel.org/doc/html/latest/security/credentials.html>`_
.. [6] `<http://man7.org/linux/man-pages/man7/capabilities.7.html>`_
.. [7] `<http://man7.org/linux/man-pages/man2/ptrace.2.html>`_
.. [8] `<https://en.wikipedia.org/wiki/Hardware_performance_counter>`_
.. [9] `<https://en.wikipedia.org/wiki/Model-specific_register>`_
.. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_
.. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_
.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
.. [13] `<https://sites.google.com/site/fullycapable>`_
.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_