L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

   - Processors from AMD, Centaur and other non Intel vendors

   - Older processor models, where the CPU family is < 6

   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
     Penwell, Pineview, Silvermont, Airmont, Merrifield)

   - The Intel XEON PHI family

   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
     by the Meltdown vulnerability either. These CPUs should become
     available by end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
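
As a minimal sketch of that check, the following Python snippet reads the
sysfs file and sorts its content into a few coarse categories. The path is
the one documented in the L1TF system information section; the function
names and category labels are illustrative assumptions, not kernel
interfaces, and the exact text after the 'Mitigation:' prefix may differ
between kernel versions.

```python
from pathlib import Path

# Path documented in the "L1TF system information" section.
L1TF_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def classify_l1tf(status: str) -> str:
    """Map the raw sysfs string to a coarse, illustrative category."""
    status = status.strip()
    if status == "Not affected":
        return "not-affected"
    if status.startswith("Mitigation:"):
        return "mitigated"
    # Anything else (including an empty string) is treated conservatively.
    return "unknown-or-vulnerable"


def read_l1tf_status(path: Path = L1TF_SYSFS) -> str:
    """Return the category for this host, or 'unavailable' if unreadable."""
    try:
        return classify_l1tf(path.read_text())
    except OSError:
        # The file is missing on older kernels and on non-Linux systems.
        return "unavailable"


if __name__ == "__main__":
    print(read_l1tf_status())
```

A result of "mitigated" only says that the host protection is active; it
does not by itself say anything about the guest related mitigations.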

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

   =============  =================  ==============================
   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
   =============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is retired eventually, the pure act of loading the
data and making it available to other speculative instructions gives
unprivileged malicious code an opportunity for side channel attacks,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows attacks on any physical memory address in the system and the attack
works across all protection domains.
It allows attacks on SGX and also
works from inside virtual machines because the speculation bypasses the
extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

   Operating Systems store arbitrary information in the address bits of a
   PTE which is marked non present. This allows a malicious user space
   application to attack the physical memory to which these PTEs resolve.
   In some cases user-space can maliciously influence the information
   encoded in the address bits of the PTE, thus making attacks more
   deterministic and more practical.

   The Linux kernel contains a mitigation for this attack vector, PTE
   inversion, which is permanently enabled and has no performance
   impact. The kernel ensures that the address bits of PTEs, which are not
   marked present, never point to cacheable physical memory space.

   A system with an up to date kernel is protected against attacks from
   malicious user space applications.

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The fact that L1TF breaks all domain protections allows malicious guest
   OSes, which can control the PTEs directly, and malicious guest user
   space applications, which run on an unprotected guest kernel lacking the
   PTE inversion mitigation for L1TF, to attack physical host memory.

   A special aspect of L1TF in the context of virtualization is symmetric
   multi threading (SMT). The Intel implementation of SMT is called
   HyperThreading. The fact that Hyperthreads on the affected processors
   share the L1 Data Cache (L1D) is important for this. As the flaw only
   allows attacks on data which is present in L1D, a malicious guest running
   on one Hyperthread can attack the data which is brought into the L1D by
   the context which runs on the sibling Hyperthread of the same physical
   core. This context can be host OS, host user space or a different guest.

   If the processor does not support Extended Page Tables, the attack is
   only possible when the hypervisor does not sanitize the content of the
   effective (shadow) page tables.

   While solutions exist to mitigate these attack vectors fully, these
   mitigations are not enabled by default in the Linux kernel because they
   can affect performance significantly. The kernel provides several
   mechanisms which can be utilized to address the problem depending on the
   deployment scenario.
   The mitigations, their protection scope and impact
   are described in the next sections.

   The default mitigations and the rationale for choosing them are explained
   at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

/sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

  ===========================   ===============================
  'Not affected'                The processor is not vulnerable
  'Mitigation: PTE Inversion'   The host protection is active
  ===========================   ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

  - SMT status:

    =====================  ================
    'VMX: SMT vulnerable'  SMT is enabled
    'VMX: SMT disabled'    SMT is disabled
    =====================  ================

  - L1D Flush mode:

    ================================  ====================================
    'L1D vulnerable'                  L1D flushing is disabled

    'L1D conditional cache flushes'   L1D flush is conditionally enabled

    'L1D cache flushes'               L1D flush is unconditionally enabled
    ================================  ====================================

The resulting grade of protection is discussed in the following sections.


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

   To make sure that a guest cannot attack data which is present in the L1D
   the hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed
   by a potentially malicious guest, it also flushes the guest
   data. Flushing the L1D has a performance impact as the processor has to
   bring the flushed guest data back into the L1D. Depending on the
   frequency of VMEXIT/VMENTER and the type of computations in the guest
   performance degradation in the range of 1% to 50% has been observed. For
   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
   minimal.
   Virtio and mechanisms like posted interrupts are designed to
   confine the VMEXITs to a bare minimum, but specific configurations and
   application scenarios might still suffer from a high VMEXIT rate.

   The kernel provides two L1D flush modes:

    - conditional ('cond')
    - unconditional ('always')

   The conditional mode avoids L1D flushing after VMEXITs which execute
   only audited code paths before the corresponding VMENTER. These code
   paths have been verified not to expose secrets or other
   interesting data to an attacker, but they can leak information about the
   address space layout of the hypervisor.

   Unconditional mode flushes L1D on all VMENTER invocations and provides
   maximum protection. It has a higher overhead than the conditional
   mode. The overhead cannot be quantified correctly as it depends on the
   workload scenario and the resulting number of VMEXITs.

   The general recommendation is to enable L1D flush on VMENTER. The kernel
   defaults to conditional mode on affected processors.

   **Note** that L1D flush does not prevent the SMT problem because the
   sibling thread will also bring back its data into the L1D which makes it
   attackable again.

   L1D flush can be controlled by the administrator via the kernel command
   line and sysfs control files. See :ref:`mitigation_control_command_line`
   and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   To address the SMT problem, it is possible to make a guest or a group of
   guests affine to one or more physical cores. The proper mechanism for
   that is to utilize exclusive cpusets to ensure that no other guest or
   host tasks can run on these cores.

   If only a single guest or related guests run on sibling SMT threads on
   the same physical core then they can only attack their own memory and
   restricted parts of the host memory.

   Host memory is attackable when one of the sibling SMT threads runs in
   host OS (hypervisor) context and the other in guest context. The amount
   of valuable information from the host OS context depends on the context
   which the host OS executes, i.e. interrupts, soft interrupts and kernel
   threads. The amount of valuable data from these contexts cannot be
   declared as non-interesting for an attacker without deep inspection of
   the code.

   **Note** that assigning guests to a fixed set of physical cores affects
   the ability of the scheduler to do load balancing and might have
   negative effects on CPU utilization depending on the hosting
   scenario. Disabling SMT might be a viable alternative for particular
   scenarios.
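
   Confinement only helps when a guest owns every SMT sibling of the cores
   it runs on. A minimal Python sketch for grouping logical CPUs into
   physical cores is shown here. It assumes the Linux CPU topology files
   /sys/devices/system/cpu/cpuN/topology/thread_siblings_list are
   available; the helper names are illustrative and not part of any kernel
   interface.

```python
from __future__ import annotations

from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(cpu_root: Path = Path("/sys/devices/system/cpu")) -> list[list[int]]:
    """Return one sorted CPU list per physical core, based on sysfs topology."""
    groups: set[tuple[int, ...]] = set()
    for entry in cpu_root.glob("cpu[0-9]*/topology/thread_siblings_list"):
        groups.add(tuple(parse_cpu_list(entry.read_text())))
    return [list(group) for group in sorted(groups)]


def covers_whole_cores(guest_cpus: set[int], groups: list[list[int]]) -> bool:
    """True if every core touched by guest_cpus lies fully inside guest_cpus."""
    return all(
        set(group) <= guest_cpus
        for group in groups
        if guest_cpus & set(group)
    )
```

   For example, on a host where CPUs 0 and 4 share a core and CPUs 1 and 5
   share another, a guest pinned to CPUs 0 and 4 covers a whole core,
   whereas a guest pinned to CPUs 0 and 1 leaves the siblings 4 and 5
   available to other contexts.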

   For further information about confining guests to a single or to a group
   of cores consult the cpusets documentation:

   https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst

.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

   Interrupts can be made affine to logical CPUs. This is not universally
   true because there are types of interrupts which are truly per CPU
   interrupts, e.g. the local timer interrupt. Aside from that, multi queue
   devices affine their interrupts to single CPUs or groups of CPUs per
   queue without allowing the administrator to control the affinities.

   Moving the interrupts, which can be affinity controlled, away from CPUs
   which run untrusted guests, reduces the attack vector space.

   Whether the interrupts which are affine to CPUs, which run untrusted
   guests, provide interesting data for an attacker depends on the system
   configuration and the scenarios which run on the system. While for some
   of the interrupts it can be assumed that they won't expose interesting
   information beyond exposing hints about the host OS memory layout, there
   is no way to make general assumptions.

   Interrupt affinity can be controlled by the administrator via the
   /proc/irq/$NR/smp_affinity[_list] files.
   Limited documentation is
   available at:

   https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst

.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

   To prevent the SMT issues of L1TF it might be necessary to disable SMT
   completely. Disabling SMT can have a significant performance impact, but
   the impact depends on the hosting scenario and the type of workloads.
   The impact of disabling SMT also needs to be weighed against the impact
   of other mitigation solutions like confining guests to dedicated cores.

   The kernel provides a sysfs interface to retrieve the status of SMT and
   to control it. It also provides a kernel command line interface to
   control SMT.

   The kernel command line interface consists of the following options:

     =========== ============================================================
     nosmt       Affects the bring up of the secondary CPUs during boot. The
                 kernel tries to bring all present CPUs online during the
                 boot process. "nosmt" makes sure that from each physical
                 core only one thread, the so called primary (hyper)
                 thread, is activated. Due to a design flaw of Intel
                 processors related to Machine Check Exceptions the non
                 primary siblings have to be brought up at least partially
                 and are then shut down again. "nosmt" can be undone via
                 the sysfs interface.

     nosmt=force Has the same effect as "nosmt" but it does not allow
                 undoing the SMT disable via the sysfs interface.
     =========== ============================================================

   The sysfs interface provides two files:

   - /sys/devices/system/cpu/smt/control
   - /sys/devices/system/cpu/smt/active

   /sys/devices/system/cpu/smt/control:

     This file allows reading out the SMT control state and provides the
     ability to disable or (re)enable SMT. The possible states are:

        ==============  ===================================================
        on              SMT is supported by the CPU and enabled. All
                        logical CPUs can be onlined and offlined without
                        restrictions.

        off             SMT is supported by the CPU and disabled. Only
                        the so called primary SMT threads can be onlined
                        and offlined without restrictions. An attempt to
                        online a non-primary sibling is rejected.

        forceoff        Same as 'off' but the state cannot be controlled.
                        Attempts to write to the control file are rejected.

        notsupported    The processor does not support SMT. It's therefore
                        not affected by the SMT implications of L1TF.
                        Attempts to write to the control file are rejected.
        ==============  ===================================================

     The possible states which can be written into this file to control SMT
     state are:

     - on
     - off
     - forceoff

   /sys/devices/system/cpu/smt/active:

     This file reports whether SMT is enabled and active, i.e. if on any
     physical core two or more sibling threads are online.

   SMT control is also possible at boot time via the l1tf kernel command
   line parameter in combination with L1D flush control. See
   :ref:`mitigation_control_command_line`.

5. Disabling EPT
^^^^^^^^^^^^^^^^

  Disabling EPT for virtual machines provides full mitigation for L1TF even
  with SMT enabled, because the effective page tables for guests are
  managed and sanitized by the hypervisor. However, disabling EPT has a
  significant performance impact, especially when the Meltdown mitigation
  KPTI is enabled.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows controlling the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

  ============  =============================================================
  full          Provides all available mitigations for the L1TF
                vulnerability. Disables SMT and enables all mitigations in
                the hypervisors, i.e. unconditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  full,force    Same as 'full', but disables SMT and L1D flush runtime
                control. Implies the 'nosmt=force' command line option.
                (i.e. sysfs control of SMT is disabled.)

  flush         Leaves SMT enabled and enables the default hypervisor
                mitigation, i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
                i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
                started in a potentially insecure configuration.

  off           Disables hypervisor mitigations and doesn't emit any
                warnings.
                It also drops the swap size and available RAM limit
                restrictions on both hypervisor and bare metal.

  ============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.


.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=".
It takes the
following arguments:

  ============  ==============================================================
  always        L1D cache flush on every VMENTER.

  cond          Flush L1D on VMENTER only when the code between VMEXIT and
                VMENTER can leak host memory which is considered
                interesting for an attacker. This still can leak host memory
                which allows e.g. determining the host's address space
                layout.

  never         Disables the mitigation.
  ============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, or modified at runtime via the sysfs
file:

/sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.

.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The system is protected by the kernel unconditionally and no further
   action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   If the guest comes from a trusted source and the guest OS kernel is
   guaranteed to have the L1TF mitigations in place the system is fully
   protected against L1TF and no further action is required.

   To avoid the overhead of the default L1D flushing on VMENTER the
   administrator can disable the flushing via the kernel command line and
   sysfs control files. See :ref:`mitigation_control_command_line` and
   :ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

  If SMT is not supported by the processor or disabled in the BIOS or by
  the kernel, it's only required to enforce L1D flushing on VMENTER.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

  If EPT is not supported by the processor or disabled in the hypervisor,
  the system is fully protected. SMT can stay enabled and L1D flushing on
  VMENTER is not required.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

  If SMT and EPT are supported and active then various degrees of
  mitigations can be employed:

  - L1D flushing on VMENTER:

    L1D flushing on VMENTER is the minimal protection requirement, but it
    is only potent in combination with other mitigation methods.

    Conditional L1D flushing is the default behaviour and can be tuned. See
    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

  - Guest confinement:

    Confinement of guests to a single or a group of physical cores which
    are not running any other processes, can reduce the attack surface
    significantly, but interrupts, soft interrupts and kernel threads can
    still expose valuable data to a potential attacker. See
    :ref:`guest_confinement`.

  - Interrupt isolation:

    Isolating the guest CPUs from interrupts can reduce the attack surface
    further, but still allows a malicious guest to explore a limited amount
    of host physical memory. This can at least be used to gain knowledge
    about the host address space layout. The interrupts which have a fixed
    affinity to the CPUs which run the untrusted guests can, depending on
    the scenario, still trigger soft interrupts and schedule kernel threads
    which might expose valuable information. See
    :ref:`interrupt_isolation`.
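
  As an illustration of interrupt isolation, the sketch below computes a
  CPU list which excludes the CPUs reserved for untrusted guests, in the
  list format used by /proc/irq/$NR/smp_affinity_list. The function names
  and the example CPU numbers are assumptions made for this example.
  Writing the file requires sufficient privileges, and the write may fail
  for interrupts whose affinity cannot be changed.

```python
from pathlib import Path


def host_only_affinity(all_cpus, guest_cpus) -> str:
    """Return a comma-separated CPU list of all_cpus minus guest_cpus."""
    remaining = sorted(set(all_cpus) - set(guest_cpus))
    if not remaining:
        raise ValueError("no CPU left to handle the interrupt")
    return ",".join(str(cpu) for cpu in remaining)


def set_irq_affinity(irq: int, cpu_list: str,
                     proc_root: Path = Path("/proc")) -> bool:
    """Try to write cpu_list to the IRQ's smp_affinity_list; report success."""
    target = proc_root / "irq" / str(irq) / "smp_affinity_list"
    try:
        target.write_text(cpu_list)
    except OSError:
        # Expected for fixed-affinity interrupts or insufficient privileges.
        return False
    return True
```

  With eight logical CPUs of which 2, 3, 6 and 7 run untrusted guests,
  host_only_affinity(range(8), {2, 3, 6, 7}) yields the list of the
  remaining four CPUs, which can then be passed to set_irq_affinity() for
  each interrupt that allows affinity changes.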

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

  - Disabling SMT:

    Disabling SMT and enforcing the L1D flushing provides the maximum
    amount of protection. This mitigation does not depend on any of the
    above mitigation methods.

    SMT control and L1D flushing can be tuned by the command line
    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
    time with the matching sysfs control files. See :ref:`smt_control`,
    :ref:`mitigation_control_command_line` and
    :ref:`mitigation_control_kvm`.

  - Disabling EPT:

    Disabling EPT provides the maximum amount of protection as well. It
    does not depend on any of the above mitigation methods. SMT can stay
    enabled and L1D flushing is not required, but the performance impact is
    significant.

    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
    parameter.

3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine.
VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

 - Flush the L1D cache on every switch from the nested hypervisor to the
   nested virtual machine, so that the nested hypervisor's secrets are not
   exposed to the nested virtual machine;

 - Flush the L1D cache on every switch from the nested virtual machine to
   the nested hypervisor; this is a complex operation, and flushing the L1D
   cache prevents the bare metal hypervisor's secrets from being exposed to
   the nested virtual machine;

 - Instruct the nested hypervisor to not perform any L1D cache flush. This
   is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

  The kernel default mitigations for vulnerable processors are:

  - PTE inversion to protect against malicious user space. This is done
    unconditionally and cannot be controlled. The swap storage is limited
    to ~16TB.

  - L1D conditional flushing on VMENTER when EPT is enabled for
    a guest.

  The kernel does not by default enforce the disabling of SMT, which leaves
  SMT systems vulnerable when running untrusted guests with EPT enabled.

  The rationale for this choice is:

  - Forcibly disabling SMT can break existing setups, especially with
    unattended updates.

  - If regular users run untrusted guests on their machine, then L1TF is
    just an add-on to other malware which might be embedded in an untrusted
    guest, e.g. spam-bots or attacks on the local network.

    There is no technical way to prevent a user from running untrusted code
    on their machines blindly.

  - It's technically extremely unlikely and, based on today's knowledge,
    even impossible that L1TF can be exploited via the most popular attack
    mechanisms like JavaScript, because these mechanisms have no way to
    control PTEs. If this were possible and no other mitigation were
    available, then the default might be different.

  - The administrators of cloud and hosting setups have to carefully
    analyze the risk for their scenarios and make the appropriate
    mitigation choices, which might even vary across their deployed
    machines and also result in other changes of their overall setup.
    There is no way for the kernel to provide a sensible default for this
    kind of scenario.
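
Because the default leaves SMT enabled, it can be useful to check what a
running kernel reports. The following is a minimal sketch under stated
assumptions: it reads the L1TF vulnerability file in sysfs (see
:ref:`l1tf_sys_info`), assumed here to live at
``/sys/devices/system/cpu/vulnerabilities/l1tf``, and looks only for
substrings such as 'PTE Inversion', 'conditional cache flushes' and
'SMT vulnerable'. The function name ``summarize_l1tf`` and the keys of the
returned dictionary are hypothetical.

```python
from pathlib import Path

# Location of the L1TF status file assumed by this sketch.
L1TF_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def summarize_l1tf(status):
    """Summarize an L1TF status string by substring matching only.

    The exact wording of the status line may differ between kernel
    versions, so no attempt is made to parse it field by field.
    """
    status = status.strip()
    return {
        "affected": status != "Not affected",
        "pte_inversion": "PTE Inversion" in status,
        "conditional_flush": "conditional cache flushes" in status,
        "smt_vulnerable": "SMT vulnerable" in status,
    }


if __name__ == "__main__":
    try:
        text = L1TF_STATUS.read_text()
    except OSError as exc:
        print(f"cannot read {L1TF_STATUS}: {exc}")
    else:
        for key, value in summarize_l1tf(text).items():
            print(f"{key}: {value}")
```

On a host running with the default mitigations and SMT enabled, one would
expect ``pte_inversion``, ``conditional_flush`` and ``smt_vulnerable`` to all
be true. That combination is the state which administrators running untrusted
guests have to evaluate for themselves.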