L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

   - Processors from AMD, Centaur and other non-Intel vendors

   - Older processor models, where the CPU family is < 6

   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
     Penwell, Pineview, Silvermont, Airmont, Merrifield)

   - The Intel XEON PHI family

   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
     by the Meltdown vulnerability either. These CPUs should become
     available by end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

   =============  =================  ==============================
   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
   =============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is eventually retired, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for side channel attacks by unprivileged malicious code,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows attacking any physical memory address in the system and the attack
works across all protection domains. It allows an attack on SGX and also
works from inside virtual machines because the speculation bypasses the
extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

   Operating Systems store arbitrary information in the address bits of a
   PTE which is marked non present. This allows a malicious user space
   application to attack the physical memory to which these PTEs resolve.
   In some cases user-space can maliciously influence the information
   encoded in the address bits of the PTE, thus making attacks more
   deterministic and more practical.

   The Linux kernel contains a mitigation for this attack vector, PTE
   inversion, which is permanently enabled and has no performance
   impact. The kernel ensures that the address bits of PTEs, which are not
   marked present, never point to cacheable physical memory space.

   A system with an up to date kernel is protected against attacks from
   malicious user space applications.
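
   The effect of PTE inversion can be illustrated with a minimal sketch. It
   is a simplified model and not the kernel's implementation: the bit layout
   (``PTE_PRESENT``, ``PFN_MASK``) and the function names are assumptions
   made for this example. Inverting the address bits of a non-present PTE
   turns a small value, such as a swap offset, into an address near the top
   of the physical address space, where no cacheable memory is expected.

   .. code-block:: python

      # Simplified model of an x86-64 PTE; the constants are assumptions.
      PTE_PRESENT = 1 << 0
      PFN_MASK = ((1 << 52) - 1) & ~0xFFF   # physical address bits 12..51

      def store_non_present(pte: int) -> int:
          """Return the value to store for a PTE that is not marked present."""
          if pte & PTE_PRESENT:
              return pte                    # present PTEs are stored unchanged
          return pte ^ PFN_MASK             # invert the address bits

      def load_non_present(stored: int) -> int:
          """Undo the inversion to recover the originally encoded bits."""
          if stored & PTE_PRESENT:
              return stored
          return stored ^ PFN_MASK

      swap_entry = 0x1234 << 12             # hypothetical swap offset
      stored = store_non_present(swap_entry)

   Applying the same XOR on the read side recovers the original bits, so the
   information kept in a non-present PTE is preserved.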

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The fact that L1TF breaks all domain protections allows malicious guest
   OSes, which can control the PTEs directly, and malicious guest user
   space applications, which run on an unprotected guest kernel lacking the
   PTE inversion mitigation for L1TF, to attack physical host memory.

   A special aspect of L1TF in the context of virtualization is simultaneous
   multithreading (SMT). The Intel implementation of SMT is called
   HyperThreading. The fact that Hyperthreads on the affected processors
   share the L1 Data Cache (L1D) is important for this. As the flaw only
   allows attacking data which is present in L1D, a malicious guest running
   on one Hyperthread can attack the data which is brought into the L1D by
   the context which runs on the sibling Hyperthread of the same physical
   core. This context can be host OS, host user space or a different guest.
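
   Which logical CPUs share a physical core, and therefore an L1D, can be
   read from the CPU topology files in sysfs. The following sketch assumes
   the ``topology/thread_siblings_list`` file below
   ``/sys/devices/system/cpu/cpuN``; the helper names are made up for
   illustration.

   .. code-block:: python

      from pathlib import Path

      def parse_cpu_list(text: str) -> list[int]:
          """Parse a kernel CPU list such as '0-3,8' into sorted CPU numbers."""
          cpus = set()
          for part in text.strip().split(","):
              if not part:
                  continue
              first, _, last = part.partition("-")
              cpus.update(range(int(first), int(last or first) + 1))
          return sorted(cpus)

      def smt_siblings(cpu: int, sysfs: str = "/sys/devices/system/cpu") -> list[int]:
          """Return the logical CPUs sharing a physical core with `cpu`."""
          path = Path(sysfs) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
          return parse_cpu_list(path.read_text())

   A result with more than one entry means that the CPU has at least one
   online sibling thread whose data can end up in the shared L1D.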

   If the processor does not support Extended Page Tables, the attack is
   only possible when the hypervisor does not sanitize the content of the
   effective (shadow) page tables.

   While solutions exist to mitigate these attack vectors fully, these
   mitigations are not enabled by default in the Linux kernel because they
   can affect performance significantly. The kernel provides several
   mechanisms which can be utilized to address the problem depending on the
   deployment scenario. The mitigations, their protection scope and impact
   are described in the next sections.

   The default mitigations and the rationale for choosing them are explained
   at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

/sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

  ===========================   ===============================
  'Not affected'                The processor is not vulnerable
  'Mitigation: PTE Inversion'   The host protection is active
  ===========================   ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

  - SMT status:

    =====================  ================
    'VMX: SMT vulnerable'  SMT is enabled
    'VMX: SMT disabled'    SMT is disabled
    =====================  ================

  - L1D Flush mode:

    ================================  ====================================
    'L1D vulnerable'                  L1D flushing is disabled

    'L1D conditional cache flushes'   L1D flush is conditionally enabled

    'L1D cache flushes'               L1D flush is unconditionally enabled
    ================================  ====================================

The resulting grade of protection is discussed in the following sections.
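
A monitoring script could classify the content of this file along the lines
of the sketch below. The exact layout of the combined string (separators,
ordering) is not specified here, so the sketch only searches for the
substrings from the tables above and may need adjusting to the text a given
kernel reports. The function name and the returned keys are hypothetical.

.. code-block:: python

   from pathlib import Path

   L1TF_FILE = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")

   def summarize_l1tf(status: str) -> dict:
       """Classify an L1TF status string by the documented substrings."""
       status = status.strip()
       if "conditional cache flushes" in status:
           flush = "cond"
       elif "cache flushes" in status:
           flush = "always"
       elif "L1D vulnerable" in status:
           flush = "never"
       else:
           flush = None                  # no hypervisor information present
       return {
           "affected": status != "Not affected",
           "pte_inversion": "PTE Inversion" in status,
           "smt_vulnerable": "SMT vulnerable" in status,
           "l1d_flush": flush,
       }

   if L1TF_FILE.exists():
       print(summarize_l1tf(L1TF_FILE.read_text()))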


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

   To make sure that a guest cannot attack data which is present in the L1D
   the hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed
   by a potentially malicious guest, but also the guest data. Flushing the
   L1D has a performance impact as the processor has to bring the flushed
   guest data back into the L1D. Depending on the frequency of
   VMEXIT/VMENTER and the type of computations in the guest, performance
   degradation in the range of 1% to 50% has been observed. For
   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
   minimal. Virtio and mechanisms like posted interrupts are designed to
   confine the VMEXITs to a bare minimum, but specific configurations and
   application scenarios might still suffer from a high VMEXIT rate.

   The kernel provides two L1D flush modes:

    - conditional ('cond')
    - unconditional ('always')

   The conditional mode avoids L1D flushing after VMEXITs which execute
   only audited code paths before the corresponding VMENTER. These code
   paths have been verified not to expose secrets or other
   interesting data to an attacker, but they can leak information about the
   address space layout of the hypervisor.

   Unconditional mode flushes L1D on all VMENTER invocations and provides
   maximum protection. It has a higher overhead than the conditional
   mode. The overhead cannot be quantified correctly as it depends on the
   workload scenario and the resulting number of VMEXITs.

   The general recommendation is to enable L1D flush on VMENTER. The kernel
   defaults to conditional mode on affected processors.

   **Note** that L1D flush does not prevent the SMT problem because the
   sibling thread will also bring back its data into the L1D which makes it
   attackable again.

   L1D flush can be controlled by the administrator via the kernel command
   line and sysfs control files. See :ref:`mitigation_control_command_line`
   and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   To address the SMT problem, it is possible to make a guest or a group of
   guests affine to one or more physical cores. The proper mechanism for
   that is to utilize exclusive cpusets to ensure that no other guest or
   host tasks can run on these cores.

   If only a single guest or related guests run on sibling SMT threads on
   the same physical core then they can only attack their own memory and
   restricted parts of the host memory.

   Host memory is attackable when one of the sibling SMT threads runs in
   host OS (hypervisor) context and the other in guest context. The amount
   of valuable information from the host OS context depends on the context
   which the host OS executes, i.e. interrupts, soft interrupts and kernel
   threads. The amount of valuable data from these contexts cannot be
   declared as non-interesting for an attacker without deep inspection of
   the code.

   **Note** that assigning guests to a fixed set of physical cores affects
   the ability of the scheduler to do load balancing and might have
   negative effects on CPU utilization depending on the hosting
   scenario. Disabling SMT might be a viable alternative for particular
   scenarios.

   For further information about confining guests to a single or to a group
   of cores consult the cpusets documentation:

   https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
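
   As a rough illustration, the writes needed for an exclusive cgroup-v1
   cpuset can be generated as shown below. The mount point
   ``/sys/fs/cgroup/cpuset``, the group name ``guest1``, the memory node
   ``0`` and the CPU numbers are assumptions for this example. The chosen
   CPUs should cover every sibling thread of the selected physical cores.

   .. code-block:: python

      from pathlib import Path

      def exclusive_cpuset_plan(name, cpus, mems="0",
                                root="/sys/fs/cgroup/cpuset"):
          """Return (file, value) writes for an exclusive cgroup-v1 cpuset."""
          base = Path(root) / name
          cpu_list = ",".join(str(c) for c in sorted(cpus))
          return [
              (base / "cpuset.cpus", cpu_list),
              (base / "cpuset.mems", mems),
              (base / "cpuset.cpu_exclusive", "1"),
          ]

      # Hypothetical layout: cores 2 and 3 with sibling threads 6 and 7.
      for path, value in exclusive_cpuset_plan("guest1", [2, 3, 6, 7]):
          print(f"echo {value} > {path}")

   The sketch only prints the writes. Creating the cpuset directory and
   moving the guest's VCPU threads into its ``tasks`` file are left out.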

.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

   Interrupts can be made affine to logical CPUs. This is not universally
   true because there are types of interrupts which are truly per CPU
   interrupts, e.g. the local timer interrupt. Apart from that, multi queue
   devices affine their interrupts to single CPUs or groups of CPUs per
   queue without allowing the administrator to control the affinities.

   Moving the interrupts, which can be affinity controlled, away from CPUs
   which run untrusted guests, reduces the attack vector space.

   Whether the interrupts which are affine to CPUs, which run untrusted
   guests, provide interesting data for an attacker depends on the system
   configuration and the scenarios which run on the system. While for some
   of the interrupts it can be assumed that they won't expose interesting
   information beyond exposing hints about the host OS memory layout, there
   is no way to make general assumptions.

   Interrupt affinity can be controlled by the administrator via the
   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
   available at:

   https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
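
   The value for an ``smp_affinity_list`` file can be derived from the set
   of CPUs reserved for guests, as in the following sketch. The CPU layout
   and the helper names are hypothetical.

   .. code-block:: python

      def format_cpu_list(cpus) -> str:
          """Format CPU numbers as a kernel CPU list, e.g. 0,1,2,5 as '0-2,5'."""
          cpus = sorted(set(cpus))
          ranges, start = [], None
          for i, cpu in enumerate(cpus):
              if start is None:
                  start = cpu
              # Close the current range when the next CPU is not adjacent.
              if i + 1 == len(cpus) or cpus[i + 1] != cpu + 1:
                  ranges.append(str(start) if start == cpu else f"{start}-{cpu}")
                  start = None
          return ",".join(ranges)

      def irq_affinity_for_host(online_cpus, guest_cpus) -> str:
          """CPU list that keeps an interrupt away from the guest CPUs."""
          allowed = set(online_cpus) - set(guest_cpus)
          if not allowed:
              raise ValueError("no CPU left for interrupt handling")
          return format_cpu_list(allowed)

      # Hypothetical layout: 8 logical CPUs, guests confined to 2, 3, 6, 7.
      print(irq_affinity_for_host(range(8), [2, 3, 6, 7]))

   The resulting string would be written to
   ``/proc/irq/$NR/smp_affinity_list``. The write can fail for interrupts
   whose affinity cannot be changed by the administrator.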

.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

   To prevent the SMT issues of L1TF it might be necessary to disable SMT
   completely. Disabling SMT can have a significant performance impact, but
   the impact depends on the hosting scenario and the type of workloads.
   The impact of disabling SMT also needs to be weighed against the impact
   of other mitigation solutions like confining guests to dedicated cores.

   The kernel provides a sysfs interface to retrieve the status of SMT and
   to control it. It also provides a kernel command line interface to
   control SMT.

   The kernel command line interface consists of the following options:

     =========== ==========================================================
     nosmt       Affects the bring up of the secondary CPUs during boot. The
                 kernel tries to bring all present CPUs online during the
                 boot process. "nosmt" makes sure that from each physical
                 core only one, the so called primary (hyper) thread, is
                 activated. Due to a design flaw of Intel processors related
                 to Machine Check Exceptions the non primary siblings have
                 to be brought up at least partially and are then shut down
                 again.  "nosmt" can be undone via the sysfs interface.

     nosmt=force Has the same effect as "nosmt" but does not allow undoing
                 the SMT disable via the sysfs interface.
     =========== ==========================================================

   The sysfs interface provides two files:

   - /sys/devices/system/cpu/smt/control
   - /sys/devices/system/cpu/smt/active

   /sys/devices/system/cpu/smt/control:

     This file allows reading out the SMT control state and provides the
     ability to disable or (re)enable SMT. The possible states are:

        ==============  ===================================================
        on              SMT is supported by the CPU and enabled. All
                        logical CPUs can be onlined and offlined without
                        restrictions.

        off             SMT is supported by the CPU and disabled. Only
                        the so called primary SMT threads can be onlined
                        and offlined without restrictions. An attempt to
                        online a non-primary sibling is rejected.

        forceoff        Same as 'off' but the state cannot be controlled.
                        Attempts to write to the control file are rejected.

        notsupported    The processor does not support SMT. It's therefore
                        not affected by the SMT implications of L1TF.
                        Attempts to write to the control file are rejected.
        ==============  ===================================================

     The possible states which can be written into this file to control SMT
     state are:

     - on
     - off
     - forceoff

   /sys/devices/system/cpu/smt/active:

     This file reports whether SMT is enabled and active, i.e. if on any
     physical core two or more sibling threads are online.

   SMT control is also possible at boot time via the l1tf kernel command
   line parameter in combination with L1D flush control. See
   :ref:`mitigation_control_command_line`.
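
   A minimal sketch of a helper around the two sysfs files could look as
   follows. The function names are hypothetical, and the sketch assumes
   that the ``active`` file contains ``1`` when SMT is active.

   .. code-block:: python

      from pathlib import Path

      SMT_DIR = Path("/sys/devices/system/cpu/smt")
      WRITABLE_STATES = {"on", "off", "forceoff"}

      def smt_status(smt_dir: Path = SMT_DIR) -> dict:
          """Read the SMT control state and whether SMT is currently active."""
          return {
              "control": (smt_dir / "control").read_text().strip(),
              "active": (smt_dir / "active").read_text().strip() == "1",
          }

      def set_smt(state: str, smt_dir: Path = SMT_DIR) -> None:
          """Write a new SMT control state (needs root on a real system)."""
          if state not in WRITABLE_STATES:
              raise ValueError(f"unsupported SMT state: {state!r}")
          current = (smt_dir / "control").read_text().strip()
          # The kernel rejects writes in these states, so fail early.
          if current in ("forceoff", "notsupported"):
              raise PermissionError(f"SMT state {current!r} cannot be changed")
          (smt_dir / "control").write_text(state)

   Checking the current state first mirrors the table above: in the
   'forceoff' and 'notsupported' states writes to the control file are
   rejected.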

5. Disabling EPT
^^^^^^^^^^^^^^^^

  Disabling EPT for virtual machines provides full mitigation for L1TF even
  with SMT enabled, because the effective page tables for guests are
  managed and sanitized by the hypervisor. However, disabling EPT has a
  significant performance impact, especially when the Meltdown mitigation
  KPTI is enabled.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows controlling the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

  ============  =============================================================
  full          Provides all available mitigations for the L1TF
                vulnerability. Disables SMT and enables all mitigations in
                the hypervisors, i.e. unconditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  full,force    Same as 'full', but disables SMT and L1D flush runtime
                control. Implies the 'nosmt=force' command line option.
                (i.e. sysfs control of SMT is disabled.)

  flush         Leaves SMT enabled and enables the default hypervisor
                mitigation, i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
                i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
                started in a potentially insecure configuration.

  off           Disables hypervisor mitigations and doesn't emit any
                warnings. It also drops the swap size and available RAM
                limit restrictions on both hypervisor and bare metal.
  ============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
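
The table can be condensed into a small lookup, for example to check which
settings a given command line selects. This is only a sketch of the
documented semantics: the dictionary keys and the function name are made up,
taking the last "l1tf=" occurrence is a modeling choice, and raising an
error for unknown arguments is a choice of the sketch rather than kernel
behaviour.

.. code-block:: python

   # Settings implied by each documented "l1tf=" argument.
   L1TF_MODES = {
       "full":         {"smt_disabled": True,  "l1d_flush": "always", "warn": True,  "force": False},
       "full,force":   {"smt_disabled": True,  "l1d_flush": "always", "warn": True,  "force": True},
       "flush":        {"smt_disabled": False, "l1d_flush": "cond",   "warn": True,  "force": False},
       "flush,nosmt":  {"smt_disabled": True,  "l1d_flush": "cond",   "warn": True,  "force": False},
       "flush,nowarn": {"smt_disabled": False, "l1d_flush": "cond",   "warn": False, "force": False},
       "off":          {"smt_disabled": False, "l1d_flush": "never",  "warn": False, "force": False},
   }

   def l1tf_settings(cmdline: str) -> dict:
       """Return the settings selected by a kernel command line string."""
       mode = "flush"                      # documented default
       for token in cmdline.split():
           if token.startswith("l1tf="):
               mode = token[len("l1tf="):]
       if mode not in L1TF_MODES:
           raise ValueError(f"unknown l1tf argument: {mode!r}")
       return {"mode": mode, **L1TF_MODES[mode]}

On a running system the string to pass in would typically come from
``/proc/cmdline``.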


.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

  ============  ==============================================================
  always        L1D cache flush on every VMENTER.

  cond          Flush L1D on VMENTER only when the code between VMEXIT and
                VMENTER can leak host memory which is considered
                interesting for an attacker. This can still leak host
                memory which allows, e.g., determining the host's address
                space layout.

  never         Disables the mitigation.
  ============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and modified at runtime via the sysfs
file:

/sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
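
Runtime access to the parameter can be wrapped as in the following sketch.
The helper names are hypothetical; only the sysfs path and the three
argument values are taken from the text above.

.. code-block:: python

   from pathlib import Path

   PARAM = Path("/sys/module/kvm_intel/parameters/vmentry_l1d_flush")
   VALID_MODES = ("always", "cond", "never")

   def get_l1d_flush_mode(param: Path = PARAM) -> str:
       """Return the L1D flush mode currently reported by kvm_intel."""
       return param.read_text().strip()

   def set_l1d_flush_mode(mode: str, param: Path = PARAM) -> None:
       """Select an L1D flush mode at runtime (needs root)."""
       if mode not in VALID_MODES:
           raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
       # With 'l1tf=full,force' the kernel rejects the write, which
       # surfaces here as an OSError.
       param.write_text(mode)

Validating the value before writing keeps typos from reaching the kernel,
while a rejected write is reported to the caller unchanged.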

.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The system is protected by the kernel unconditionally and no further
   action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   If the guest comes from a trusted source and the guest OS kernel is
   guaranteed to have the L1TF mitigations in place the system is fully
   protected against L1TF and no further action is required.

   To avoid the overhead of the default L1D flushing on VMENTER the
   administrator can disable the flushing via the kernel command line and
   sysfs control files. See :ref:`mitigation_control_command_line` and
   :ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

  If SMT is not supported by the processor or disabled in the BIOS or by
  the kernel, it's only required to enforce L1D flushing on VMENTER.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

  If EPT is not supported by the processor or disabled in the hypervisor,
  the system is fully protected. SMT can stay enabled and L1D flushing on
  VMENTER is not required.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

  If SMT and EPT are supported and active, then various degrees of
  mitigation can be employed:

  - L1D flushing on VMENTER:

    L1D flushing on VMENTER is the minimal protection requirement, but it
    is only effective in combination with other mitigation methods.

    Conditional L1D flushing is the default behaviour and can be tuned. See
    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

  - Guest confinement:

    Confining guests to a single physical core or a group of physical
    cores which are not running any other processes can reduce the attack
    surface significantly, but interrupts, soft interrupts and kernel
    threads can still expose valuable data to a potential attacker. See
    :ref:`guest_confinement`.

  - Interrupt isolation:

    Isolating the guest CPUs from interrupts can reduce the attack surface
    further, but still allows a malicious guest to explore a limited amount
    of host physical memory. This can at least be used to gain knowledge
    about the host address space layout. Interrupts which have a fixed
    affinity to the CPUs which run the untrusted guests can, depending on
    the scenario, still trigger soft interrupts and schedule kernel threads
    which might expose valuable information. See
    :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

  - Disabling SMT:

    Disabling SMT and enforcing the L1D flushing provides the maximum
    amount of protection. This mitigation does not depend on any of the
    above mitigation methods.

    SMT control and L1D flushing can be tuned by the command line
    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
    time with the matching sysfs control files. See :ref:`smt_control`,
    :ref:`mitigation_control_command_line` and
    :ref:`mitigation_control_kvm`.

  - Disabling EPT:

    Disabling EPT provides the maximum amount of protection as well. It
    does not depend on any of the above mitigation methods. SMT can stay
    enabled and L1D flushing is not required, but the performance impact is
    significant.

    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
    parameter.
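The combinations described in this section can be summarized in a small
decision helper. This is a simplified sketch, not kernel code: the function
name, its parameters and the returned labels are made up for illustration,
and guest confinement and interrupt isolation are not modelled:

```python
def l1tf_guest_protection(smt_active: bool, ept_active: bool,
                          l1d_flush: bool) -> str:
    """Classify protection against a malicious guest (illustrative only)."""
    if not ept_active:
        # Without EPT, SMT can stay enabled and no L1D flush is required.
        return "full"
    if not smt_active and l1d_flush:
        # SMT disabled and L1D flushing enforced on VMENTER.
        return "full"
    if l1d_flush:
        # SMT and EPT active: flushing alone leaves a residual attack
        # surface that has to be analyzed.
        return "partial"
    return "vulnerable"
```

The "partial" label corresponds to the case where the remaining attack
surface has to be analyzed for the specific deployment. It does not
quantify that risk.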

3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine. VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

 - Flush the L1D cache on every switch from the nested hypervisor to the
   nested virtual machine, so that the nested hypervisor's secrets are not
   exposed to the nested virtual machine;

 - Flush the L1D cache on every switch from the nested virtual machine to
   the nested hypervisor; this is a complex operation, and flushing the L1D
   cache prevents the bare metal hypervisor's secrets from being exposed
   to the nested virtual machine;

 - Instruct the nested hypervisor to not perform any L1D cache flush. This
   is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

  The kernel default mitigations for vulnerable processors are:

  - PTE inversion to protect against malicious user space. This is done
    unconditionally and cannot be controlled. The swap storage is limited
    to ~16TB.

  - L1D conditional flushing on VMENTER when EPT is enabled for
    a guest.

  The kernel does not by default enforce the disabling of SMT, which leaves
  SMT systems vulnerable when running untrusted guests with EPT enabled.

  The rationale for this choice is:

  - Forcibly disabling SMT can break existing setups, especially with
    unattended updates.

  - If regular users run untrusted guests on their machine, then L1TF is
    just an addition to other malware which might be embedded in an
    untrusted guest, e.g. spam-bots or attacks on the local network.

    There is no technical way to prevent a user from blindly running
    untrusted code on their machines.

  - It is technically extremely unlikely, and from today's knowledge even
    impossible, that L1TF can be exploited via the most popular attack
    mechanisms like JavaScript, because these mechanisms have no way to
    control PTEs. If this were possible and no other mitigation were
    available, then the default might be different.

  - The administrators of cloud and hosting setups have to carefully
    analyze the risk for their scenarios and make the appropriate
    mitigation choices, which might even vary across their deployed
    machines and also result in other changes of their overall setup.
    There is no way for the kernel to provide a sensible default for this
    kind of scenario.
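  The defaults described above are normally visible in the L1TF sysfs file.
  The sketch below parses such a status line. It assumes the format
  'Mitigation: PTE Inversion; VMX: <state>, SMT <state>' as reported in
  /sys/devices/system/cpu/vulnerabilities/l1tf, and the function name is
  hypothetical:

```python
def parse_l1tf_status(line: str) -> dict:
    """Split an L1TF sysfs status line into its components."""
    text = line.strip()
    status = {"affected": text != "Not affected",
              "pte_inversion": False, "vmx": None, "smt": None}
    if not status["affected"]:
        return status
    # Drop the leading "Mitigation: " label if it is present.
    prefix = "Mitigation: "
    body = text[len(prefix):] if text.startswith(prefix) else text
    for part in (p.strip() for p in body.split(";")):
        if part == "PTE Inversion":
            status["pte_inversion"] = True
        elif part.startswith("VMX:"):
            # The VMX part holds the flush state and, optionally, SMT state.
            for field in (f.strip() for f in part[4:].split(",")):
                if field.startswith("SMT "):
                    status["smt"] = field[4:]
                elif field:
                    status["vmx"] = field
    return status
```

  With the default mitigations on an SMT system, a line such as
  'Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable'
  would yield vmx set to 'conditional cache flushes' and smt set to
  'vulnerable', which matches the unenforced SMT default described above.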