xref: /OK3568_Linux_fs/kernel/Documentation/x86/mds.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
Microarchitectural Data Sampling (MDS) mitigation
=================================================

.. _mds:

Overview
--------

Microarchitectural Data Sampling (MDS) is a family of side channel attacks
on internal buffers in Intel CPUs. The variants are:

 - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
 - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
 - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
 - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
dependent load (store-to-load forwarding) as an optimization. The forward
can also happen to a faulting or assisting load operation for a different
memory address, which can be exploited under certain conditions. Store
buffers are partitioned between Hyper-Threads so cross thread forwarding is
not possible. But if a thread enters or exits a sleep state the store
buffer is repartitioned which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
L1 miss situations and to hold data which is returned or sent in response
to a memory or I/O operation. Fill buffers can forward data to a load
operation and also write data to the cache. When the fill buffer is
deallocated it can retain the stale data of the preceding operations which
can then be forwarded to a faulting or assisting load operation, which can
be exploited under certain conditions. Fill buffers are shared between
Hyper-Threads so cross thread leakage is possible.

MLPDS leaks Load Port Data. Load ports are used to perform load operations
from memory or I/O. The received data is then forwarded to the register
file or a subsequent operation. In some implementations the Load Port can
contain stale data from a previous operation which can be forwarded to
faulting or assisting loads under certain conditions, which again can be
exploited eventually. Load ports are shared between Hyper-Threads so cross
thread leakage is possible.

MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
memory that takes a fault or assist can leave data in a microarchitectural
structure that may later be observed using one of the same methods used by
MSBDS, MFBDS or MLPDS.

Exposure assumptions
--------------------

It is assumed that attack code resides in user space or in a guest with one
exception. The rationale behind this assumption is that the code construct
needed for exploiting MDS requires:

 - to control the load to trigger a fault or assist

 - to have a disclosure gadget which exposes the speculatively accessed
   data for consumption through a side channel

 - to control the pointer through which the disclosure gadget exposes the
   data

The existence of such a construct in the kernel cannot be excluded with
100% certainty, but the complexity involved makes it extremely unlikely.

There is one exception, which is untrusted BPF. The functionality of
untrusted BPF is limited, but it needs to be thoroughly investigated
whether it can be used to create such a construct.


Mitigation strategy
-------------------

All variants have the same mitigation strategy at least for the single CPU
thread case (SMT off): Force the CPU to clear the affected buffers.

This is achieved by using the otherwise unused and obsolete VERW
instruction in combination with a microcode update. The microcode clears
the affected CPU buffers when the VERW instruction is executed.

For virtualization there are two ways to achieve CPU buffer clearing:
either via the modified VERW instruction or via the L1D Flush command.
The latter is issued when the L1TF mitigation is enabled, so the extra
VERW can be avoided. If the CPU is not affected by L1TF then VERW needs
to be issued.

If the VERW instruction with the supplied segment selector argument is
executed on a CPU without the microcode update there is no side effect
other than a small number of pointlessly wasted CPU cycles.

This does not protect against cross Hyper-Thread attacks except for MSBDS
which is only exploitable cross Hyper-Thread when one of the Hyper-Threads
enters a C-state.

The kernel provides a function to invoke the buffer clearing:

    mds_clear_cpu_buffers()

The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
(idle) transitions.

As a special quirk to address virtualization scenarios where the host has
the microcode updated, but the hypervisor does not (yet) expose the
MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
hope that it might actually clear the buffers. The state is reflected
accordingly.

According to current knowledge additional mitigations inside the kernel
itself are not required because the necessary gadgets to expose the leaked
data cannot be controlled in a way which allows exploitation from malicious
user space or VM guests.

Kernel internal mitigation modes
--------------------------------

 ======= ============================================================
 off      Mitigation is disabled. Either the CPU is not affected or
          mds=off is supplied on the kernel command line.

 full     Mitigation is enabled. CPU is affected and MD_CLEAR is
          advertised in CPUID.

 vmwerv   Mitigation is enabled. CPU is affected and MD_CLEAR is not
          advertised in CPUID. That is mainly for virtualization
          scenarios where the host has the updated microcode but the
          hypervisor does not expose MD_CLEAR in CPUID. It's a best
          effort approach without guarantee.
 ======= ============================================================

If the CPU is affected and mds=off is not supplied on the kernel command
line then the kernel selects the appropriate mitigation mode depending on
the availability of the MD_CLEAR CPUID bit.

Mitigation points
-----------------

1. Return to user space
^^^^^^^^^^^^^^^^^^^^^^^

   When transitioning from kernel to user space the CPU buffers are flushed
   on affected CPUs when the mitigation is not disabled on the kernel
   command line. The mitigation is enabled through the static key
   mds_user_clear.

   The mitigation is invoked in prepare_exit_to_usermode() which covers
   all but one of the kernel to user space transitions.  The exception
   is when we return from a Non Maskable Interrupt (NMI), which is
   handled directly in do_nmi().

   (The reason that NMI is special is that prepare_exit_to_usermode() can
    enable IRQs.  In NMI context, NMIs are blocked, and we don't want to
    enable IRQs with NMIs blocked.)


2. C-State transition
^^^^^^^^^^^^^^^^^^^^^

   When a CPU goes idle and enters a C-State the CPU buffers need to be
   cleared on affected CPUs when SMT is active. This addresses the
   repartitioning of the store buffer when one of the Hyper-Threads enters
   a C-State.

   When SMT is inactive, i.e. either the CPU does not support it or all
   sibling threads are offline, CPU buffer clearing is not required.

   The idle clearing is enabled on CPUs which are only affected by MSBDS
   and not by any other MDS variant. The other MDS variants cannot be
   protected against cross Hyper-Thread attacks because the Fill Buffer and
   the Load Ports are shared. So on CPUs affected by other variants, the
   idle clearing would be a window dressing exercise and is therefore not
   activated.

   The invocation is controlled by the static key mds_idle_clear which is
   switched depending on the chosen mitigation mode and the SMT state of
   the system.

   The buffer clear is only invoked before entering the C-State, to
   prevent stale data from the idling CPU from spilling to the
   Hyper-Thread sibling after the store buffer got repartitioned and all
   entries became available to the non-idle sibling.

   When coming out of idle the store buffer is partitioned again so each
   sibling has half of it available. The CPU returning from idle could
   then be speculatively exposed to the contents of the sibling's buffers.
   The buffers are flushed either on exit to user space or on VMENTER, so
   malicious code in user space or the guest cannot speculatively access
   them.

   The mitigation is hooked into all variants of halt()/mwait(), but does
   not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
   was superseded by the intel_idle driver around 2010 and intel_idle is
   preferred on all affected CPUs which are expected to gain the MD_CLEAR
   functionality in microcode. Aside from that, the IO-Port mechanism is a
   legacy interface which is only used on older systems which are either
   not affected or do not receive microcode updates anymore.