.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
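
Whether PTI actually ended up enabled can be verified on a running
system without a rebuild.  A minimal sketch using standard
sysfs/procfs paths (the 'vulnerabilities' directory is only present
on reasonably recent kernels)::

	# Report the Meltdown mitigation status; "Mitigation: PTI"
	# means PTI is active.
	if [ -r /sys/devices/system/cpu/vulnerabilities/meltdown ]; then
		cat /sys/devices/system/cpu/vulnerabilities/meltdown
	else
		echo "meltdown status file not available on this kernel"
	fi

	# Check whether PTI was overridden on the kernel command line.
	grep -o 'nopti\|pti=[a-z]*' /proc/cmdline || echo "no pti override"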

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although _complete_, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap; the fixmap
gives each CPU's copy of the area a compile-time-fixed virtual
address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage.  One PTE to lock, one set of
accessed bits, dirty bits, etc...

Overhead
========

Protection against side-channel attacks is important.  But,
this protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD instead of order-0.
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry.  This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost

  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (it can be skipped when the kernel is interrupted,
     though.)  Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. A "trampoline" must be used for SYSCALL entry.  This
     trampoline depends on a smaller set of resources than the
     non-PTI SYSCALL entry code, so requires mapping fewer
     things into the userspace page tables.  The downside is
     that stacks must be switched at entry time.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.  This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel.  Losing the feature means more
     TLB misses after a context switch.  The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed.  This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper.  But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB.  The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process.  Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process.  But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures.  At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace.  This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB.  That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs.  Some systems support
     PCIDs, but do not support INVPCID.  On these systems, addresses
     can only be flushed from the TLB for the current PCID.  When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.
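
Several of the costs above depend on whether the CPU supports PCID
and INVPCID.  A quick sketch for checking the flags the kernel
exposes (flag names as they appear in /proc/cpuinfo on x86)::

	# Systems with "pcid" but not "invpcid" pay the per-PCID
	# flushing cost described in item h. above.
	for flag in pcid invpcid; do
		if grep -qw "$flag" /proc/cpuinfo; then
			echo "$flag: supported"
		else
			echo "$flag: not supported"
		fi
	done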

Possible Future Work
====================
1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
========

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.
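
For step 3 above, it is worth confirming that perf is actually
generating a high NMI rate; the per-CPU counts in /proc/interrupts
should climb quickly while perf runs::

	# Print the performance-monitoring NMI counts for each CPU.
	grep '^ *NMI' /proc/interrupts || echo "no NMI line in /proc/interrupts"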

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the trampoline code or mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at first interrupt that interrupts userspace. The paths
   in entry_64.S that return to userspace are sometimes separate
   from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf