.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile-time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although _complete_, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap, giving
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage.  One PTE to lock, one set of
accessed bits, dirty bits, etc...

Overhead
========

Protection against side-channel attacks is important.  But,
this protection comes at a cost:

1. Increased Memory Use

   a. Each process now needs an order-1 PGD instead of order-0.
      (Consumes an additional 4k per process).
   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
      aligned so that it can be mapped by setting a single PMD
      entry.  This consumes nearly 2MB of RAM once the kernel
      is decompressed, but no space in the kernel image itself.

2. Runtime Cost

   a. CR3 manipulation to switch between the page table copies
      must be done at interrupt, syscall, and exception entry
      and exit (it can be skipped when the kernel is interrupted,
      though.)  Moves to CR3 are on the order of a hundred
      cycles, and are required at every entry and exit.
   b. A "trampoline" must be used for SYSCALL entry.  This
      trampoline depends on a smaller set of resources than the
      non-PTI SYSCALL entry code, so requires mapping fewer
      things into the userspace page tables.  The downside is
      that stacks must be switched at entry time.
   c. Global pages are disabled for all kernel structures not
      mapped into both kernel and userspace page tables.  This
      feature of the MMU allows different processes to share TLB
      entries mapping the kernel.  Losing the feature means more
      TLB misses after a context switch.
      The actual loss of
      performance is very small, however, never exceeding 1%.
   d. Process Context IDentifiers (PCID) is a CPU feature that
      allows us to skip flushing the entire TLB when switching page
      tables by setting a special bit in CR3 when the page tables
      are changed.  This makes switching the page tables (at context
      switch, or kernel entry/exit) cheaper.  But, on systems with
      PCID support, the context switch code must flush both the user
      and kernel entries out of the TLB.  The user PCID TLB flush is
      deferred until the exit to userspace, minimizing the cost.
      See intel.com/sdm for the gory PCID/INVPCID details.
   e. The userspace page tables must be populated for each new
      process.  Even without PTI, the shared kernel mappings
      are created by copying top-level (PGD) entries into each
      new process.  But, with PTI, there are now *two* kernel
      mappings: one in the kernel page tables that maps everything
      and one for the entry/exit structures.  At fork(), we need to
      copy both.
   f. In addition to the fork()-time copying, there must also
      be an update to the userspace PGD any time a set_pgd() is done
      on a PGD used to map userspace.  This ensures that the kernel
      and userspace copies always map the same userspace
      memory.
   g. On systems without PCID support, each CR3 write flushes
      the entire TLB.  That means that each syscall, interrupt
      or exception flushes the TLB.
   h. INVPCID is a TLB-flushing instruction which allows flushing
      of TLB entries for non-current PCIDs.  Some systems support
      PCIDs, but do not support INVPCID.  On these systems, addresses
      can only be flushed from the TLB for the current PCID.  When
      flushing a kernel address, we need to flush all PCIDs, so a
      single kernel address flush will require a TLB-flushing CR3
      write upon the next use of every PCID.

Possible Future Work
====================
1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
========

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).
   This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

      while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the trampoline code or mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at first interrupt that interrupts userspace.  The paths
   in entry_64.S that return to userspace are sometimes separate
   from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf