======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests.
It is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, to making in-kernel
APIs hard to use incorrectly, to minimizing the areas of writable kernel
memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets,
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc.
If these must exist in a kernel, they are implemented in a way where
the memory is temporarily made writable during the update, and then
returned to the original permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX`` to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
Variables that are initialized once at ``__init`` time can be marked
with the ``__ro_after_init`` attribute.

What remains are variables that are updated rarely (e.g. the GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.
The "seccomp" system is an opt-in feature made available to userspace
which provides a way to reduce the number of kernel entry points
available to a running process. This limits the breadth of kernel code
that can be reached, possibly reducing the availability of a given bug
to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of syscalls normally available to
unprivileged userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only root or a physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g.
monolithic kernel builds or the modules_disabled sysctl), or provide
signed modules (e.g. ``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with
LoadPin), to keep from having root load arbitrary kernel code via the
module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is the stack buffer overflow, in which the return address
stored on the stack is overwritten. Many other examples of this kind of
attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations.
With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under), this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g.
read-only memory cannot be written to), some protections provide only
statistical defense, in that an attack must gather enough information
about a running system to overcome the defense. While not perfect, these
do provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a
similar secret value.

It is critical that the secret values used are separate (e.g. a
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)
Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.
Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently ``%px``, ``%p[ad]``
(and ``%p[sSb]`` in certain circumstances [*]). Any file written to
using one of these specifiers should be readable only by privileged
processes.

Kernels 4.14 and older printed the raw address using ``%p``. As of
4.15-rc1, addresses printed with the specifier ``%p`` are hashed before
printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear the stack
on syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), and wipe heap memory
on free. This frustrates many uninitialized variable attacks, stack
content exposures, heap content exposures, and use-after-free attacks.

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.