======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================
The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace and making
in-kernel APIs hard to use incorrectly to minimizing the areas of
writable kernel memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.
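
As an illustrative Kconfig fragment (not a verbatim copy of any
architecture's Kconfig), an architecture opts into the prompt like so:

```
config ARM
	# offer the STRICT_*_RWX prompt rather than forcing it on
	select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
	# default the prompt to "on" for this architecture
	select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
```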

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at ``__init`` time, these can
be marked with the ``__ro_after_init`` attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of syscalls normally available to
unprivileged userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is the stack buffer overflow, in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, need a similar
secret value.

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb]
in certain circumstances [*]).  Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
317*4882a593Smuzhiyunit should automatically censor sensitive values.
318