.. SPDX-License-Identifier: GPL-2.0

=============
Kernel Stacks
=============

Kernel stacks on x86-64 bit
===========================

Most of the text from Keith Owens, hacked by AK

x86_64 page size (PAGE_SIZE) is 4K.

Like all other architectures, x86_64 has a kernel stack for every
active thread.  These thread stacks are THREAD_SIZE (4*PAGE_SIZE, i.e.
16K) big.  These stacks contain useful data as long as a thread is alive
or a zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.

In addition to the per-thread stacks, there are specialized stacks
associated with each CPU.  These stacks are only used while the kernel
is in control on that CPU; when a CPU returns to user space the
specialized stacks contain no useful data.  The main CPU stacks are:

* Interrupt stack.  IRQ_STACK_SIZE

  Used for external hardware interrupts.  If this is the first external
  hardware interrupt (i.e. not a nested hardware interrupt) then the
  kernel switches from the current task to the interrupt stack.  Like
  the split thread and interrupt stacks on i386, this gives more room
  for kernel interrupt processing without having to increase the size
  of every per-thread stack.

  The interrupt stack is also used when processing a softirq.

Switching to the kernel interrupt stack is done by software based on a
per-CPU interrupt nest counter. This is needed because x86-64 "IST"
hardware stacks cannot nest without races.
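
The nest-counter rule above can be sketched in user-space C.  This is
only an illustration under stated assumptions: ``struct cpu_state``,
``irq_enter_stack()`` and ``irq_exit_stack()`` are hypothetical names,
not kernel symbols, and the stack pointer is a plain integer rather than
a real register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; names are illustrative, not kernel symbols. */
struct cpu_state {
    int       irq_nest;       /* 0 means no hardware interrupt in progress */
    uintptr_t irq_stack_top;  /* top of this CPU's interrupt stack */
    uintptr_t sp;             /* simulated current stack pointer */
    uintptr_t saved_sp;       /* task stack pointer saved on first entry */
};

/* Enter an interrupt: switch stacks only for the outermost interrupt. */
static bool irq_enter_stack(struct cpu_state *cpu)
{
    bool switched = false;

    if (cpu->irq_nest == 0) {
        cpu->saved_sp = cpu->sp;
        cpu->sp = cpu->irq_stack_top;
        switched = true;
    }
    cpu->irq_nest++;
    return switched;
}

/* Leave an interrupt: go back to the task stack when the last one exits. */
static void irq_exit_stack(struct cpu_state *cpu)
{
    assert(cpu->irq_nest > 0);
    cpu->irq_nest--;
    if (cpu->irq_nest == 0)
        cpu->sp = cpu->saved_sp;
}
```

A nested interrupt finds the counter non-zero and simply keeps using the
interrupt stack, so only the outermost entry and exit touch the saved
task stack pointer.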

x86_64 also has a feature which is not available on i386, the ability
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64.  This feature is called the Interrupt Stack Table
(IST).  There can be up to 7 IST entries per CPU. The IST code is an
index into the Task State Segment (TSS). The IST entries in the TSS
point to dedicated stacks; each stack can be a different size.
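
As a rough picture of where the IST entries live, the following sketch
lays out the 64-bit TSS as the architecture manuals describe it.  The
struct and field names (``struct tss64``, ``ist_stack_top()``) are
assumptions made for this example and are not the kernel's own
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IST_ENTRIES 7   /* up to 7 IST entries per CPU */

/* 64-bit TSS in hardware layout; field names are illustrative. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];            /* stack pointers for privilege levels 0-2 */
    uint64_t reserved1;
    uint64_t ist[IST_ENTRIES];  /* IST1..IST7 stack pointers */
    uint32_t reserved2;
    uint32_t reserved3;
    uint16_t reserved4;
    uint16_t io_map_base;
} __attribute__((packed));

static_assert(offsetof(struct tss64, ist) == 36, "IST1 is at offset 0x24");
static_assert(sizeof(struct tss64) == 104, "64-bit TSS is 104 bytes");

/*
 * Return the stack pointer for a 1-based IST index, or 0 if the index
 * is 0 (meaning the IST is not used) or out of range.
 */
static uint64_t ist_stack_top(const struct tss64 *tss, unsigned int index)
{
    if (index == 0 || index > IST_ENTRIES)
        return 0;
    return tss->ist[index - 1];
}
```

Because each entry is just a pointer, nothing in the table constrains
the size of the stack it points to, which is why each IST stack can be
sized independently.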

An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor.  When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler.  If
the interrupt came from user mode, then the interrupt handler prologue
will switch back to the per-thread stack.  If software wants to allow
nested IST interrupts then the handler must adjust the IST values on
entry to and exit from the interrupt handler.  (This is occasionally
done, e.g. for debug exceptions.)
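
The stack selection described above can be modelled roughly as follows.
The descriptor layout follows the architecture manuals, but
``struct idt_gate64`` and ``pick_stack()`` are hypothetical names, and
the function is a simplified model that ignores stack alignment and
everything else the hardware does on delivery.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit interrupt-gate descriptor (16 bytes); names are illustrative. */
struct idt_gate64 {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;          /* bits 0-2: IST index, bits 3-7: reserved */
    uint8_t  type_attr;    /* gate type, DPL, present bit */
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed));

static_assert(sizeof(struct idt_gate64) == 16, "gate descriptors are 16 bytes");

/*
 * Simplified model of the stack chosen on delivery: a non-zero IST
 * index always loads the matching IST pointer; index 0 keeps the
 * current stack unless the event arrived from user mode, in which case
 * the privilege-level-0 stack pointer (rsp0) is used.
 */
static uint64_t pick_stack(const struct idt_gate64 *gate,
                           const uint64_t ist[7],
                           uint64_t current_sp, uint64_t rsp0,
                           int from_user)
{
    unsigned int index = gate->ist & 0x7;

    if (index != 0)
        return ist[index - 1];
    return from_user ? rsp0 : current_sp;
}
```

Note that for a non-zero index the result does not depend on the
interrupted context at all, which is exactly why two events sharing one
IST index start from the same address.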

Events with different IST codes (i.e. with different stacks) can be
nested.  For example, a debug interrupt can safely be interrupted by an
NMI.  arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
pointers on entry to and exit from all IST events, in theory allowing
IST events with the same code to be nested.  However, in most cases the
stack size allocated to an IST assumes no nesting for the same code.
If that assumption is ever broken then the stacks will become corrupt.

The currently assigned IST stacks are:

* ESTACK_DF.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 8 - Double Fault Exception (#DF).

  Invoked when handling one exception causes another exception. Happens
  when the kernel is very confused (e.g. kernel stack pointer corrupt).
  Using a separate stack allows the kernel to recover from it well enough
  in many cases to still output an oops.

* ESTACK_NMI.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for non-maskable interrupts (NMI).

  NMI can be delivered at any time, including when the kernel is in the
  middle of switching stacks.  Using IST for NMI events avoids making
  assumptions about the previous state of the kernel stack.

* ESTACK_DB.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for hardware debug interrupts (interrupt 1) and for software
  debug interrupts (INT3).

  When debugging a kernel, debug interrupts (both hardware and
  software) can occur at any time.  Using IST for these interrupts
  avoids making assumptions about the previous state of the kernel
  stack.

  To handle nested #DB correctly there exist two instances of DB stacks. On
  #DB entry the IST stack pointer for #DB is switched to the second instance
  so a nested #DB starts from a clean stack. The nested #DB switches
  the IST stack pointer to a guard hole to catch triple nesting.

* ESTACK_MCE.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 18 - Machine Check Exception (#MC).

  MCE can be delivered at any time, including when the kernel is in the
  middle of switching stacks.  Using IST for MCE events avoids making
  assumptions about the previous state of the kernel stack.
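
The #DB nesting scheme described in the ESTACK_DB entry above can be
modelled as a pointer that moves down one slot per entry.  This is a
sketch only: the three-slot layout, ``struct db_stacks`` and the
``db_*()`` helpers are assumptions made for illustration and do not
reproduce the kernel's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STKSZ 4096u   /* EXCEPTION_STKSZ is PAGE_SIZE in the list above */

/*
 * Hypothetical region layout, lowest address first:
 *   [ guard hole ][ second #DB stack ][ first #DB stack ]
 * Stacks grow down, so each stack's top is the end of its slot.
 */
struct db_stacks {
    uintptr_t base;  /* lowest address of the region */
    uintptr_t ist;   /* stack top the hardware would load on the next #DB */
};

static void db_init(struct db_stacks *db, uintptr_t base)
{
    db->base = base;
    db->ist = base + 3 * STKSZ;      /* top of the first #DB stack */
}

/* Entry: run on the current IST stack and point the IST one slot lower. */
static uintptr_t db_enter(struct db_stacks *db)
{
    uintptr_t sp = db->ist;

    assert(sp > db->base);           /* never shift below the region */
    db->ist -= STKSZ;
    return sp;
}

/* Exit: undo the shift made on entry. */
static void db_exit(struct db_stacks *db)
{
    db->ist += STKSZ;
}

/* True if a handler starting at sp would push into the guard hole. */
static bool db_hits_guard(const struct db_stacks *db, uintptr_t sp)
{
    return sp <= db->base + STKSZ;
}
```

In this model the first and second #DB each get a clean page, while a
third nested #DB starts at the top of the guard hole, so its first push
would fault instead of silently corrupting a live stack.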

For more details see the Intel IA32 or AMD AMD64 architecture manuals.


Printing backtraces on x86
==========================

The question about the '?' preceding function names in an x86 stacktrace
keeps popping up; here's an in-depth explanation. It helps if the reader
stares at print_context_stack() and the whole machinery in and around
arch/x86/kernel/dumpstack.c.

Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:

We always scan the full kernel stack for return addresses stored on
the kernel stack(s) [1]_, from stack top to stack bottom, and print out
anything that 'looks like' a kernel text address.

If it fits into the frame pointer chain, we print it without a question
mark, knowing that it's part of the real backtrace.

If the address does not fit into our expected frame pointer chain we
still print it, but we print a '?'. It can mean two things:

 - either the address is not part of the call chain: it's just stale
   values on the kernel stack, from earlier function calls. This is
   the common case.

 - or it is part of the call chain, but the frame pointer was not set
   up properly within the function, so we don't recognize it.

This way we will always print out the real call chain (plus a few more
entries), regardless of whether the frame pointer was set up correctly
or not - but in most cases we'll get the call chain right as well. The
entries printed are strictly in stack order, so you can deduce more
information from that as well.

The most important property of this method is that we _never_ lose
information: we always strive to print _all_ addresses on the stack(s)
that look like kernel text addresses, so if debug information is wrong,
we still print out the real call chain as well - just with more question
marks than ideal.
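
The scanning rule can be illustrated with a toy model.  Everything here
is an assumption made for the example: the stack is an array of words,
frame pointers are array indices rather than addresses, and
``TEXT_START``, ``TEXT_END`` and ``scan_stack()`` are hypothetical
names.  It is not the code in arch/x86/kernel/dumpstack.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bounds standing in for the kernel text section. */
#define TEXT_START 0x1000u
#define TEXT_END   0x2000u

static int looks_like_text(uintptr_t v)
{
    return v >= TEXT_START && v < TEXT_END;
}

/*
 * Scan stack[sp..n) from the most recent word to the oldest.  Frames
 * are modelled as: stack[fp] holds the index of the caller's frame and
 * stack[fp + 1] holds the return address.  Every text-looking word is
 * printed; words that are not the return slot of the current frame get
 * a '?' prefix.  Returns the number of entries printed without a '?'.
 */
static size_t scan_stack(const uintptr_t *stack, size_t n,
                         size_t sp, size_t fp, FILE *out)
{
    size_t reliable = 0;

    for (size_t i = sp; i < n; i++) {
        if (!looks_like_text(stack[i]))
            continue;
        if (fp < n - 1 && i == fp + 1) {
            fprintf(out, "  %#lx\n", (unsigned long)stack[i]);
            reliable++;
            fp = (size_t)stack[fp];   /* follow the chain to the caller */
        } else {
            fprintf(out, "  ? %#lx\n", (unsigned long)stack[i]);
        }
    }
    return reliable;
}
```

No text-looking word is ever skipped: a broken frame pointer chain only
changes which entries carry a '?', never whether they are printed.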

.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
       the right order, and try to cross from one stack into another
       reconstructing the call chain. This works most of the time.