.. SPDX-License-Identifier: GPL-2.0

=============
Kernel Stacks
=============

Kernel stacks on x86-64 bit
===========================

Most of the text from Keith Owens, hacked by AK

x86_64 page size (PAGE_SIZE) is 4K.

Like all other architectures, x86_64 has a kernel stack for every
active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
These stacks contain useful data as long as a thread is alive or a
zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.

In addition to the per thread stacks, there are specialized stacks
associated with each CPU. These stacks are only used while the kernel
is in control on that CPU; when a CPU returns to user space the
specialized stacks contain no useful data. The main CPU stacks are:

* Interrupt stack. IRQ_STACK_SIZE

  Used for external hardware interrupts. If this is the first external
  hardware interrupt (i.e. not a nested hardware interrupt) then the
  kernel switches from the current task to the interrupt stack. Like
  the split thread and interrupt stacks on i386, this gives more room
  for kernel interrupt processing without having to increase the size
  of every per thread stack.

  The interrupt stack is also used when processing a softirq.
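
The "switch only on the first, non-nested hardware interrupt" rule can be
modelled in user space with a nesting counter. The following is a minimal
sketch, not kernel code: ``struct cpu_irq_state``, ``model_irq_enter()`` and
``model_irq_exit()`` are hypothetical names chosen for illustration, and the
switch back to the task stack on the outermost exit is an assumption of the
model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; these are not real kernel identifiers. */
struct cpu_irq_state {
	int  nest_count;    /* hardware interrupts currently in progress */
	bool on_irq_stack;  /* true while running on the interrupt stack */
};

/*
 * Model of interrupt entry. Only the first (non-nested) interrupt
 * switches to the interrupt stack; nested ones stay on it.
 * Returns true if this entry performed the switch.
 */
bool model_irq_enter(struct cpu_irq_state *cpu)
{
	bool switched = false;

	if (cpu->nest_count == 0) {
		cpu->on_irq_stack = true;
		switched = true;
	}
	cpu->nest_count++;
	return switched;
}

/*
 * Model of interrupt exit. Assumption: the outermost exit returns to
 * the interrupted task's stack.
 * Returns true if this exit left the interrupt stack.
 */
bool model_irq_exit(struct cpu_irq_state *cpu)
{
	assert(cpu->nest_count > 0);

	cpu->nest_count--;
	if (cpu->nest_count == 0) {
		cpu->on_irq_stack = false;
		return true;
	}
	return false;
}
```

In this model, two back-to-back calls to ``model_irq_enter()`` switch stacks
only once, and only the second ``model_irq_exit()`` leaves the interrupt stack.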

Switching to the kernel interrupt stack is done by software based on a
per CPU interrupt nest counter. This is needed because x86-64 "IST"
hardware stacks cannot nest without races.

x86_64 also has a feature which is not available on i386, the ability
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64. This feature is called the Interrupt Stack Table
(IST). There can be up to 7 IST entries per CPU. The IST code is an
index into the Task State Segment (TSS). The IST entries in the TSS
point to dedicated stacks; each stack can be a different size.

An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor. When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler. If
the interrupt came from user mode, then the interrupt handler prologue
will switch back to the per-thread stack. If software wants to allow
nested IST interrupts then the handler must adjust the IST values on
entry to and exit from the interrupt handler. (This is occasionally
done, e.g. for debug exceptions.)

Events with different IST codes (i.e. with different stacks) can be
nested. For example, a debug interrupt can safely be interrupted by an
NMI.
arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
pointers on entry to and exit from all IST events, in theory allowing
IST events with the same code to be nested. However, in most cases the
stack size allocated to an IST assumes no nesting for the same code.
If that assumption is ever broken then the stacks will become corrupt.

The currently assigned IST stacks are:

* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 8 - Double Fault Exception (#DF).

  Invoked when handling one exception causes another exception. Happens
  when the kernel is very confused (e.g. kernel stack pointer corrupt).
  Using a separate stack allows the kernel to recover from it well enough
  in many cases to still output an oops.

* ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for non-maskable interrupts (NMI).

  NMI can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for NMI events avoids making
  assumptions about the previous state of the kernel stack.

* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for hardware debug interrupts (interrupt 1) and for software
  debug interrupts (INT3).

  When debugging a kernel, debug interrupts (both hardware and
  software) can occur at any time. Using IST for these interrupts
  avoids making assumptions about the previous state of the kernel
  stack.

  To handle nested #DB correctly there exist two instances of DB stacks. On
  #DB entry the IST stack pointer for #DB is switched to the second instance
  so a nested #DB starts from a clean stack. The nested #DB switches
  the IST stack pointer to a guard hole to catch triple nesting.

* ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 18 - Machine Check Exception (#MC).

  MCE can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for MCE events avoids making
  assumptions about the previous state of the kernel stack.

For more details see the Intel IA32 or AMD AMD64 architecture manuals.


Printing backtraces on x86
==========================

The question about the '?' preceding function names in an x86 stacktrace
keeps popping up; here is an in-depth explanation. It helps if the reader
stares at print_context_stack() and the whole machinery in and around
arch/x86/kernel/dumpstack.c.

Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:

We always scan the full kernel stack for return addresses stored on
the kernel stack(s) [1]_, from stack top to stack bottom, and print out
anything that 'looks like' a kernel text address.

If it fits into the frame pointer chain, we print it without a question
mark, knowing that it's part of the real backtrace.

If the address does not fit into our expected frame pointer chain we
still print it, but we print a '?'. It can mean two things:

 - either the address is not part of the call chain: it's just stale
   values on the kernel stack, from earlier function calls. This is
   the common case.

 - or it is part of the call chain, but the frame pointer was not set
   up properly within the function, so we don't recognize it.

This way we will always print out the real call chain (plus a few more
entries), regardless of whether the frame pointer was set up correctly
or not - but in most cases we'll get the call chain right as well. The
entries printed are strictly in stack order, so you can deduce more
information from that as well.

The most important property of this method is that we _never_ lose
information: we always strive to print _all_ addresses on the stack(s)
that look like kernel text addresses, so if debug information is wrong,
we still print out the real call chain as well - just with more question
marks than ideal.

.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
       the right order, and try to cross from one stack into another
       reconstructing the call chain. This works most of the time.
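
The scanning rule described above can be illustrated with a small user-space
model. This is a hedged sketch, not the code in arch/x86/kernel/dumpstack.c:
``scan_stack()``, ``looks_like_text()`` and the ``TEXT_START``/``TEXT_END``
bounds are hypothetical, the stack is a plain array indexed from stack top,
and each saved frame pointer is assumed to hold the array index of the next
frame, with the return address in the slot right after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds of "kernel text" for this model only. */
#define TEXT_START ((uintptr_t)0x1000)
#define TEXT_END   ((uintptr_t)0x2000)

static int looks_like_text(uintptr_t v)
{
	return v >= TEXT_START && v < TEXT_END;
}

/*
 * Scan stack[0..nwords) from stack top to stack bottom and record every
 * word that looks like a text address. An entry is marked ' ' when it
 * sits in the return-address slot of the current frame (fp + 1), and
 * '?' otherwise. Nothing that looks like text is ever dropped.
 *
 * fp is the index of the first saved frame pointer slot. Returns the
 * number of entries written to addrs[] and marks[].
 */
size_t scan_stack(const uintptr_t *stack, size_t nwords, size_t fp,
		  uintptr_t *addrs, char *marks, size_t max)
{
	size_t n = 0;

	assert(stack && addrs && marks);

	for (size_t i = 0; i < nwords && n < max; i++) {
		if (!looks_like_text(stack[i]))
			continue;

		addrs[n] = stack[i];
		if (fp < nwords && i == fp + 1) {
			marks[n] = ' ';              /* fits the frame chain */
			fp = (size_t)stack[fp];      /* follow to next frame */
		} else {
			marks[n] = '?';              /* stale or unverified  */
		}
		n++;
	}
	return n;
}
```

For a model stack ``{0x1111, 3, 0x1200, 6, 0x1300, 0x1400, 8, 0x1500}``
scanned with ``fp = 1``, the entries 0x1200, 0x1300 and 0x1500 fit the frame
chain, while 0x1111 and 0x1400 are reported with a '?'. All five are kept in
stack order, which mirrors the "never lose information" property.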