=============================================================
An ad-hoc collection of notes on IA64 MCA and INIT processing
=============================================================

Feel free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous.  They can occur at any time, when
the OS is in any state, including when one of the cpus is already
holding a spinlock.  Trying to get any lock from MCA/INIT state is
asking for deadlock.  Also the state of structures that are protected
by locks is indeterminate, including linked lists.
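
For example, this is the kind of pattern that the no-locks rule forces
on MCA/INIT-context code.  This is a hedged sketch, not the actual
handler code; log_lock, the buffers and mca_record_error() are made-up
names, and the point is trylock-or-fallback, never spin:

  #include <linux/spinlock.h>
  #include <linux/smp.h>
  #include <linux/string.h>

  static DEFINE_SPINLOCK(log_lock);		/* hypothetical lock */
  static char shared_log[4096];			/* hypothetical shared buffer */
  static size_t shared_len;
  static char percpu_log[NR_CPUS][128];		/* hypothetical per-cpu fallback */

  /* Called from MCA/INIT context: must never wait for a lock. */
  static void mca_record_error(const char *msg)
  {
  	size_t n = strlen(msg);

  	if (spin_trylock(&log_lock)) {
  		/* The lock was free, so the shared buffer is consistent. */
  		if (shared_len + n <= sizeof(shared_log)) {
  			memcpy(shared_log + shared_len, msg, n);
  			shared_len += n;
  		}
  		spin_unlock(&log_lock);
  	} else {
  		/*
  		 * The lock holder may be spinning disabled in SAL and
  		 * will never unlock.  Fall back to private per-cpu
  		 * state: no locks, no shared lists, no deadlock.
  		 */
  		strlcpy(percpu_log[smp_processor_id()], msg,
  			sizeof(percpu_log[0]));
  	}
  }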

---

The ia64 MCA process is complicated.  All of this is mandated by
Intel's specification for ia64 SAL, error recovery and unwind; it is
not as if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL; they
  end up spinning disabled while the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt.  SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined.  In particular the MCA/INIT
  handlers cannot rely on the thread pointer; PAL physical mode can
  (and does) modify TP.  It is allowed to do that as long as it resets
  TP on return.  However MCA/INIT events expose us to these PAL
  internal TP changes.  Hence curr_task().

* If an MCA/INIT event occurs while the kernel was running (not user
  space) and the kernel has called PAL then the MCA/INIT handler cannot
  assume that the kernel stack is in a fit state to be used, mainly
  because PAL may or may not maintain the stack pointer internally.
  Because the MCA/INIT handlers cannot trust the kernel stack, they
  have to use their own, per-cpu stacks.  The MCA/INIT stacks are
  preformatted with just enough task state to let the relevant handlers
  do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1].  So switching to a new kernel stack means that
  we switch to a new task as well.  Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered.  The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks.  The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not.  That is, whether it is on a cpu or is blocked.  The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it.  The tasks that received an MCA or
  INIT event are no longer running, they have been converted to blocked
  tasks.  But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not.  Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu.  That gives us a decent chance of
  getting a valid backtrace of the _original_ task; see the sketch
  after this list.

* MCA/INIT can be nested, to a depth of 2 on any cpu.  In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running.  Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu.  Then a recursive error gets a
  trace of the failing handler's "task".

[1]
   My (Keith Owens) original design called for ia64 to separate its
   struct task and the kernel stacks.  Then the MCA/INIT data would be
   chained stacks like i386 interrupt stacks.  But that required
   radical surgery on the rest of ia64, plus extra hard wired TLB
   entries with their associated performance degradation.  David
   Mosberger vetoed that approach.  Which meant that separate kernel
   stacks meant separate "tasks" for the MCA/INIT handlers.
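
As a sketch only, this is roughly the bookkeeping the set_curr_task()
bullets above describe.  The real stack switch is done in assembler
(mca_asm.S) and the real C code lives in arch/ia64/kernel/mca.c;
mca_slave_entry() and mca_task are hypothetical names:

  #include <linux/sched.h>
  #include <linux/smp.h>

  /* Illustrative only: register the per-cpu MCA/INIT "task". */
  static void mca_slave_entry(struct task_struct *mca_task)
  {
  	int cpu = smp_processor_id();
  	struct task_struct *prev = curr_task(cpu);	/* interrupted task */

  	/*
  	 * Tell the scheduler bookkeeping that this cpu now runs the
  	 * MCA/INIT handler task.  The monarch then sees that 'prev'
  	 * is no longer on a cpu and unwinds it as a blocked task.
  	 */
  	set_curr_task(cpu, mca_task);

  	/* ... service the event on the per-cpu MCA/INIT stack ... */

  	/* On a recoverable event, put the original task back. */
  	set_curr_task(cpu, prev);
  }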

---

INIT is less complicated than MCA.  Pressing the nmi button or using
the equivalent command on the management console sends INIT to all
cpus.  SAL picks one of the cpus as the monarch and the rest are
slaves.  All the OS INIT handlers are entered at approximately the same
time.  The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen.  Alas there are broken
versions of SAL out there.  Some drive all the cpus as monarchs.  Some
drive them all as slaves.  Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves.  Some versions
of SAL cannot even cope with returning from the OS; they spin inside
SAL on resume.  The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations.  Unfortunately MCA/INIT start off as massive layer
violations (they can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors.  When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.

---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu.  MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested.  MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks.  ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed.  MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not.  ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.
Since the OS has no idea what unwind data is available for the user
space stack, MCA/INIT never tries to backtrace user space.  Which means
that the OS does not bother making the user space process look like a
blocked task, i.e. the OS does not copy pt_regs and switch_stack to the
user space stack.  Also the OS has no idea how big the user space RSE
and memory stacks are, which makes it too risky to copy the saved state
to a user mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

See mca.c::ia64_mca_modify_original_stack().  That function identifies
and verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp.  That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping.  To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.
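
Once ia64_mca_modify_original_stack() has made the original stack look
like a sleeping task, the backtrace itself is the normal ia64 unwind
loop.  A minimal sketch using the kernel unwind API from
include/asm-ia64/unwind.h (the printing is illustrative and
mca_backtrace_task() is a hypothetical name):

  #include <asm/unwind.h>

  /* Illustrative: dump the call chain of a now-blocked task. */
  static void mca_backtrace_task(struct task_struct *t)
  {
  	struct unw_frame_info info;
  	unsigned long ip;

  	/* Start from the state saved at thread.ksp. */
  	unw_init_from_blocked_task(&info, t);
  	do {
  		unw_get_ip(&info, &ip);
  		if (ip == 0)
  			break;
  		printk("  [<%016lx>]\n", ip);
  	} while (unw_unwind(&info) >= 0);
  }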

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task.  You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.

The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET.  You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures; a worked
sketch of the calculation is below.

Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use.  For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
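
As a sketch of the offset calculation above (the align16() helper and
the function names are assumptions for illustration; the authoritative
definition is MCA_SOS_OFFSET in mca_asm.h):

  #include <linux/sched.h>
  #include <asm/ptrace.h>
  #include <asm/mca.h>

  /* Round down to a 16 byte boundary. */
  static unsigned long align16(unsigned long n)
  {
  	return n & ~15UL;
  }

  /* Offset of struct ia64_sal_os_state within an MCA/INIT stack. */
  static unsigned long sos_offset(void)
  {
  	unsigned long off = KERNEL_STACK_SIZE;

  	off = align16(off - sizeof(struct pt_regs));		/* pt_regs */
  	off = align16(off - sizeof(struct ia64_sal_os_state));	/* sos */
  	return off;
  }

  /* Given the base of an MCA/INIT handler stack, find the original task. */
  static struct task_struct *original_task(void *stack_base)
  {
  	struct ia64_sal_os_state *sos = stack_base + sos_offset();

  	return sos->prev_task;
  }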