=============================================================
An ad-hoc collection of notes on IA64 MCA and INIT processing
=============================================================

Feel free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous.  They can occur at any time and
in any OS state, including when one of the cpus is already holding a
spinlock.  Trying to take any lock from MCA/INIT state is asking for
deadlock.  Also the state of structures that are protected by locks is
indeterminate, including linked lists.

---

The complicated ia64 MCA process.  All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind; it is not as
if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL; they
  end up spinning disabled while the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt.  SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined.  In particular the MCA/INIT
  handlers cannot rely on the thread pointer; PAL physical mode can
  (and does) modify TP.  It is allowed to do that as long as it resets
  TP on return.  However MCA/INIT events expose us to these PAL
  internal TP changes.  Hence curr_task().

* If an MCA/INIT event occurs while the kernel is running (not user
  space) and the kernel has called PAL, then the MCA/INIT handler
  cannot assume that the kernel stack is in a fit state to be used,
  mainly because PAL may or may not maintain the stack pointer
  internally.  Because the MCA/INIT handlers cannot trust the kernel
  stack, they have to use their own, per-cpu stacks.  The MCA/INIT
  stacks are preformatted with just enough task state to let the
  relevant handlers do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1].  So switching to a new kernel stack means that
  we switch to a new task as well.  Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered.  The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks.  The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not.  That is, whether it is on a cpu or is blocked.  The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it.  The tasks that received an MCA or
  INIT event are no longer running; they have been converted to blocked
  tasks.  But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not.  Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu (see the sketch after footnote [1]
  below).  That gives us a decent chance of getting a valid backtrace
  of the _original_ task.

* MCA/INIT can be nested, to a depth of 2 on any cpu.  In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running.  Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu.  Then a recursive error gets a
  trace of the failing handler's "task".

[1]
    My (Keith Owens) original design called for ia64 to separate its
    struct task and the kernel stacks.  Then the MCA/INIT data would be
    chained stacks like i386 interrupt stacks.  But that required
    radical surgery on the rest of ia64, plus extra hard-wired TLB
    entries with their associated performance degradation.  David
    Mosberger vetoed that approach.  Which meant that separate kernel
    stacks required separate "tasks" for the MCA/INIT handlers.

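As a minimal sketch of the registration step (current kernels spell
the hook ia64_set_curr_task(); older ones used set_curr_task()), a
slave cpu that has just switched to its MCA/INIT stack might record
the original task roughly like this; the real logic lives in
arch/ia64/kernel/mca.c::

    #include <linux/sched.h>
    #include <linux/smp.h>

    static struct task_struct *register_mca_stack(void)
    {
        int cpu = smp_processor_id();
        /* The task that was running when the MCA/INIT event arrived. */
        struct task_struct *previous_current = curr_task(cpu);

        /*
         * Register the handler's pseudo task as what is now "running"
         * on this cpu, so the monarch's unwinder treats
         * previous_current as a blocked task.
         */
        ia64_set_curr_task(cpu, current);
        return previous_current;
    }
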
---

INIT is less complicated than MCA.  Pressing the nmi button or using
the equivalent command on the management console sends INIT to all
cpus.  SAL picks one of the cpus as the monarch and the rest are
slaves.  All the OS INIT handlers are entered at approximately the same
time.  The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen.  Alas there are broken
versions of SAL out there.  Some drive all the cpus as monarchs.  Some
drive them all as slaves.  Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves.  Some versions
of SAL cannot even cope with returning from the OS; they spin inside
SAL on resume.  The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations.  Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors.  When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.

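For example, a minimal sketch of the usual defence; the lock name and
the fallback helper here are hypothetical, purely for illustration::

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_log_lock);    /* hypothetical */

    /* Hypothetical lock-free sink, e.g. a per-cpu buffer. */
    static void emit_record(const char *msg)
    {
    }

    static void mca_safe_log(const char *msg)
    {
        /*
         * Never spin here: the interrupted context on this cpu may
         * already hold the lock, so spinning would deadlock.
         */
        if (!spin_trylock(&example_log_lock)) {
            emit_record(msg);    /* degrade, do not block */
            return;
        }
        emit_record(msg);
        spin_unlock(&example_log_lock);
    }
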
---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu.  MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested.  MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks.  ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed.  MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not.  ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.
Since the OS has no idea what unwind data is available for the user
space stack, MCA/INIT never tries to backtrace user space.  Which means
that the OS does not bother making the user space process look like a
blocked task, i.e. the OS does not copy pt_regs and switch_stack to the
user space stack.  Also the OS has no idea how big the user space RSE
and memory stacks are, which makes it too risky to copy the saved state
to a user mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

ia64_mca_modify_original_stack() in mca.c.  That function identifies
and verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp.  That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping.  To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.
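
As an outline only (the real function wraps every step in sanity
checks and falls back to a default task if anything looks corrupt),
the sequence is roughly::

    /*
     * Sketch of the steps described above; not the real
     * implementation, see ia64_mca_modify_original_stack() in
     * arch/ia64/kernel/mca.c.
     */
    static void modify_original_stack_outline(void)
    {
        /* 1. Verify that the original kernel stack looks sane. */

        /* 2. Copy the dirty RSE registers from the MCA/INIT stack's
         *    register backing store to the original stack's RSE. */

        /* 3. Lay down skeleton pt_regs and switch_stack frames on
         *    the original stack and fill them in from the PAL
         *    minstate area. */

        /* 4. Point the original task's thread.ksp at the new
         *    switch_stack, so it unwinds like any sleeping task. */
    }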

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task.  You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.

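A minimal sketch of such a check (the helper name is made up for
illustration; TIF_MCA_INIT itself is defined in
arch/ia64/include/asm/thread_info.h)::

    #include <linux/sched.h>
    #include <linux/thread_info.h>

    /* True for an MCA/INIT handler pseudo task, false otherwise. */
    static bool is_mca_init_handler(struct task_struct *t)
    {
        return test_ti_thread_flag(task_thread_info(t), TIF_MCA_INIT);
    }
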
The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET.  You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures.

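Written out as C, mirroring the formula above (a sketch only: the
authoritative definitions are the assembler-friendly constants in
mca_asm.h, ALIGN16 stands in for the 16 byte alignment, and the
MCA_SOS() accessor is made up for illustration)::

    #define ALIGN16(x)  ((x) & ~15UL)

    /* pt_regs sits at the top of the MCA/INIT stack... */
    #define MCA_PT_REGS_OFFSET \
        ALIGN16(KERNEL_STACK_SIZE - sizeof(struct pt_regs))

    /* ...and the sos data sits immediately below it. */
    #define MCA_SOS_OFFSET \
        ALIGN16(MCA_PT_REGS_OFFSET - sizeof(struct ia64_sal_os_state))

    /* The sos data for a given MCA/INIT stack: */
    #define MCA_SOS(stack_base) \
        ((struct ia64_sal_os_state *) \
         ((char *)(stack_base) + MCA_SOS_OFFSET))
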
Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use.  For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
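
A sketch of how that comm string might be formed (the helper name is
hypothetical; the real code does this as part of
ia64_mca_modify_original_stack(), where "type" is "MCA" or "INIT")::

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/string.h>

    static void tag_handler_comm(const char *type,
                                 struct task_struct *prev)
    {
        char comm[sizeof(current->comm)];

        /* e.g. "MCA 12159": pid 12159 was running when the MCA hit. */
        snprintf(comm, sizeof(comm), "%s %d", type, prev->pid);
        memcpy(current->comm, comm, sizeof(current->comm));
    }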