===================================
Light-weight System Calls for IA-64
===================================

		        Started: 13-Jan-2003

		    Last update: 27-Sep-2003

	              David Mosberger-Tang
		      <davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 Linux kernel.  We call this mode
"fsys-mode".  To recap, the normal states of execution are:

  - kernel mode:
	Both the register stack and the memory stack have been
	switched over to kernel memory.  The user-level state is saved
	in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
	Both the register stack and the memory stack are in
	user memory.  The user-level state is contained in the
	CPU registers.

  - bank 0 interruption-handling mode:
	This is the non-interruptible state which all
	interruption-handlers start execution in.  The user-level
	state remains in the CPU registers and some kernel state may
	be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

	user_mode(regs)
	user_stack(task,regs)
	fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

	!user_mode(regs) && user_stack(task,regs)
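
The relationship between the three predicates can be illustrated with a
small toy model in portable C.  This is only a sketch: the toy_regs and
toy_task structures, the toy_cpl enum and the toy_* functions are
hypothetical stand-ins invented for illustration, and they do not
reflect how the real macros inspect pt_regs or the task structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: only the two privilege levels Linux uses. */
enum toy_cpl { TOY_CPL_KERNEL = 0, TOY_CPL_USER = 3 };

/* Hypothetical, simplified stand-in for pt_regs. */
struct toy_regs {
	enum toy_cpl cpl;      /* privilege level of the saved state */
	bool on_user_stack;    /* stacks not yet switched to kernel memory */
};

/* Hypothetical, simplified stand-in for the task structure. */
struct toy_task {
	int pid;
};

/* TRUE if the saved state was executing at privilege level 3. */
bool toy_user_mode(const struct toy_regs *regs)
{
	assert(regs != NULL);
	return regs->cpl == TOY_CPL_USER;
}

/* TRUE if the saved state was executing on the user-level stack(s). */
bool toy_user_stack(const struct toy_task *task, const struct toy_regs *regs)
{
	assert(task != NULL && regs != NULL);
	return regs->on_user_stack;
}

/* fsys-mode: privileged, but still on the user-level stacks. */
bool toy_fsys_mode(const struct toy_task *task, const struct toy_regs *regs)
{
	return !toy_user_mode(regs) && toy_user_stack(task, regs);
}
```

In this model, a state with privilege level 0 and on_user_stack set is
the only combination for which toy_fsys_mode() returns true; plain user
mode and full kernel mode both yield false.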

How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.
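
The table-with-default-fallback arrangement described above can be
sketched in C as an array of function pointers.  This is an
illustrative model only: the real table lives in assembly, and every
toy_* name, the table size, the syscall number and the returned pid
below are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_SYSCALLS 8   /* hypothetical table size */
#define TOY_NR_GETPID   3   /* hypothetical syscall number */

/* Which path ended up servicing the call. */
enum toy_path { TOY_FAST_PATH, TOY_FULL_SYSCALL };

typedef enum toy_path (*toy_fsys_handler)(long *result);

/* Default entry: stands in for fsys_fallback_syscall(). */
enum toy_path toy_fallback_syscall(long *result)
{
	(void)result;            /* the full kernel path would fill this in */
	return TOY_FULL_SYSCALL;
}

/* Hand-tuned entry: stands in for fsys_getpid(). */
enum toy_path toy_getpid(long *result)
{
	*result = 42;            /* hypothetical pid */
	return TOY_FAST_PATH;
}

static toy_fsys_handler toy_table[TOY_NR_SYSCALLS];

/* One entry per system call; everything defaults to the fallback. */
void toy_table_init(void)
{
	for (size_t i = 0; i < TOY_NR_SYSCALLS; i++)
		toy_table[i] = toy_fallback_syscall;
	toy_table[TOY_NR_GETPID] = toy_getpid;
}

/* Look up the handler by system call number and invoke it. */
enum toy_path toy_dispatch(size_t nr, long *result)
{
	assert(nr < TOY_NR_SYSCALLS && result != NULL);
	return toy_table[nr](result);
}
```

After toy_table_init(), dispatching TOY_NR_GETPID takes the fast path,
while any other number in range is routed to the fallback.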

The entry and exit states of an fsyscall handler are as follows:

Machine state on entry to fsyscall handler
------------------------------------------

  ========= ===============================================================
  r10	    0
  r11	    saved ar.pfs (a user-level value)
  r15	    system call number
  r16	    "current" task pointer (in normal kernel-mode, this is in r13)
  r32-r39   system call arguments
  b6	    return address (a user-level value)
  ar.pfs    previous frame-state (a user-level value)
  PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
  -         all other registers may contain values passed in from user-mode
  ========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

  ========= ===========================================================
  r11	    saved ar.pfs (as passed into the fsyscall handler)
  r15	    system call number (as passed into the fsyscall handler)
  r32-r39   system call arguments (as passed into the fsyscall handler)
  b6	    return address (as passed into the fsyscall handler)
  ar.pfs    previous frame-state (as passed into the fsyscall handler)
  ========= ===========================================================

Fsyscall handlers can execute with very little overhead, but with that
118*4882a593SmuzhiyunFsyscall handlers can execute with very little overhead, but with that
119*4882a593Smuzhiyunspeed comes a set of restrictions:
120*4882a593Smuzhiyun
121*4882a593Smuzhiyun * Fsyscall-handlers MUST check for any pending work in the flags
122*4882a593Smuzhiyun   member of the thread-info structure and if any of the
123*4882a593Smuzhiyun   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
124*4882a593Smuzhiyun   doing a full system call (by calling fsys_fallback_syscall).
125*4882a593Smuzhiyun
126*4882a593Smuzhiyun * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
127*4882a593Smuzhiyun   r15, b6, and ar.pfs) because they will be needed in case of a
128*4882a593Smuzhiyun   system call restart.  Of course, all "preserved" registers also
129*4882a593Smuzhiyun   must be preserved, in accordance to the normal calling conventions.
130*4882a593Smuzhiyun
131*4882a593Smuzhiyun * Fsyscall-handlers MUST check argument registers for containing a
132*4882a593Smuzhiyun   NaT value before using them in any way that could trigger a
133*4882a593Smuzhiyun   NaT-consumption fault.  If a system call argument is found to
134*4882a593Smuzhiyun   contain a NaT value, an fsyscall-handler may return immediately
135*4882a593Smuzhiyun   with r8=EINVAL, r10=-1.
136*4882a593Smuzhiyun
137*4882a593Smuzhiyun * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
138*4882a593Smuzhiyun   any other operation that would trigger mandatory RSE
139*4882a593Smuzhiyun   (register-stack engine) traffic.
140*4882a593Smuzhiyun
141*4882a593Smuzhiyun * Fsyscall-handlers MUST NOT write to any stacked registers because
142*4882a593Smuzhiyun   it is not safe to assume that user-level called a handler with the
143*4882a593Smuzhiyun   proper number of arguments.
144*4882a593Smuzhiyun
145*4882a593Smuzhiyun * Fsyscall-handlers need to be careful when accessing per-CPU variables:
146*4882a593Smuzhiyun   unless proper safe-guards are taken (e.g., interruptions are avoided),
147*4882a593Smuzhiyun   execution may be pre-empted and resumed on another CPU at any given
148*4882a593Smuzhiyun   time.
149*4882a593Smuzhiyun
150*4882a593Smuzhiyun * Fsyscall-handlers must be careful not to leak sensitive kernel'
151*4882a593Smuzhiyun   information back to user-level.  In particular, before returning to
152*4882a593Smuzhiyun   user-level, care needs to be taken to clear any scratch registers
153*4882a593Smuzhiyun   that could contain sensitive information (note that the current
154*4882a593Smuzhiyun   task pointer is not considered sensitive: it's already exposed
155*4882a593Smuzhiyun   through ar.k6).
156*4882a593Smuzhiyun
157*4882a593Smuzhiyun * Fsyscall-handlers MUST NOT access user-memory without first
158*4882a593Smuzhiyun   validating access-permission (this can be done typically via
159*4882a593Smuzhiyun   probe.r.fault and/or probe.w.fault) and without guarding against
160*4882a593Smuzhiyun   memory access exceptions (this can be done with the EX() macros
161*4882a593Smuzhiyun   defined by asmmacro.h).
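
The first three rules (pending-work check, argument preservation and
NaT check) can be sketched as the control flow of a handler.  The C
model below is a hedged illustration rather than real handler code:
actual handlers are written in ia64 assembly, and the toy_* types, the
TOY_TIF_ALLWORK_MASK bit layout, the order of the two checks and the
"argument plus one" result are all assumptions chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TIF_ALLWORK_MASK 0x0fu   /* hypothetical bit layout */

/* Models one argument register together with its NaT bit. */
struct toy_arg {
	long value;
	bool nat;
};

/* Hypothetical snapshot of the state a handler looks at. */
struct toy_state {
	unsigned thread_flags;   /* flags member of the thread-info structure */
	struct toy_arg arg0;     /* models r32; must be preserved */
	long r8;                 /* result or error code */
	long r10;                /* 0 on entry, -1 on error */
};

enum toy_outcome { TOY_DONE, TOY_FALLBACK };

enum toy_outcome toy_handler(struct toy_state *s)
{
	assert(s != NULL);

	/* A NaT argument allows an immediate return with r8=EINVAL, r10=-1. */
	if (s->arg0.nat) {
		s->r8 = EINVAL;
		s->r10 = -1;
		return TOY_DONE;
	}

	/* Pending work: hand over to the full system call path. */
	if (s->thread_flags & TOY_TIF_ALLWORK_MASK)
		return TOY_FALLBACK;

	/* Fast path: compute a result without modifying the argument.
	 * r10 is left at its entry value of 0. */
	s->r8 = s->arg0.value + 1;   /* hypothetical result */
	return TOY_DONE;
}
```

Note that the fallback branch returns before touching r8 or the
argument, which mirrors the requirement that the incoming state stay
intact so the system call can be restarted.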

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.
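
The deferral sequence can be modelled as a tiny state machine.  The
following C sketch is a simplification under stated assumptions: the
toy_cpu structure and both toy_* functions are hypothetical, the trap
and the kernel exit path are collapsed into ordinary function calls,
and "delivering" a signal is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state involved in signal deferral. */
struct toy_cpu {
	bool in_fsys_mode;      /* interrupted task is in fsys-mode */
	bool psr_lp;            /* lower-privilege transfer trap enabled */
	bool signal_pending;
	int signals_delivered;
};

/* Models the check described for do_notify_resume_user(). */
void toy_notify_resume(struct toy_cpu *c)
{
	assert(c != NULL);
	if (c->in_fsys_mode) {
		c->psr_lp = true;   /* defer: arm the trap and return */
		return;
	}
	if (c->signal_pending) {
		c->signal_pending = false;
		c->signals_delivered++;
	}
}

/* Models the "br.ret" that lowers the privilege level. */
void toy_fsys_return(struct toy_cpu *c)
{
	assert(c != NULL);
	c->in_fsys_mode = false;
	if (c->psr_lp) {
		/* The trap handler clears PSR.lp again ... */
		c->psr_lp = false;
		/* ... and the kernel exit path then handles pending signals. */
		toy_notify_resume(c);
	}
}
```

A signal that arrives while in_fsys_mode is set stays pending until
toy_fsys_return() runs, at which point it is delivered exactly once.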

PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be	Cleared when entering fsys-mode.  A srlz.d instruction is used
	to ensure the CPU is in little-endian mode before the first
	load/store instruction is executed.  PSR.be is normally NOT
	restored upon return from an fsys-mode handler.  In other
	words, user-level code must not rely on PSR.be being preserved
	across a system call.
PSR.up	Unchanged.
PSR.ac	Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write fp-registers!
PSR.mfh	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
PSR.ic	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk	Unchanged.
PSR.dt	Unchanged.
PSR.dfl	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
PSR.dfh	Unchanged.  Note: fsys-mode handlers must not write fp-registers!
PSR.sp	Unchanged.
PSR.pp	Unchanged.
PSR.di	Unchanged.
PSR.si	Unchanged.
PSR.db	Unchanged.  The kernel prevents user-level from setting a hardware
	breakpoint that triggers at any privilege level other than
	3 (user-mode).
PSR.lp	Unchanged.
PSR.tb	Lazy redirect.  If a taken-branch trap occurs while in
	fsys-mode, the trap-handler modifies the saved machine state
	such that execution resumes in the gate page at
	syscall_via_break(), with privilege level 3.  Note: the
	taken branch would occur on the branch invoking the
	fsyscall-handler, at which point, by definition, a syscall
	restart is still safe.  If the system call number is invalid,
	the fsys-mode handler will return directly to user-level.  This
	return will trigger a taken-branch trap, but since the trap is
	taken _after_ restoring the privilege level, the CPU has already
	left fsys-mode, so no special treatment is needed.
PSR.rt	Unchanged.
PSR.cpl	Cleared to 0.
PSR.is	Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc	Unchanged.
PSR.it	Unchanged (guaranteed to be 1).
PSR.id	Unchanged.  Note: the ia64 Linux kernel never sets this bit.
PSR.da	Unchanged.  Note: the ia64 Linux kernel never sets this bit.
PSR.dd	Unchanged.  Note: the ia64 Linux kernel never sets this bit.
PSR.ss	Lazy redirect.  If set, "epc" will cause a Single Step Trap to
	be taken.  The trap handler then modifies the saved machine
	state such that execution resumes in the gate page at
	syscall_via_break(), with privilege level 3.
PSR.ri	Unchanged.
PSR.ed	Unchanged.  Note: This bit could only have an effect if an fsys-mode
	handler performed a speculative load that gets NaTted.  If so, this
	would be the normal & expected behavior, so no special treatment is
	needed.
PSR.bn	Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
	Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia	Unchanged.  Note: the ia64 Linux kernel never sets this bit.
======= =======================================================================

Using fast system calls
=======================

To use fast system calls, userspace applications simply need to call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body

  movl r2 = 0xa000000000020660;;  // gate address (64-bit immediate)
			       // found by inspection of System.map for the
			       // __kernel_syscall_via_epc() function.  See
			       // below for how to do this for real.

  mov b7 = r2
  mov r15 = 1087		       // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;

  .restore sp

  mov ar.pfs = r11
  br.ret.sptk.many rp;;	      // return to caller
  END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

 * AT_SYSINFO: the address of __kernel_syscall_via_epc()
 * AT_SYSINFO_EHDR: the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
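
On a Linux system whose C library provides getauxval() in
<sys/auxv.h>, the two auxiliary vector entries can be read from
userspace roughly as follows.  This is a minimal sketch, not part of
the original example: the function names gate_dso_base(),
syscall_entry_hint() and looks_like_elf() are made up for illustration,
and on architectures other than ia64 either value may be absent, in
which case getauxval() returns 0.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <string.h>
#include <sys/auxv.h>

/* Address of the kernel-provided ELF DSO, or 0 if the entry is absent. */
unsigned long gate_dso_base(void)
{
	return getauxval(AT_SYSINFO_EHDR);
}

/* Address of the syscall entry point, or 0 if the entry is absent. */
unsigned long syscall_entry_hint(void)
{
	return getauxval(AT_SYSINFO);
}

/* Check that the mapping at "base" starts with the ELF magic bytes. */
bool looks_like_elf(unsigned long base)
{
	assert(base != 0);
	return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

A caller would normally test the returned value against 0 before using
it, since the kernel is free to omit either entry.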