===================================
Light-weight System Calls for IA-64
===================================

  Started: 13-Jan-2003

  Last update: 27-Sep-2003

  David Mosberger-Tang
  <davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel. We call this mode the
"fsys-mode". To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory. The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory. The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state which all
        interruption-handlers start execution in. The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode. Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.
For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

	user_mode(regs)
	user_stack(task,regs)
	fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure. The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs. user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s). Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

	!user_mode(regs) && user_stack(task,regs)

How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table). This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall(). This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler. For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler. For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.
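
The table-with-fallback idea described above can be modelled in plain
user-space C. This is only a sketch of the dispatch pattern, not the
kernel's code: the real fsyscall_table is IA-64 assembly in fsys.S, and
every name below (fsys_dispatch_model, DEMO_NR_GETPID, the table size,
the base number and the returned values) is a hypothetical stand-in
chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this sketch only: a tiny table and made-up slot numbers. */
#define DEMO_FIRST_SYSCALL 1024                      /* illustrative base number */
#define DEMO_NUM_SYSCALLS  8                         /* illustrative table size  */
#define DEMO_NR_GETPID     (DEMO_FIRST_SYSCALL + 3)  /* hypothetical slot        */

typedef long (*fsys_handler_t)(long nr);

/* Counts how often the slow path ran, so the behaviour is observable. */
static long heavy_path_calls;

/* Stand-in for fsys_fallback_syscall(): the real routine enters full
 * kernel mode and calls the normal handler; here it only counts. */
static long fsys_fallback_model(long nr)
{
    (void)nr;
    heavy_path_calls++;
    return -1;
}

/* Stand-in for a hand-tuned handler such as fsys_getpid(). */
static long fsys_getpid_model(long nr)
{
    (void)nr;
    return 4242;  /* made-up pid */
}

static fsys_handler_t fsyscall_table_model[DEMO_NUM_SYSCALLS];

/* Every slot defaults to the fallback; fast handlers override single slots. */
static void fsys_table_init(void)
{
    for (size_t i = 0; i < DEMO_NUM_SYSCALLS; i++)
        fsyscall_table_model[i] = fsys_fallback_model;
    fsyscall_table_model[DEMO_NR_GETPID - DEMO_FIRST_SYSCALL] = fsys_getpid_model;
}

/* Look up the handler for a syscall number and run it. Out-of-range
 * numbers return -1 without touching the table (a simplification). */
static long fsys_dispatch_model(long nr)
{
    if (nr < DEMO_FIRST_SYSCALL || nr >= DEMO_FIRST_SYSCALL + DEMO_NUM_SYSCALLS)
        return -1;
    fsys_handler_t handler = fsyscall_table_model[nr - DEMO_FIRST_SYSCALL];
    assert(handler != NULL);  /* fsys_table_init() must have been called */
    return handler(nr);
}
```

After fsys_table_init(), dispatching DEMO_NR_GETPID runs the fast
handler, while any other in-range number goes through the fallback. The
point of the design is that adding a fast path is a one-slot change and
every call without one keeps its normal semantics.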

The entry and exit-state of an fsyscall handler is as follows:

Machine state on entry to fsyscall handler
------------------------------------------

  ========= ===============================================================
  r10       0
  r11       saved ar.pfs (a user-level value)
  r15       system call number
  r16       "current" task pointer (in normal kernel-mode, this is in r13)
  r32-r39   system call arguments
  b6        return address (a user-level value)
  ar.pfs    previous frame-state (a user-level value)
  PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
  -         all other registers may contain values passed in from user-mode
  ========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

  ========= ===========================================================
  r11       saved ar.pfs (as passed into the fsyscall handler)
  r15       system call number (as passed into the fsyscall handler)
  r32-r39   system call arguments (as passed into the fsyscall handler)
  b6        return address (as passed into the fsyscall handler)
  ar.pfs    previous frame-state (as passed into the fsyscall handler)
  ========= ===========================================================

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of
restrictions:

 * Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart. Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 * Fsyscall-handlers MUST check argument registers for containing a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault. If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 * Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 * Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 * Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level. In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 * Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can be done typically via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead. For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed. In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.
This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately. When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur. The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all. This
is in contrast to a regular interruption, which clears almost all
bits. Because of that, some care needs to be taken to ensure things
work as expected. The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode. A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed. PSR.be is normally NOT
        restored upon return from an fsys-mode handler. In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.ic  Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged. The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect. If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3. Note: the
        taken branch would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe. If the system call number is invalid,
        the fsys-mode handler will return directly to user-level. This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect. If set, "epc" will cause a Single Step Trap to
        be taken. The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged. Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted. If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged. Note: the ia64 linux kernel never sets this bit.
======= =======================================================================

Using fast system calls
=======================

To use fast system calls, userspace applications need simply call
__kernel_syscall_via_epc().
For example

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body

  mov r2 = 0xa000000000020660;;  // gate address
                                 // found by inspection of System.map for the
                                 // __kernel_syscall_via_epc() function. See
                                 // below for how to do this for real.

  mov b7 = r2
  mov r15 = 1087                 // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;

  .restore sp

  mov ar.pfs = r11
  br.ret.sptk.many rp;;          // return to caller
  END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h)

 * AT_SYSINFO : is the address of __kernel_syscall_via_epc()
 * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.
It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
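
On a glibc-based system, a user-space program can read both
auxiliary-vector entries with getauxval() from <sys/auxv.h>. The sketch
below is not taken from the kernel sources. The helper names
(gate_dso_base, gate_syscall_entry, looks_like_elf) are hypothetical,
and it assumes glibc 2.16 or later, where getauxval() is available and
returns 0 for an entry the kernel did not supply.

```c
#include <elf.h>       /* AT_SYSINFO, AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <string.h>    /* memcmp */
#include <sys/auxv.h>  /* getauxval() */

/* Address of the kernel-provided ELF DSO, or 0 if none was advertised. */
static unsigned long gate_dso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Address of the advertised system call entry point, or 0 if the
 * kernel did not pass AT_SYSINFO on this architecture. */
static unsigned long gate_syscall_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Returns 1 if `base` is non-zero and points at an ELF header,
 * as expected for a proper ELF shared object; 0 otherwise. */
static int looks_like_elf(unsigned long base)
{
    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

A caller would typically check that gate_dso_base() is non-zero and
passes looks_like_elf() before handing the address to code that parses
the DSO's symbol table. Whether gate_syscall_entry() returns a usable
address depends on the architecture, so a zero result has to be treated
as "not available" rather than as an error.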