.. _addsyscalls:

Adding a New System Call
========================

This document describes what's involved in adding a new system call to the
Linux kernel, over and above the normal submission advice in
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.


System Call Alternatives
------------------------

The first thing to consider when adding a new system call is whether one of
the alternatives might be suitable instead.  Although system calls are the
most traditional and most obvious interaction points between userspace and the
kernel, there are other possibilities -- choose what fits best for your
interface.

 - If the operations involved can be made to look like a filesystem-like
   object, it may make more sense to create a new filesystem or device.  This
   also makes it easier to encapsulate the new functionality in a kernel module
   rather than requiring it to be built into the main kernel.

     - If the new functionality involves operations where the kernel notifies
       userspace that something has happened, then returning a new file
       descriptor for the relevant object allows userspace to use
       ``poll``/``select``/``epoll`` to receive that notification.
     - However, operations that don't map to
       :manpage:`read(2)`/:manpage:`write(2)`-like operations
       have to be implemented as :manpage:`ioctl(2)` requests, which can lead
       to a somewhat opaque API.

 - If you're just exposing runtime system information, a new node in sysfs
   (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
   be more appropriate.  However, access to these mechanisms requires that the
   relevant filesystem is mounted, which might not always be the case (e.g.
   in a namespaced/sandboxed/chrooted environment).  Avoid adding any API to
   debugfs, as this is not considered a 'production' interface to userspace.
 - If the operation is specific to a particular file or file descriptor, then
   an additional :manpage:`fcntl(2)` command option may be more appropriate.  However,
   :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
   this option is best for when the new function is closely analogous to
   existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
   (for example, getting/setting a simple flag related to a file descriptor).
 - If the operation is specific to a particular task or process, then an
   additional :manpage:`prctl(2)` command option may be more appropriate.  As
   with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
   is best reserved for near-analogs of existing ``prctl()`` commands or
   getting/setting a simple flag related to a process.


Designing the API: Planning for Extension
-----------------------------------------

A new system call forms part of the API of the kernel, and has to be supported
indefinitely.  As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls --
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred
way to allow for future extensibility is to include a flags argument to the
system call.  To make sure that userspace programs can safely use flags
between kernel versions, check whether the flags value holds any unknown
flags, and reject the system call (with ``EINVAL``) if it does::

    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
        return -EINVAL;

(If no flags values are used yet, check that the flags argument is zero.)
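As a minimal sketch of that check, the fragment below wraps it in a standalone
function that can be compiled and exercised in userspace.  The numeric flag
values, the ``THING_VALID_FLAGS`` mask and the ``thing_check_flags()`` name are
assumptions made up for illustration; they are not taken from any real system
call.

```c
#include <errno.h>

/* Hypothetical flag bits, invented for this illustration. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u

/* Every flag bit this (hypothetical) kernel version understands. */
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/*
 * Return 0 if every bit set in flags is a known flag, -EINVAL otherwise.
 * Rejecting unknown bits now is what lets a later kernel give them a
 * meaning without changing the behaviour seen by existing programs.
 */
int thing_check_flags(unsigned int flags)
{
	if (flags & ~THING_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Collecting the known bits in one mask keeps the check to a single line as new
flags are added over time.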

For more sophisticated system calls that involve a larger number of arguments,
it's preferred to encapsulate the majority of the arguments into a structure
that is passed in by pointer.  Such a structure can cope with future extension
by including a size argument in the structure::

    struct xyzzy_params {
        u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
        u32 param_1;
        u64 param_2;
        u64 param_3;
    };

As long as any subsequently added field, say ``param_4``, is designed so that a
zero value gives the previous behaviour, then this allows both directions of
version mismatch:

 - To cope with a later userspace program calling an older kernel, the kernel
   code should check that any memory beyond the size of the structure that it
   expects is zero (effectively checking that ``param_4 == 0``).
 - To cope with an older userspace program calling a newer kernel, the kernel
   code can zero-extend a smaller instance of the structure (effectively
   setting ``param_4 = 0``).

See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
``kernel/events/core.c``) for an example of this approach.
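The two directions of version mismatch can be sketched as a userspace model of
the copy-in step.  This is not the ``perf_copy_attr()`` code: the function name
``xyzzy_copy_params()``, the ``XYZZY_PARAMS_SIZE_VER0`` minimum size and the
choice of ``-E2BIG`` for non-zero trailing bytes are assumptions of this
sketch, and plain ``memcpy()`` stands in for the ``copy_from_user()`` calls
that real kernel code would need.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct xyzzy_params {
	uint32_t size;     /* caller sets this to its own sizeof() */
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

/* Hypothetical size of the first published layout (size..param_2). */
#define XYZZY_PARAMS_SIZE_VER0 offsetof(struct xyzzy_params, param_3)

/*
 * Model of the copy-in step.  ubuf must point at at least as many readable
 * bytes as its leading size field claims; a real kernel would also bound
 * that size and use copy_from_user() instead of memcpy().
 */
int xyzzy_copy_params(struct xyzzy_params *kparams, const void *ubuf)
{
	const unsigned char *src = ubuf;
	size_t ksize = sizeof(*kparams);
	uint32_t usize;
	size_t i;

	memcpy(&usize, src, sizeof(usize));
	if (usize < XYZZY_PARAMS_SIZE_VER0)
		return -EINVAL;

	/* Newer userspace, older kernel: unknown trailing fields must be 0. */
	for (i = ksize; i < usize; i++)
		if (src[i] != 0)
			return -E2BIG;

	/* Older userspace, newer kernel: zero-extend the shorter struct. */
	memset(kparams, 0, ksize);
	memcpy(kparams, src, usize < ksize ? usize : ksize);
	return 0;
}
```

Because missing fields read as zero and unknown fields must be zero, the same
function serves callers built against older and newer headers.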


Designing the API: Other Considerations
---------------------------------------

If your new system call allows userspace to refer to a kernel object, it
should use a file descriptor as the handle for that object -- don't invent a
new type of userspace object handle when the kernel already has mechanisms and
well-defined semantics for using file descriptors.

If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
then the flags argument should include a value that is equivalent to setting
``O_CLOEXEC`` on the new FD.  This makes it possible for userspace to close
the timing window between ``xyzzy()`` and calling
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
``execve()`` in another thread could leak a descriptor to
the exec'ed program. (However, resist the temptation to re-use the actual value
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
numbering space of ``O_*`` flags that is fairly full.)
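Since ``xyzzy()`` does not exist, the userspace sketch below uses
:manpage:`eventfd(2)` as a stand-in to show the difference between the two-step
and the atomic way of getting a close-on-exec descriptor.  The helper names are
invented for this example; ``eventfd()``, ``EFD_CLOEXEC`` and the ``fcntl()``
commands are the existing interfaces.

```c
#include <fcntl.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Two-step version: between eventfd() and fcntl() the descriptor exists
 * without close-on-exec, so a fork()+execve() in another thread during
 * that window would leak it into the new program.
 */
int make_eventfd_two_step(void)
{
	int fd = eventfd(0, 0);

	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Atomic version: the flag is applied as the descriptor is created. */
int make_eventfd_atomic(void)
{
	return eventfd(0, EFD_CLOEXEC);
}

/* Return 1 if fd has close-on-exec set, 0 otherwise (or on error). */
int fd_is_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	return fdflags >= 0 && (fdflags & FD_CLOEXEC);
}
```

Both helpers end up with the same descriptor state; only the atomic one never
exposes a descriptor that could be inherited across ``execve()``.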

If your system call returns a new file descriptor, you should also consider
what it means to use the :manpage:`poll(2)` family of system calls on that file
descriptor. Making a file descriptor ready for reading or writing is the
normal way for the kernel to indicate to userspace that an event has
occurred on the corresponding kernel object.
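From the userspace side, that readiness convention looks roughly like the
sketch below, which again borrows :manpage:`eventfd(2)` as an existing
descriptor type whose readability signals an event.  The helper names
``fd_readable_now()`` and ``signal_event()`` are assumptions of this example.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Return 1 if fd is readable right now, 0 if it is not, -1 on error. */
int fd_readable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, 0);	/* timeout 0: check without blocking */

	if (n < 0)
		return -1;
	return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Raise the event by adding 1 to the eventfd counter; 0 on success. */
int signal_event(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

A freshly created eventfd with a zero counter is not readable; once
``signal_event()`` has run, ``fd_readable_now()`` reports it as ready, which is
the behaviour a new descriptor type would need to define for its own events.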

If your new :manpage:`xyzzy(2)` system call involves a filename argument::

    int sys_xyzzy(const char __user *path, ..., unsigned int flags);

you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::

    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);

This allows more flexibility for how userspace specifies the file in question;
in particular it allows userspace to request the functionality for an
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
giving an :manpage:`fxyzzy(3)` operation for free::

 - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...)
 - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)

(For more details on the rationale of the \*at() calls, see the
:manpage:`openat(2)` man page; for an example of ``AT_EMPTY_PATH``, see the
:manpage:`fstatat(2)` man page.)
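The two equivalences can be tried out with :manpage:`fstatat(2)`, which already
follows this pattern.  The wrapper names in this sketch are invented for
illustration.  glibc only exposes ``AT_EMPTY_PATH`` when ``_GNU_SOURCE`` is
defined, so the fallback definition (using the Linux UAPI value) is a defensive
assumption for builds where the feature macro cannot take effect.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for AT_EMPTY_PATH in <fcntl.h> */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* Linux UAPI value, used only as a fallback */
#endif

/* Path-based form: AT_FDCWD makes this behave like stat(path, st). */
int stat_by_path(const char *path, struct stat *st)
{
	return fstatat(AT_FDCWD, path, st, 0);
}

/* Descriptor-based form: empty path + AT_EMPTY_PATH acts like fstat(fd, st). */
int stat_by_fd(int fd, struct stat *st)
{
	return fstatat(fd, "", st, AT_EMPTY_PATH);
}
```

Both wrappers go through the same system call, which is the property that
makes a separate descriptor-based variant unnecessary.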

If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
supported even on 32-bit architectures.

If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
it needs to be governed by the appropriate Linux capability bit (checked with
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
page.  Choose an existing capability bit that governs related functionality,
but try to avoid combining lots of only vaguely related functions together
under the same bit, as this goes against capabilities' purpose of splitting
the power of root.  In particular, avoid adding new uses of the already
overly-general ``CAP_SYS_ADMIN`` capability.

If your new :manpage:`xyzzy(2)` system call manipulates a process other than
the calling process, it should be restricted (using a call to
``ptrace_may_access()``) so that only a calling process with the same
permissions as the target process, or with the necessary capabilities, can
manipulate the target process.

Finally, be aware that some non-x86 architectures have an easier time if
system call parameters that are explicitly 64-bit fall on odd-numbered
arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
registers.  (This concern does not apply if the arguments are part of a
structure that's passed in by pointer.)


Proposing the API
-----------------

To make new system calls easy to review, it's best to divide up the patchset
into separate chunks.  These should include at least the following items as
distinct commits (each of which is described further below):

 - The core implementation of the system call, together with prototypes,
   generic numbering, Kconfig changes and fallback stub implementation.
 - Wiring up of the new system call for one particular architecture, usually
   x86 (including all of x86_64, x86_32 and x32).
 - A demonstration of the use of the new system call in userspace via a
   selftest in ``tools/testing/selftests/``.
 - A draft man-page for the new system call, either as plain text in the
   cover letter, or as a patch to the (separate) man-pages repository.

New system call proposals, like any change to the kernel's API, should always
be cc'ed to linux-api@vger.kernel.org.


Generic System Call Implementation
----------------------------------

The main entry point for your new :manpage:`xyzzy(2)` system call will be called
``sys_xyzzy()``, but you add this entry point with the appropriate
``SYSCALL_DEFINEn()`` macro rather than explicitly.  The 'n' indicates the
number of arguments to the system call, and the macro takes the system call name
followed by the (type, name) pairs for the parameters as arguments.  Using
this macro allows metadata about the new system call to be made available for
other tools.

The new entry point also needs a corresponding function prototype, in
``include/linux/syscalls.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long sys_xyzzy(...);

Some architectures (e.g. x86) have their own architecture-specific syscall
tables, but several other architectures share a generic syscall table. Add your
new system call to the generic list by adding an entry to the list in
``include/uapi/asm-generic/unistd.h``::

    #define __NR_xyzzy 292
    __SYSCALL(__NR_xyzzy, sys_xyzzy)

Also update the ``__NR_syscalls`` count to reflect the additional system call, and
note that if multiple new system calls are added in the same merge window,
your new syscall number may get adjusted to resolve conflicts.

The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
system call, returning ``-ENOSYS``.  Add your new system call here too::

    COND_SYSCALL(xyzzy);

Your new kernel functionality, and the system call that controls it, should
normally be optional, so add a ``CONFIG`` option (typically to
``init/Kconfig``) for it. As usual for new ``CONFIG`` options:

 - Include a description of the new functionality and system call controlled
   by the option.
 - Make the option depend on ``EXPERT`` if it should be hidden from normal users.
 - Make any new source files implementing the function dependent on the ``CONFIG``
   option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
 - Double check that the kernel still builds with the new ``CONFIG`` option turned
   off.

To summarize, you need a commit that includes:

 - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
 - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
 - corresponding prototype in ``include/linux/syscalls.h``
 - generic table entry in ``include/uapi/asm-generic/unistd.h``
 - fallback stub in ``kernel/sys_ni.c``


x86 System Call Implementation
------------------------------

To wire up your new system call for x86 platforms, you need to update the
master syscall tables.  Assuming your new system call isn't special in some
way (see below), this involves a "common" entry (for x86_64 and x32) in
``arch/x86/entry/syscalls/syscall_64.tbl``::

    333   common   xyzzy     sys_xyzzy

and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy

Again, these numbers are liable to be changed if there are conflicts in the
relevant merge window.


Compatibility System Calls (Generic)
------------------------------------

For most system calls the same 64-bit implementation can be invoked even when
the userspace program is itself 32-bit; even if the system call's parameters
include an explicit pointer, this is handled transparently.

However, there are a couple of situations where a compatibility layer is
needed to cope with size differences between 32-bit and 64-bit.

The first is if the 64-bit kernel also supports 32-bit userspace programs, and
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
64-bit values.  In particular, this is needed whenever a system call argument
is:

 - a pointer to a pointer
 - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
 - a pointer to a varying sized integral type (``time_t``, ``off_t``,
   ``long``, ...)
 - a pointer to a struct containing a varying sized integral type.

The second situation that requires a compatibility layer is if one of the
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
architecture, for example ``loff_t`` or ``__u64``.  In this case, a value that
arrives at a 64-bit kernel from a 32-bit application will be split into two
32-bit values, which then need to be re-assembled in the compatibility layer.

(Note that a system call argument that's a pointer to an explicit 64-bit type
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)
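The re-assembly step for a by-value 64-bit argument is plain bit arithmetic.
The sketch below shows it in isolation, assuming the low half and the high half
arrive as two separate 32-bit values; the function name is invented, and the
order in which the halves arrive is ABI-specific, so it is not modelled here.

```c
#include <stdint.h>

/*
 * Rebuild a 64-bit offset from the two 32-bit halves that a 32-bit caller
 * passed in separate registers.  The high half is widened before the shift
 * so that no bits are lost.
 */
int64_t xyzzy_offset_from_halves(uint32_t lo, uint32_t hi)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

A compat entry point would perform this join and then hand the resulting
64-bit value on to the same code as the native entry point.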

The compatibility version of the system call is called ``compat_sys_xyzzy()``,
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
``SYSCALL_DEFINEn()``.  This version of the implementation runs as part of a 64-bit
kernel, but expects to receive 32-bit parameter values and does whatever is
needed to deal with them.  (Typically, the ``compat_sys_`` version converts the
values to 64-bit versions and either calls on to the ``sys_`` version, or both of
them call a common inner implementation function.)

The compat entry point also needs a corresponding function prototype, in
``include/linux/compat.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long compat_sys_xyzzy(...);

If the system call involves a structure that is laid out differently on 32-bit
and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h``
header file should also include a compat version of the structure (``struct
compat_xyzzy_args``) where each variable-size field has the appropriate
``compat_`` type that corresponds to the type in ``struct xyzzy_args``.  The
``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
parse the arguments from a 32-bit invocation.

For example, if there are fields::

    struct xyzzy_args {
        const char __user *ptr;
        __kernel_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };
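To make the layout difference concrete, the following userspace model converts
the compat structure into the native one.  The fixed-width typedefs are
stand-ins chosen for this sketch (they mimic what ``compat_uptr_t`` and
``compat_long_t`` hold for a 32-bit caller), ``struct xyzzy_args_native``
models the 64-bit layout with an integer in place of the ``__user`` pointer,
and ``xyzzy_args_from_compat()`` is a hypothetical helper.  In the kernel, the
pointer widening would go through ``compat_ptr()``.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's compat types (assumed widths). */
typedef uint32_t compat_uptr_t;
typedef int32_t compat_long_t;

/* Models struct xyzzy_args as a 64-bit kernel sees it. */
struct xyzzy_args_native {
	uint64_t ptr;		/* stands in for const char __user * */
	int64_t varying_val;	/* stands in for __kernel_long_t */
	uint64_t fixed_val;
};

/* Layout written by a 32-bit caller. */
struct compat_xyzzy_args {
	compat_uptr_t ptr;
	compat_long_t varying_val;
	uint64_t fixed_val;
};

/* Widen each variable-size field; fixed-size fields are copied as-is. */
void xyzzy_args_from_compat(struct xyzzy_args_native *dst,
			    const struct compat_xyzzy_args *src)
{
	dst->ptr = src->ptr;			/* zero-extends the pointer */
	dst->varying_val = src->varying_val;	/* sign-extends the long */
	dst->fixed_val = src->fixed_val;
}
```

After the conversion, the rest of the implementation only has to deal with the
native layout.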

The generic system call list also needs adjusting to allow for the compat
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
``__SC_COMP`` rather than ``__SYSCALL``::

    #define __NR_xyzzy 292
    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)

To summarize, you need:

 - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
 - corresponding prototype in ``include/linux/compat.h``
 - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
 - instance of ``__SC_COMP`` not ``__SYSCALL`` in
   ``include/uapi/asm-generic/unistd.h``


Compatibility System Calls (x86)
--------------------------------

To wire up a system call with a compatibility version for the x86 architecture,
the entries in the syscall tables need to be adjusted.

First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
column to indicate that a 32-bit userspace program running on a 64-bit kernel
should hit the compat entry point::

    380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy

Second, you need to figure out what should happen for the x32 ABI version of
the new system call.  There's a choice here: the layout of the arguments
should either match the 64-bit version or the 32-bit version.

If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
ILP32, so the layout should match the 32-bit version, and the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
the compatibility wrapper::

    333   64       xyzzy     sys_xyzzy
    ...
    555   x32      xyzzy     __x32_compat_sys_xyzzy

If no pointers are involved, then it is preferable to re-use the 64-bit system
call for the x32 ABI (and consequently the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).

In either case, you should check that the types involved in your argument
layout do indeed map exactly from x32 (``-mx32``) to either the 32-bit (``-m32``) or
64-bit (``-m64``) equivalents.


System Calls Returning Elsewhere
--------------------------------

For most system calls, once the system call is complete the user program
continues exactly where it left off -- at the next instruction, with the
stack the same and most of the registers the same as before the system call,
and with the same virtual memory space.

However, a few system calls do things differently.  They might return to a
different location (``rt_sigreturn``) or change the memory space
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
of the program.

To allow for this, the kernel implementation of the system call may need to
save and restore additional registers to the kernel stack, allowing complete
control of where and how execution continues after the system call.

This is arch-specific, but typically involves defining assembly entry points
that save/restore additional registers and invoke the real system call entry
point.

For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::

    333   common   xyzzy     stub_xyzzy

The equivalent for 32-bit programs running on a 64-bit kernel is normally
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
with the corresponding syscall table adjustment in
``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy    stub32_xyzzy

If the system call needs a compatibility layer (as in the previous section)
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
of the system call rather than the native 64-bit version.  Also, if the x32 ABI
implementation is not common with the x86_64 version, then its syscall
table will also need to invoke a stub that calls on to the ``compat_sys_``
version.

For completeness, it's also nice to set up a mapping so that user-mode Linux
still works -- its syscall table will reference ``stub_xyzzy``, but the UML build
doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because UML
simulates registers etc).  Fixing this is as simple as adding a ``#define`` to
``arch/x86/um/sys_call_table_64.c``::

    #define stub_xyzzy sys_xyzzy


Other Details
-------------

Most of the kernel treats system calls in a generic way, but there is the
occasional exception that may need updating for your particular system call.

The audit subsystem is one such special case; it includes (arch-specific)
functions that classify some special types of system call -- specifically
file open (``open``/``openat``), program execution (``execve``/``execveat``) or
socket multiplexor (``socketcall``) operations. If your new system call is
analogous to one of these, then the audit system should be updated.

More generally, if there is an existing system call that is analogous to your
new system call, it's worth doing a kernel-wide grep for the existing system
call to check there are no other special cases.


Testing
-------

A new system call should obviously be tested; it is also useful to provide
reviewers with a demonstration of how user space programs will use the system
call.  A good way to combine these aims is to include a simple self-test
program in a new directory under ``tools/testing/selftests/``.

For a new system call, there will obviously be no libc wrapper function and so
the test will need to invoke it using ``syscall()``; also, if the system call
involves a new userspace-visible structure, the corresponding header will need
to be installed to compile the test.
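A minimal sketch of such a raw invocation is shown below.  Because ``xyzzy``
has no system call number, the usage relies on the existing ``SYS_getpid`` as a
stand-in; a real selftest would pass its own ``__NR_xyzzy`` value together with
the call's arguments.  The wrapper name ``raw_syscall0()`` and its convention
of returning ``-errno`` on failure are assumptions of this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* makes syscall() visible in <unistd.h> */
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke a system call that takes no arguments, by number.  syscall()
 * reports failure as -1 with errno set; this wrapper folds that into a
 * single negative-errno return value so callers can test one number.
 */
long raw_syscall0(long nr)
{
	long ret = syscall(nr);

	return ret == -1 ? -errno : ret;
}
```

For instance, ``raw_syscall0(SYS_getpid)`` returns the same value as
``getpid()``, without going through the libc wrapper for that call.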

Make sure the selftest runs successfully on all supported architectures.  For
example, check that it works when compiled as an x86_64 (``-m64``), x86_32 (``-m32``)
and x32 (``-mx32``) ABI program.

For more extensive and thorough testing of new functionality, you should also
consider adding tests to the Linux Test Project, or to the xfstests project
for filesystem-related changes.

 - https://linux-test-project.github.io/
 - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


Man Page
--------

All new system calls should come with a complete man page, ideally using groff
markup, but plain text will do.  If groff is used, it's helpful to include a
pre-rendered ASCII version of the man page in the cover email for the
patchset, for the convenience of reviewers.

The man page should be cc'ed to linux-man@vger.kernel.org.
For more details, see https://www.kernel.org/doc/man-pages/patches.html
492*4882a593Smuzhiyun
493*4882a593Smuzhiyun
494*4882a593SmuzhiyunDo not call System Calls in the Kernel
495*4882a593Smuzhiyun--------------------------------------
496*4882a593Smuzhiyun
497*4882a593SmuzhiyunSystem calls are, as stated above, interaction points between userspace and
498*4882a593Smuzhiyunthe kernel.  Therefore, system call functions such as ``sys_xyzzy()`` or
499*4882a593Smuzhiyun``compat_sys_xyzzy()`` should only be called from userspace via the syscall
500*4882a593Smuzhiyuntable, but not from elsewhere in the kernel.  If the syscall functionality is
501*4882a593Smuzhiyunuseful to be used within the kernel, needs to be shared between an old and a
502*4882a593Smuzhiyunnew syscall, or needs to be shared between a syscall and its compatibility
503*4882a593Smuzhiyunvariant, it should be implemented by means of a "helper" function (such as
504*4882a593Smuzhiyun``kern_xyzzy()``).  This kernel function may then be called within the
505*4882a593Smuzhiyunsyscall stub (``sys_xyzzy()``), the compatibility syscall stub
506*4882a593Smuzhiyun(``compat_sys_xyzzy()``), and/or other kernel code.
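
The helper pattern described above can be modelled in plain C as follows.
This is a userspace sketch, not kernel code: in the kernel the two stubs would
be declared with ``SYSCALL_DEFINEn()`` and ``COMPAT_SYSCALL_DEFINEn()``, and
the argument types, the doubling logic and ``some_other_kernel_code()`` are
purely illustrative assumptions.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Shared helper holding the actual logic.  Code that needs the
 * functionality calls this directly instead of calling sys_xyzzy().
 * The behaviour (doubling a value, rejecting unknown flags) is made up
 * for illustration.
 */
static long kern_xyzzy(uint64_t value, unsigned int flags)
{
	if (flags != 0)
		return -EINVAL;
	return (long)(value * 2);
}

/* Native stub: does nothing but forward its arguments to the helper. */
static long sys_xyzzy(uint64_t value, unsigned int flags)
{
	return kern_xyzzy(value, flags);
}

/*
 * Compatibility stub: in this model the 32-bit ABI passes the 64-bit value
 * as two halves, which are reassembled before calling the same helper.
 */
static long compat_sys_xyzzy(uint32_t value_lo, uint32_t value_hi,
			     unsigned int flags)
{
	uint64_t value = ((uint64_t)value_hi << 32) | value_lo;

	return kern_xyzzy(value, flags);
}

/* Any other in-kernel user also goes through the helper, never the stubs. */
static long some_other_kernel_code(void)
{
	return kern_xyzzy(21, 0);
}
```

Keeping the stubs this thin means the native and compatibility entry points
cannot drift apart, since both share one implementation.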

At least on 64-bit x86, it is a hard requirement from v4.17 onwards not to
call system call functions in the kernel.  That architecture uses a different
calling convention for system calls, where ``struct pt_regs`` is decoded
on-the-fly in a syscall wrapper which then hands processing over to the actual
syscall function.  This means that only those parameters which are actually
needed for a specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the time
(which may cause serious trouble down the call chain).
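
To illustrate the wrapper-based calling convention, the following userspace
model decodes a register block and forwards only the arguments that a
particular call needs.  ``struct fake_pt_regs`` and all function names here
are hypothetical stand-ins; the real x86-64 wrappers are generated by the
``SYSCALL_DEFINEn()`` macros and operate on ``struct pt_regs``.  The register
order (di, si, dx, r10, r8, r9) follows the x86-64 system call ABI.

```c
#include <errno.h>

/* Hypothetical stand-in for the six argument registers saved at entry. */
struct fake_pt_regs {
	unsigned long di, si, dx, r10, r8, r9;
};

/* The actual implementation takes exactly the two arguments it needs. */
static long do_xyzzy(unsigned long a, unsigned long b)
{
	return (long)(a + b);
}

/*
 * Wrapper invoked from the table.  It receives only a pointer to the saved
 * registers and decodes the first two; whatever userspace left in dx, r10,
 * r8 and r9 never reaches do_xyzzy().
 */
static long wrapped_sys_xyzzy(const struct fake_pt_regs *regs)
{
	return do_xyzzy(regs->di, regs->si);
}

/* Every table entry shares one signature, so dispatch is uniform. */
typedef long (*fake_sys_call_ptr_t)(const struct fake_pt_regs *);

static const fake_sys_call_ptr_t fake_sys_call_table[] = {
	wrapped_sys_xyzzy,
};

/* Look up the call by number and hand it the register block. */
static long dispatch(unsigned long nr, const struct fake_pt_regs *regs)
{
	if (nr >= sizeof(fake_sys_call_table) / sizeof(fake_sys_call_table[0]))
		return -ENOSYS;
	return fake_sys_call_table[nr](regs);
}
```

Because the wrapper expects a register block rather than plain arguments,
other code cannot simply call it with ordinary C parameters, which is one way
to see why the shared logic belongs in a helper function.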

Moreover, rules on how data may be accessed may differ between kernel data and
user data.  This is another reason why calling ``sys_xyzzy()`` is generally a
bad idea.

Exceptions to this rule are only allowed in architecture-specific overrides,
architecture-specific compatibility wrappers, or other code in arch/.


References and Sources
----------------------

 - LWN article from Michael Kerrisk on use of flags argument in system calls:
   https://lwn.net/Articles/585415/
 - LWN article from Michael Kerrisk on how to handle unknown flags in a system
   call: https://lwn.net/Articles/588444/
 - LWN article from Jake Edge describing constraints on 64-bit system call
   arguments: https://lwn.net/Articles/311630/
 - Pair of LWN articles from David Drysdale that describe the system call
   implementation paths in detail for v3.14:

    - https://lwn.net/Articles/604287/
    - https://lwn.net/Articles/604515/

 - Architecture-specific requirements for system calls are discussed in the
   :manpage:`syscall(2)` man-page:
   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
 - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
   https://yarchive.net/comp/linux/ioctl.html
 - "How to not invent kernel interfaces", Arnd Bergmann,
   https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
 - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
   https://lwn.net/Articles/486306/
 - Recommendation from Andrew Morton that all related information for a new
   system call should come in the same email thread:
   https://lkml.org/lkml/2014/7/24/641
 - Recommendation from Michael Kerrisk that a new system call should come with
   a man page: https://lkml.org/lkml/2014/6/13/309
 - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
   commit: https://lkml.org/lkml/2014/11/19/254
 - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
   come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710
 - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
   https://lkml.org/lkml/2014/6/3/411
 - Suggestion from Ingo Molnar that system calls that involve multiple
   arguments should encapsulate those arguments in a struct, which includes a
   size field for future extensibility: https://lkml.org/lkml/2015/7/30/117
 - Numbering oddities arising from (re-)use of O_* numbering space flags:

    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
      check")
    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
      conflict")
    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")

 - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
   https://lkml.org/lkml/2008/12/12/187
 - Recommendation from Greg Kroah-Hartman that unknown flags should be
   policed: https://lkml.org/lkml/2014/7/17/577
 - Recommendation from Linus Torvalds that x32 system calls should prefer
   compatibility with 64-bit versions rather than 32-bit versions:
   https://lkml.org/lkml/2011/8/31/244