unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.
 
Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
	1) Overview
	2) Benefits
	3) Cost
	4) Requirements
	5) Functional Specification
	6) High Level Design
	7) Low Level Design
	8) Test Specification
	9) Future Work
 
1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.
 
2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare() can be used.
 
2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instances of /tmp, /var/tmp or
a per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile. However, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
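 
The login-time setup described above could be sketched as follows. This is
only an illustration under stated assumptions: the caller has CAP_SYS_ADMIN,
the helper name polyinstantiate_dir and the instance directory passed to it
are hypothetical, and marking the mount tree private before the bind mount
is an assumed propagation policy rather than something this document
prescribes. Only unshare(2) and mount(2) are real interfaces here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: give the calling process a private mount namespace
 * and bind-mount instance_dir over target (for example, over /tmp).
 * Returns 0 on success or a negative errno value on failure.
 */
int polyinstantiate_dir(const char *instance_dir, const char *target)
{
	assert(instance_dir != NULL && target != NULL);

	/* Detach from the mount namespace shared with the parent. */
	if (unshare(CLONE_NEWNS) == -1)
		return -errno;

	/* Assumed policy: keep mount events from propagating elsewhere. */
	if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
		return -errno;

	/* Only this process and its future children see the bind mount. */
	if (mount(instance_dir, target, NULL, MS_BIND, NULL) == -1)
		return -errno;

	return 0;
}
```

A PAM-style module might call such a helper once per session, before
starting the user's shell, so that every process in the session inherits
the private view of the directory.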
 
2.2 Unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.
 
3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.
 
4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have an interface similar to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as needed basis.
 
5) Functional Specification
---------------------------

NAME
	unshare - disassociate parts of the process execution context

SYNOPSIS
	#include <sched.h>

	int unshare(int flags);

DESCRIPTION
	unshare() allows a process to disassociate parts of its execution
	context that are currently being shared with other processes. Part
	of the execution context, such as the namespace, is shared by default
	when a new process is created using fork(2), while other parts,
	such as the virtual memory, open file descriptors, etc., may be
	shared by explicit request to share them when creating a process
	using clone(2).

	The main use of unshare() is to allow a process to control its
	shared execution context without creating a new process.

	The flags argument specifies one or more of the following
	constants, bitwise-or'ed together.
 
	CLONE_FS
		If CLONE_FS is set, file system information of the caller
		is disassociated from the shared file system information.

	CLONE_FILES
		If CLONE_FILES is set, the file descriptor table of the
		caller is disassociated from the shared file descriptor
		table.

	CLONE_NEWNS
		If CLONE_NEWNS is set, the namespace of the caller is
		disassociated from the shared namespace.

	CLONE_VM
		If CLONE_VM is set, the virtual memory of the caller is
		disassociated from the shared virtual memory.
 
RETURN VALUE
	On success, zero is returned. On failure, -1 is returned and errno is
	set to indicate the error.

ERRORS
	EPERM	CLONE_NEWNS was specified by a non-root process (process
		without CAP_SYS_ADMIN).

	ENOMEM	Cannot allocate sufficient memory to copy parts of caller's
		context that need to be unshared.

	EINVAL	Invalid flag was specified as an argument.
 
CONFORMING TO
	The unshare() call is Linux-specific and should not be used
	in programs intended to be portable.

SEE ALSO
	clone(2), fork(2)
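 
A minimal usage sketch of the interface specified above, assuming a C
library such as glibc that declares unshare() in <sched.h> only when
_GNU_SOURCE is defined. The wrapper name try_unshare is hypothetical and is
not part of any library; it only folds the -1/errno convention into a
single return value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical wrapper: call unshare(2) and return 0 on success or a
 * negative errno value on failure.
 */
int try_unshare(int flags)
{
	int rc = unshare(flags);

	if (rc == -1)
		return -errno;

	assert(rc == 0);	/* unshare() returns only 0 or -1 */
	return 0;
}
```

For example, try_unshare(CLONE_FS) asks for private file system
information, so that a later chdir() by the caller no longer affects tasks
that used to share it, while try_unshare(CLONE_NEWNS) is expected to yield
-EPERM for a process without CAP_SYS_ADMIN.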
 
6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates the newly duplicated structures
with the current task structure and releases the corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.
 
Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call, on
the other hand, performs the following:

  1) Check flags to force missing, but implied, flags.

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
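 
The ordering above (duplicate everything first, commit afterwards, release
last) can be illustrated with a small user-space model. This is a sketch,
not kernel code: struct ctx, struct task, dup_ctx, put_ctx, unshare_model
and the dups_before_failure failure-injection hook are all hypothetical
names, and the comment about task_lock() only marks where the kernel would
hold the lock.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Toy reference-counted context; all names in this sketch are hypothetical. */
struct ctx {
	int users;
	int data;
};

struct task {
	struct ctx *fs;
	struct ctx *files;
};

/* Failure injection: duplications allowed before one fails (-1: never). */
int dups_before_failure = -1;

static struct ctx *dup_ctx(const struct ctx *old)
{
	struct ctx *c;

	if (dups_before_failure == 0)
		return NULL;		/* simulated allocation failure */
	if (dups_before_failure > 0)
		dups_before_failure--;

	c = malloc(sizeof(*c));
	if (c != NULL) {
		c->users = 1;
		c->data = old->data;
	}
	return c;
}

/* Drop one reference; free the context once nobody uses it. */
static void put_ctx(struct ctx *c)
{
	if (c != NULL && --c->users == 0)
		free(c);
}

int unshare_model(struct task *t, int flags)
{
	struct ctx *new_fs = NULL, *new_files = NULL, *old;

	assert(t != NULL && t->fs != NULL && t->files != NULL);

	/* Phase 1: duplicate everything requested; the task is untouched. */
	if ((flags & CLONE_FS) && (new_fs = dup_ctx(t->fs)) == NULL)
		return -ENOMEM;
	if ((flags & CLONE_FILES) && (new_files = dup_ctx(t->files)) == NULL) {
		put_ctx(new_fs);	/* back out: only private copies go away */
		return -ENOMEM;
	}

	/* Phase 2: commit (the kernel holds task_lock() here), then release. */
	if (new_fs != NULL) {
		old = t->fs;
		t->fs = new_fs;
		put_ctx(old);
	}
	if (new_files != NULL) {
		old = t->files;
		t->files = new_files;
		put_ctx(old);
	}
	return 0;
}
```

Because no shared structure is released until every duplication has
succeeded, the error path never needs to return to a structure that might
already be gone.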
 
7) Low Level Design
-------------------

Implementation of unshare() can be grouped into the following four
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures
 
7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
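 
The split described above can be illustrated with a toy user-space model.
The structures and the names dup_fs_ctx and copy_fs_ctx are hypothetical
stand-ins that follow the dup_*/copy_* naming pattern; they are not the
kernel's actual functions or signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for kernel structures; every name here is hypothetical. */
struct fs_ctx {
	int users;		/* number of tasks sharing this context */
	char cwd[64];
};

struct task {
	struct fs_ctx *fs;
};

/* First component: allocate and duplicate, without touching any task. */
struct fs_ctx *dup_fs_ctx(const struct fs_ctx *old)
{
	struct fs_ctx *new_fs = malloc(sizeof(*new_fs));

	if (new_fs == NULL)
		return NULL;
	memcpy(new_fs->cwd, old->cwd, sizeof(new_fs->cwd));
	new_fs->users = 1;
	return new_fs;
}

/* Second component: link a shared or duplicated context into child. */
int copy_fs_ctx(int share, const struct task *parent, struct task *child)
{
	assert(parent->fs != NULL);

	if (share) {
		parent->fs->users++;
		child->fs = parent->fs;
		return 0;
	}

	child->fs = dup_fs_ctx(parent->fs);
	return child->fs != NULL ? 0 : -ENOMEM;
}
```

Keeping the duplication step free of any task pointer is what lets a
second caller, operating on an already running task, reuse it.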
 
7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       * Check flags and force implied flags. If CLONE_THREAD is set,
         force CLONE_VM. If CLONE_VM is set, force CLONE_SIGHAND. If
         CLONE_SIGHAND is set and signals are also being shared, force
         CLONE_THREAD. If CLONE_NEWNS is set, force CLONE_FS.

       * For each context flag, invoke the corresponding unshare_*
         helper routine with the flags passed into the system call and
         a reference to the pointer that will point to the new
         unshared structure.

       * If any new structures are created by unshare_* helper
         functions, take the task_lock() on the current task,
         modify appropriate context pointers, and release the
         task lock.

       * For all newly unshared structures, release the corresponding
         older, shared, structures.
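 
The flag-forcing rules in the first step can be written down directly. The
following is a sketch of those rules only, in the order the text gives
them: the function name force_implied_flags is hypothetical, and
signals_shared is an assumed stand-in for the kernel's own check of whether
the caller currently shares signal state.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: return flags with every implied flag added.
 * signals_shared is nonzero when the caller shares its signal state.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
	const unsigned long requested = flags;

	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;

	/* The rules only add bits; no requested flag is ever cleared. */
	assert((flags & requested) == requested);
	return flags;
}
```

With these rules, a request for CLONE_NEWNS alone becomes
CLONE_NEWNS | CLONE_FS, and a request for CLONE_THREAD alone becomes
CLONE_THREAD | CLONE_VM | CLONE_SIGHAND.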
 
7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
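 
A user-space sketch of the two kinds of helper described above, assuming
the design as stated in this document. The structure and the names
dup_files_ctx, unshare_sighand_model and unshare_files_model are
hypothetical; only the CLONE_* flags and the errno values are real.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Toy context; the structure and function names are hypothetical. */
struct files_ctx {
	int users;
	int max_fds;
};

static struct files_ctx *dup_files_ctx(const struct files_ctx *old)
{
	struct files_ctx *new_files = malloc(sizeof(*new_files));

	if (new_files != NULL) {
		new_files->users = 1;
		new_files->max_fds = old->max_fds;
	}
	return new_files;
}

/* Helper for a context whose unsharing is described as unimplemented. */
int unshare_sighand_model(int flags)
{
	return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * On success *new_filesp is either NULL (nothing to unshare) or a private
 * duplicate that the caller still has to install in the task.
 */
int unshare_files_model(int flags, const struct files_ctx *cur,
			struct files_ctx **new_filesp)
{
	assert(cur != NULL && new_filesp != NULL);

	*new_filesp = NULL;
	if (!(flags & CLONE_FILES))
		return 0;

	*new_filesp = dup_files_ctx(cur);
	return *new_filesp != NULL ? 0 : -ENOMEM;
}
```

Returning the duplicate through a pointer argument, rather than installing
it, leaves the decision to commit with the system call service function.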
 
7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.
 
8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple call _exit and the rest unshare with different
     combinations of flags. Verify that unsharing is performed as
     expected and that there are no oops or hangs.
 
9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().