unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
        1) Overview
        2) Benefits
        3) Cost
        4) Requirements
        5) Functional Specification
        6) High Level Design
        7) Low Level Design
        8) Test Specification
        9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, does not make a distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel.
The power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile; however, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.

2.2 Unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request.
For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have an interface similar to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in future without an ABI change.
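
The relationship between the two flag arguments can be modelled as plain
bit operations. The following is a toy model only, not kernel code:
shared_after_clone(), shared_after_unshare() and SHAREABLE are
hypothetical names, and the model deliberately covers only CLONE_VM,
CLONE_FS and CLONE_FILES, where a set bit in clone(2) means "share".
The CLONE_* values are taken from <linux/sched.h>.

```c
#include <linux/sched.h>   /* CLONE_VM, CLONE_FS, CLONE_FILES */

/* Resources covered by this toy model (an assumption of the sketch). */
#define SHAREABLE ((unsigned long)(CLONE_VM | CLONE_FS | CLONE_FILES))

/* clone(flags): a set bit means the resource IS shared with the child. */
unsigned long shared_after_clone(unsigned long clone_flags)
{
    return clone_flags & SHAREABLE;
}

/* unshare(flags): a set bit means the resource STOPS being shared. */
unsigned long shared_after_unshare(unsigned long shared,
                                   unsigned long unshare_flags)
{
    return shared & ~unshare_flags;
}
```

In this model the same constant, for example CLONE_FS, adds a resource
to the shared set when passed to clone(2) and removes it again when
passed to unshare().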

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as needed basis.

5) Functional Specification
---------------------------

NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare() allows a process to disassociate parts of its execution
        context that are currently being shared with other processes. Part
        of execution context, such as the namespace, is shared by default
        when a new process is created using fork(2), while other parts,
        such as the virtual memory, open file descriptors, etc, may be
        shared by explicit request to share them when creating a process
        using clone(2).

        The main use of unshare() is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one of the following constants, or
        the bitwise OR of several of them.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned. On failure, -1 is returned and errno is
        set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  An invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.
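
EXAMPLE
        A minimal sketch of calling the interface described above and
        mapping its return convention to a single value. try_unshare() is
        a hypothetical helper name. The sketch issues the raw system call
        through syscall(2) and takes the CLONE_* values from
        <linux/sched.h> so that it does not depend on feature-test macros;
        with glibc, the unshare() wrapper declared in <sched.h> (with
        _GNU_SOURCE defined) could be called instead.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_FS, CLONE_FILES, CLONE_NEWNS, CLONE_VM */
#include <sys/syscall.h>   /* SYS_unshare */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: invoke unshare with the given flags.
 * Returns 0 on success, or a negative errno value (for example
 * -EPERM, -ENOMEM or -EINVAL) on failure.
 */
int try_unshare(unsigned long flags)
{
    if (syscall(SYS_unshare, flags) == -1)
        return -errno;
    return 0;
}
```

        A caller might use try_unshare(CLONE_FILES) to obtain a private
        file descriptor table before closing descriptors that other
        tasks still need. Sandboxed environments may reject the call
        regardless of the flags, so the result should always be checked.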

SEE ALSO
        clone(2), fork(2)

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call,
on the other hand, performs the following:

  1) Check flags to force missing, but implied, flags.

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
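
The ordering of steps 2 to 4 can be illustrated outside the kernel. The
following is a user-space sketch only, not the kernel implementation:
toy_ctx, toy_task, toy_dup(), toy_put(), toy_unshare() and the TOY_*
flags are hypothetical names invented for illustration, and the
fail_files parameter exists only to simulate an allocation failure.
The point it shows is that every new structure is duplicated before
the task is modified, so an error never requires going back to a
structure that has already been released.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted context structure. */
struct toy_ctx {
    int refcount;   /* number of tasks sharing this structure */
    int value;      /* placeholder payload */
};

/* Hypothetical stand-in for the task structure. */
struct toy_task {
    struct toy_ctx *fs;
    struct toy_ctx *files;
};

#define TOY_FS    0x1u
#define TOY_FILES 0x2u

/* Allocate a private copy; NULL on (real or injected) failure. */
struct toy_ctx *toy_dup(const struct toy_ctx *old, int inject_failure)
{
    struct toy_ctx *copy = inject_failure ? NULL : malloc(sizeof(*copy));

    if (!copy)
        return NULL;
    copy->refcount = 1;
    copy->value = old->value;
    return copy;
}

/* Drop one reference and free the structure on the last one. */
void toy_put(struct toy_ctx *ctx)
{
    if (!ctx)
        return;
    assert(ctx->refcount > 0);
    if (--ctx->refcount == 0)
        free(ctx);
}

int toy_unshare(struct toy_task *task, unsigned int flags, int fail_files)
{
    struct toy_ctx *new_fs = NULL, *new_files = NULL;
    struct toy_ctx *old_fs = NULL, *old_files = NULL;

    /* Step 2: duplicate everything first; the task is not touched. */
    if (flags & TOY_FS) {
        new_fs = toy_dup(task->fs, 0);
        if (!new_fs)
            return -ENOMEM;
    }
    if (flags & TOY_FILES) {
        new_files = toy_dup(task->files, fail_files);
        if (!new_files) {
            toy_put(new_fs);    /* discard the unused copy, if any */
            return -ENOMEM;
        }
    }

    /* Step 3: commit. The kernel would hold task_lock() here. */
    if (new_fs) {
        old_fs = task->fs;
        task->fs = new_fs;
    }
    if (new_files) {
        old_files = task->files;
        task->files = new_files;
    }

    /* Step 4: release references to the older, shared structures. */
    toy_put(old_fs);
    toy_put(old_files);
    return 0;
}
```

When the injected failure triggers, the task still points at its
original shared structures and their reference counts are unchanged,
which is the property the two reasons above are concerned with.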

7) Low Level Design
-------------------

Implementation of unshare() can be grouped into the following four
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    * Check flags
      Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
      If CLONE_VM is set, force CLONE_SIGHAND.
      If CLONE_SIGHAND is set and signals are also being shared,
      force CLONE_THREAD. If CLONE_NEWNS is set, force CLONE_FS.

    * For each context flag, invoke the corresponding unshare_*
      helper routine with flags passed into the system call and a
      reference to a pointer that will point to the new unshared
      structure.

    * If any new structures are created by unshare_* helper
      functions, take the task_lock() on the current task,
      modify appropriate context pointers, and release the
      task lock.

    * For all newly unshared structures, release the corresponding
      older, shared, structures.

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.
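
The flag-forcing rules listed under "Check flags" in section 7.2 amount
to a few ordered bit tests. The following sketch applies them in the
order the text gives them. force_implied_flags() and its signals_shared
parameter are hypothetical names chosen for illustration and are not
taken from the kernel source; the CLONE_* values come from
<linux/sched.h>. The checks in a current kernel may differ from this
2006 description.

```c
#include <linux/sched.h>   /* CLONE_THREAD, CLONE_VM, CLONE_SIGHAND, ... */

/*
 * Return the flags with all implied flags forced on.
 * signals_shared is nonzero when the caller currently shares its
 * signal state with another task (an input assumed by this sketch).
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    /* Unsharing the thread group implies unsharing virtual memory. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* Unsharing virtual memory implies unsharing signal handlers. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* Unsharing handlers while signals are shared implies CLONE_THREAD. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* Unsharing the namespace implies unsharing filesystem information. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}
```

For example, a request for CLONE_NEWNS alone comes back with CLONE_FS
also set, which is the behaviour the test specification checks under
"Missing/implied flags".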

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers.
Signals are complex to begin with, and
unsharing signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().