.. SPDX-License-Identifier: GPL-2.0

=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================

                Andy Walker <andy@lysaker.kvaerner.no>

                           15 April 1996

                     (Updated September 2007)

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is prey to a number of difficult-to-fix race
conditions which in practice make it not dependable:

        - The write system call checks for a mandatory lock only once
          at its start.  It is therefore possible for a lock request to
          be granted after this check but before the data is modified.
          A process may then see file data change even while a mandatory
          lock was held.
        - Similarly, an exclusive lock may be granted on a file after
          the kernel has decided to proceed with a read, but before the
          read has actually completed, and the reading process may see
          the file data in a state which should not have been visible
          to it.
        - Similar races make the claimed mutual exclusion between lock
          and mmap similarly unreliable.

1. What is mandatory locking?
------------------------------

Mandatory locking is kernel enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine which is a wrapper around fcntl().) It is
normally a process' responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.

In a perfect world all processes would use and honour a cooperative, or
"advisory" locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.

In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file that another process holds a
"read" -or- "shared" lock on, and block attempts to both read and write to a
file that a process holds a "write" -or- "exclusive" lock on.

The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code.
The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.

.. Note::

   1. In saying "file" in the paragraphs above I am actually not telling
      the whole truth. System V locking is based on fcntl(). The granularity of
      fcntl() is such that it allows the locking of byte ranges in files, in
      addition to entire files, so the mandatory locking rules also have byte
      level granularity.

   2. POSIX.1 does not specify any scheme for mandatory locking, despite
      borrowing the fcntl() locking scheme from System V. The mandatory locking
      scheme is defined by the System V Interface Definition (SVID) Version 3.

2. Marking a file for mandatory locking
---------------------------------------

A file is marked as a candidate for mandatory locking by setting the group-id
bit in its file mode but removing the group-execute bit. This is an otherwise
meaningless combination, and was chosen by the System V implementors so as not
to break existing user programs.

Note that the group-id bit is usually automatically cleared by the kernel when
a setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit.
Similarly the kernel has been modified not
to run mandatory lock candidates with setgid privileges.

3. Available implementations
----------------------------

I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.

All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.

HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.

mmap() is another interesting case. All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.

In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.


SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.

I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.

4. Semantics
------------

1. Mandatory locks can only be applied via the fcntl()/lockf() locking
   interface - in other words the System V/POSIX interface. BSD style
   locks using flock() never result in a mandatory lock.

2. If a process has locked a region of a file with a mandatory read lock, then
   other processes are permitted to read from that region. If any of these
   processes attempts to write to the region it will block until the lock is
   released, unless the process has opened the file with the O_NONBLOCK
   flag in which case the system call will return immediately with the error
   status EAGAIN.

3. If a process has locked a region of a file with a mandatory write lock, all
   attempts to read or write to that region block until the lock is released,
   unless a process has opened the file with the O_NONBLOCK flag in which case
   the system call will return immediately with the error status EAGAIN.

4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
   any mandatory locks owned by other processes will be rejected with the
   error status EAGAIN.

5. Attempts to apply a mandatory lock to a file that is memory mapped and
   shared (via mmap() with MAP_SHARED) will be rejected with the error status
   EAGAIN.

6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
   that has any mandatory locks in effect will be rejected with the error status
   EAGAIN.

5. Which system calls are affected?
-----------------------------------

Those which modify a file's contents, not just the inode. That gives read(),
write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.

The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes.)

Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.
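
The read/write rules in section 4 and the "affected region" definition above
can be modelled in a few lines. What follows is only a sketch for reasoning
about the rules, not the kernel's code: the function names and the string
constants are invented for illustration, and it assumes non-negative lengths,
with a lock length of 0 meaning "to the end of the file" as in the fcntl()
interface.

```python
READ_LOCK = "read"    # shared lock (hypothetical constant for this sketch)
WRITE_LOCK = "write"  # exclusive lock (hypothetical constant for this sketch)


def ranges_overlap(lock_start, lock_len, io_start, io_len):
    """Return True if the I/O region touches the locked byte range.

    A lock_len of 0 follows the fcntl() convention of "from lock_start
    to the end of the file, however large it grows".
    """
    if io_len <= 0:
        return False                        # nothing transferred, nothing hit
    io_end = io_start + io_len              # exclusive end of the I/O region
    if lock_len == 0:
        return io_end > lock_start          # lock has no upper bound
    lock_end = lock_start + lock_len        # exclusive end of the lock
    return io_start < lock_end and lock_start < io_end


def io_conflicts(lock_type, lock_start, lock_len, op, pos, count):
    """Would a read or write by another process run into a mandatory lock?

    op is "read" or "write"; pos is the current file position and count
    the number of bytes transferred, as in section 5.
    """
    if not ranges_overlap(lock_start, lock_len, pos, count):
        return False
    if lock_type == WRITE_LOCK:
        return True                         # blocks readers and writers alike
    return op == "write"                    # a read lock only blocks writers
```

A conflicting call would block, or fail with EAGAIN if the file was opened
with O_NONBLOCK. Under this model a truncate is simply a "write" whose region
is the bytes removed or added: shrinking a file from 200 to 120 bytes would be
checked as ``io_conflicts(lock_type, lock_start, lock_len, "write", 120, 80)``.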

6. Warning!
-----------

Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(

7. The "mand" mount option
--------------------------

Mandatory locking is disabled on all filesystems by default, and must be
administratively enabled by mounting with "-o mand". That mount option
is only allowed if the mounting task has the CAP_SYS_ADMIN capability.

Since kernel v4.5, it is possible to disable mandatory locking
altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
with this disabled will reject attempts to mount filesystems with the
"mand" mount option with the error status EPERM.
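
Putting sections 2 and 7 together, a file is only subject to mandatory locking
when both its mode bits and its mount options say so. The sketch below spells
out those two checks. The helper names are made up for this example, and it
assumes the option string is a comma-separated list such as the options field
of a /proc/mounts entry.

```python
import stat


def is_mandatory_lock_candidate(mode):
    """Section 2: setgid bit set and group-execute bit clear."""
    return bool(mode & stat.S_ISGID) and not (mode & stat.S_IXGRP)


def mark_for_mandatory_locking(mode):
    """Return mode with the setgid bit added and group-execute removed."""
    return (mode | stat.S_ISGID) & ~stat.S_IXGRP


def mount_has_mand(options):
    """Section 7: look for "mand" in a comma-separated option string."""
    return "mand" in options.split(",")


def mandatory_locking_applies(mode, options):
    """Both the file mode and the mount options must allow it."""
    return is_mandatory_lock_candidate(mode) and mount_has_mand(options)
```

In practice the mode would come from something like ``os.stat(path).st_mode``,
and the value returned by ``mark_for_mandatory_locking()`` could be handed to
``os.chmod()``. Clearing the setgid bit again is the escape hatch described in
section 6.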