.. SPDX-License-Identifier: GPL-2.0

=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================

		Andy Walker <andy@lysaker.kvaerner.no>

			   15 April 1996

		     (Updated September 2007)

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is prey to a number of difficult-to-fix race
conditions which in practice make it undependable:

	- The write system call checks for a mandatory lock only once
	  at its start.  It is therefore possible for a lock request to
	  be granted after this check but before the data is modified.
	  A process may then see file data change even while a mandatory
	  lock was held.
	- Similarly, an exclusive lock may be granted on a file after
	  the kernel has decided to proceed with a read, but before the
	  read has actually completed, and the reading process may see
	  the file data in a state which should not have been visible
	  to it.
	- Similar races make the claimed mutual exclusion between lock
	  and mmap unreliable as well.

1. What is mandatory locking?
-----------------------------

Mandatory locking is kernel-enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine, which is a wrapper around fcntl()). It is
normally a process's responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.

In a perfect world all processes would use and honour a cooperative, or
"advisory" locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.

In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file that another process holds a
"read" -or- "shared" lock on, and block attempts to both read and write to a
file that a process holds a "write" -or- "exclusive" lock on.

The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.

.. Note::

   1. In saying "file" in the paragraphs above I am actually not telling
      the whole truth. System V locking is based on fcntl(). The granularity of
      fcntl() is such that it allows the locking of byte ranges in files, in
      addition to entire files, so the mandatory locking rules also have byte
      level granularity.

   2. POSIX.1 does not specify any scheme for mandatory locking, despite
      borrowing the fcntl() locking scheme from System V. The mandatory locking
      scheme is defined by the System V Interface Definition (SVID) Version 3.

2. Marking a file for mandatory locking
---------------------------------------

A file is marked as a candidate for mandatory locking by setting the group-id
bit in its file mode but removing the group-execute bit. This is an otherwise
meaningless combination, and was chosen by the System V implementors so as not
to break existing user programs.

Note that the group-id bit is usually automatically cleared by the kernel when
a setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit. Similarly the kernel has been modified not
to run mandatory lock candidates with setgid privileges.
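
The mode-bit rule above can be sketched in user-space C. This is an
illustration only, not code taken from the kernel: the helper names
(is_mandatory_lock_candidate(), mandatory_lock_mode() and
mark_for_mandatory_locking()) are hypothetical, and the only real interfaces
assumed are stat(), chmod() and the S_ISGID and S_IXGRP mode bits.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Nonzero when the setgid bit is set and the group-execute bit is clear. */
int is_mandatory_lock_candidate(mode_t mode)
{
	return (mode & (S_ISGID | S_IXGRP)) == S_ISGID;
}

/* Mode with setgid added and group-execute removed; other bits untouched. */
mode_t mandatory_lock_mode(mode_t mode)
{
	return (mode | S_ISGID) & ~(mode_t)S_IXGRP;
}

/*
 * Hypothetical helper: mark "path" as a mandatory lock candidate.
 * Returns 0 on success, -1 with errno set on failure.
 */
int mark_for_mandatory_locking(const char *path)
{
	struct stat st;

	if (stat(path, &st) == -1)
		return -1;
	/* Pass only the permission bits to chmod(), not the file type. */
	return chmod(path, mandatory_lock_mode(st.st_mode & 07777));
}
```

From a shell the same effect is usually obtained with "chmod g+s,g-x file".
Marking the file is only one of the preconditions; the filesystem must also
be mounted with the "mand" option described in section 7.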

3. Available implementations
----------------------------

I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.

All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.

HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.

mmap() is another interesting case. All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.

In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.

SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.

I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.

4. Semantics
------------

1. Mandatory locks can only be applied via the fcntl()/lockf() locking
   interface - in other words the System V/POSIX interface. BSD style
   locks using flock() never result in a mandatory lock.

2. If a process has locked a region of a file with a mandatory read lock, then
   other processes are permitted to read from that region. If any of these
   processes attempts to write to the region it will block until the lock is
   released, unless the process has opened the file with the O_NONBLOCK
   flag in which case the system call will return immediately with the error
   status EAGAIN.

3. If a process has locked a region of a file with a mandatory write lock, all
   attempts to read or write to that region block until the lock is released,
   unless a process has opened the file with the O_NONBLOCK flag in which case
   the system call will return immediately with the error status EAGAIN.

4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
   any mandatory locks owned by other processes will be rejected with the
   error status EAGAIN.

5. Attempts to apply a mandatory lock to a file that is memory mapped and
   shared (via mmap() with MAP_SHARED) will be rejected with the error status
   EAGAIN.

6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
   that has any mandatory locks in effect will be rejected with the error status
   EAGAIN.
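
As a rough illustration of rules 1 to 3, the sketch below takes a byte-range
write lock through fcntl() and reports whether a write() failed with EAGAIN.
The helper names are hypothetical. Nothing in the calls themselves asks for a
mandatory lock: the same fcntl() request is purely advisory unless the file
is a mandatory lock candidate on a filesystem mounted with "mand".

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to take a write (exclusive) lock on bytes [start, start + len) of
 * "fd" without waiting. A len of 0 means "up to the end of the file".
 * Returns 0 on success, -1 with errno set on failure.
 */
int lock_region_exclusive(int fd, off_t start, off_t len)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = start,
		.l_len = len,
	};

	return fcntl(fd, F_SETLK, &fl);
}

/* Release whatever lock this process holds on the same byte range. */
int unlock_region(int fd, off_t start, off_t len)
{
	struct flock fl = {
		.l_type = F_UNLCK,
		.l_whence = SEEK_SET,
		.l_start = start,
		.l_len = len,
	};

	return fcntl(fd, F_SETLK, &fl);
}

/*
 * Write to "fd" and record in *would_block whether the call failed with
 * EAGAIN. For a regular file opened with O_NONBLOCK, that is the error
 * described above for a write into a region locked by another process.
 */
ssize_t try_write(int fd, const void *buf, size_t len, int *would_block)
{
	ssize_t n = write(fd, buf, len);

	*would_block = (n == -1 && errno == EAGAIN);
	return n;
}
```

A process is never blocked by its own locks, so observing the EAGAIN case
requires a second process that writes to the locked range through a
descriptor opened with O_NONBLOCK.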

5. Which system calls are affected?
-----------------------------------

Those which read or modify a file's contents, not just the inode. That gives
read(), write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.

The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes.)
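
The region arithmetic described above can be written out as follows. This is
only a sketch of the definition, not the kernel's implementation: the function
names are hypothetical, a lock length of 0 is taken to mean "to the end of the
file" as in struct flock, and negative lock lengths are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Does an I/O of "count" bytes starting at file position "pos" touch a
 * lock covering [lock_start, lock_start + lock_len)? A lock_len of 0
 * stands for a lock that runs to the end of the file.
 */
bool io_overlaps_lock(off_t pos, size_t count, off_t lock_start, off_t lock_len)
{
	if (count == 0)
		return false;
	if (pos + (off_t)count <= lock_start)
		return false;	/* I/O ends before the lock begins */
	return lock_len == 0 || pos < lock_start + lock_len;
}

/*
 * Region affected by changing a file's size from old_size to new_size:
 * the bytes removed when shrinking, or the bytes added when growing.
 */
void truncate_region(off_t old_size, off_t new_size, off_t *start, off_t *len)
{
	if (new_size < old_size) {
		*start = new_size;
		*len = old_size - new_size;
	} else {
		*start = old_size;
		*len = new_size - old_size;
	}
}
```

For instance, growing a 40-byte file to 100 bytes affects bytes 40 to 99,
which conflicts with a whole-file lock even though no existing byte changes.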

Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.

6. Warning!
-----------

Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(

7. The "mand" mount option
--------------------------

Mandatory locking is disabled on all filesystems by default, and must be
administratively enabled by mounting with "-o mand". That mount option
is only allowed if the mounting task has the CAP_SYS_ADMIN capability.

Since kernel v4.5, it is possible to disable mandatory locking
altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
built with this option disabled will reject attempts to mount filesystems
with the "mand" mount option with the error status EPERM.
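
At the system call level, "-o mand" corresponds to the MS_MANDLOCK flag of
mount(2). The wrapper below is a minimal sketch with a hypothetical name,
shown only to make the flag concrete; the arguments are supplied by the
caller, and the caller is assumed to hold CAP_SYS_ADMIN.

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Mount "source" on "target" with mandatory locking permitted.
 * Returns 0 on success, -1 with errno set on failure (EPERM, for
 * example, when the caller lacks the required capability).
 */
int mount_with_mand(const char *source, const char *target,
		    const char *fstype)
{
	return mount(source, target, fstype, MS_MANDLOCK, NULL);
}
```

The shell equivalent is "mount -o mand <device> <mountpoint>".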