xref: /OK3568_Linux_fs/kernel/Documentation/filesystems/directory-locking.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun=================
2*4882a593SmuzhiyunDirectory Locking
3*4882a593Smuzhiyun=================
4*4882a593Smuzhiyun
5*4882a593Smuzhiyun
6*4882a593SmuzhiyunLocking scheme used for directory operations is based on two
7*4882a593Smuzhiyunkinds of locks - per-inode (->i_rwsem) and per-filesystem
8*4882a593Smuzhiyun(->s_vfs_rename_mutex).
9*4882a593Smuzhiyun
10*4882a593SmuzhiyunWhen taking the i_rwsem on multiple non-directory objects, we
11*4882a593Smuzhiyunalways acquire the locks in order by increasing address.  We'll call
12*4882a593Smuzhiyunthat "inode pointer" order in the following.
13*4882a593Smuzhiyun
14*4882a593SmuzhiyunFor our purposes all operations fall in 5 classes:
15*4882a593Smuzhiyun
16*4882a593Smuzhiyun1) read access.  Locking rules: caller locks directory we are accessing.
17*4882a593SmuzhiyunThe lock is taken shared.
18*4882a593Smuzhiyun
19*4882a593Smuzhiyun2) object creation.  Locking rules: same as above, but the lock is taken
20*4882a593Smuzhiyunexclusive.
21*4882a593Smuzhiyun
22*4882a593Smuzhiyun3) object removal.  Locking rules: caller locks parent, finds victim,
23*4882a593Smuzhiyunlocks victim and calls the method.  Locks are exclusive.
24*4882a593Smuzhiyun
25*4882a593Smuzhiyun4) rename() that is _not_ cross-directory.  Locking rules: caller locks
26*4882a593Smuzhiyunthe parent and finds source and target.  In case of exchange (with
27*4882a593SmuzhiyunRENAME_EXCHANGE in flags argument) lock both.  In any case,
28*4882a593Smuzhiyunif the target already exists, lock it.  If the source is a non-directory,
29*4882a593Smuzhiyunlock it.  If we need to lock both, lock them in inode pointer order.
30*4882a593SmuzhiyunThen call the method.  All locks are exclusive.
31*4882a593SmuzhiyunNB: we might get away with locking the source (and target in exchange
32*4882a593Smuzhiyuncase) shared.
33*4882a593Smuzhiyun
34*4882a593Smuzhiyun5) link creation.  Locking rules:
35*4882a593Smuzhiyun
36*4882a593Smuzhiyun	* lock parent
37*4882a593Smuzhiyun	* check that source is not a directory
38*4882a593Smuzhiyun	* lock source
39*4882a593Smuzhiyun	* call the method.
40*4882a593Smuzhiyun
41*4882a593SmuzhiyunAll locks are exclusive.
42*4882a593Smuzhiyun
43*4882a593Smuzhiyun6) cross-directory rename.  The trickiest in the whole bunch.  Locking
44*4882a593Smuzhiyunrules:
45*4882a593Smuzhiyun
46*4882a593Smuzhiyun	* lock the filesystem
47*4882a593Smuzhiyun	* lock parents in "ancestors first" order.
48*4882a593Smuzhiyun	* find source and target.
49*4882a593Smuzhiyun	* if old parent is equal to or is a descendent of target
50*4882a593Smuzhiyun	  fail with -ENOTEMPTY
51*4882a593Smuzhiyun	* if new parent is equal to or is a descendent of source
52*4882a593Smuzhiyun	  fail with -ELOOP
53*4882a593Smuzhiyun	* If it's an exchange, lock both the source and the target.
54*4882a593Smuzhiyun	* If the target exists, lock it.  If the source is a non-directory,
55*4882a593Smuzhiyun	  lock it.  If we need to lock both, do so in inode pointer order.
56*4882a593Smuzhiyun	* call the method.
57*4882a593Smuzhiyun
58*4882a593SmuzhiyunAll ->i_rwsem are taken exclusive.  Again, we might get away with locking
59*4882a593Smuzhiyunthe source (and target in exchange case) shared.
60*4882a593Smuzhiyun
61*4882a593SmuzhiyunThe rules above obviously guarantee that all directories that are going to be
62*4882a593Smuzhiyunread, modified or removed by method will be locked by caller.
63*4882a593Smuzhiyun
64*4882a593Smuzhiyun
65*4882a593SmuzhiyunIf no directory is its own ancestor, the scheme above is deadlock-free.
66*4882a593Smuzhiyun
67*4882a593SmuzhiyunProof:
68*4882a593Smuzhiyun
69*4882a593Smuzhiyun	First of all, at any moment we have a partial ordering of the
70*4882a593Smuzhiyun	objects - A < B iff A is an ancestor of B.
71*4882a593Smuzhiyun
72*4882a593Smuzhiyun	That ordering can change.  However, the following is true:
73*4882a593Smuzhiyun
74*4882a593Smuzhiyun(1) if object removal or non-cross-directory rename holds lock on A and
75*4882a593Smuzhiyun    attempts to acquire lock on B, A will remain the parent of B until we
76*4882a593Smuzhiyun    acquire the lock on B.  (Proof: only cross-directory rename can change
77*4882a593Smuzhiyun    the parent of object and it would have to lock the parent).
78*4882a593Smuzhiyun
79*4882a593Smuzhiyun(2) if cross-directory rename holds the lock on filesystem, order will not
80*4882a593Smuzhiyun    change until rename acquires all locks.  (Proof: other cross-directory
81*4882a593Smuzhiyun    renames will be blocked on filesystem lock and we don't start changing
82*4882a593Smuzhiyun    the order until we had acquired all locks).
83*4882a593Smuzhiyun
84*4882a593Smuzhiyun(3) locks on non-directory objects are acquired only after locks on
85*4882a593Smuzhiyun    directory objects, and are acquired in inode pointer order.
86*4882a593Smuzhiyun    (Proof: all operations but renames take lock on at most one
87*4882a593Smuzhiyun    non-directory object, except renames, which take locks on source and
88*4882a593Smuzhiyun    target in inode pointer order in the case they are not directories.)
89*4882a593Smuzhiyun
90*4882a593SmuzhiyunNow consider the minimal deadlock.  Each process is blocked on
91*4882a593Smuzhiyunattempt to acquire some lock and already holds at least one lock.  Let's
92*4882a593Smuzhiyunconsider the set of contended locks.  First of all, filesystem lock is
93*4882a593Smuzhiyunnot contended, since any process blocked on it is not holding any locks.
94*4882a593SmuzhiyunThus all processes are blocked on ->i_rwsem.
95*4882a593Smuzhiyun
96*4882a593SmuzhiyunBy (3), any process holding a non-directory lock can only be
97*4882a593Smuzhiyunwaiting on another non-directory lock with a larger address.  Therefore
98*4882a593Smuzhiyunthe process holding the "largest" such lock can always make progress, and
99*4882a593Smuzhiyunnon-directory objects are not included in the set of contended locks.
100*4882a593Smuzhiyun
101*4882a593SmuzhiyunThus link creation can't be a part of deadlock - it can't be
102*4882a593Smuzhiyunblocked on source and it means that it doesn't hold any locks.
103*4882a593Smuzhiyun
104*4882a593SmuzhiyunAny contended object is either held by cross-directory rename or
105*4882a593Smuzhiyunhas a child that is also contended.  Indeed, suppose that it is held by
106*4882a593Smuzhiyunoperation other than cross-directory rename.  Then the lock this operation
107*4882a593Smuzhiyunis blocked on belongs to child of that object due to (1).
108*4882a593Smuzhiyun
109*4882a593SmuzhiyunIt means that one of the operations is cross-directory rename.
110*4882a593SmuzhiyunOtherwise the set of contended objects would be infinite - each of them
111*4882a593Smuzhiyunwould have a contended child and we had assumed that no object is its
112*4882a593Smuzhiyunown descendent.  Moreover, there is exactly one cross-directory rename
113*4882a593Smuzhiyun(see above).
114*4882a593Smuzhiyun
115*4882a593SmuzhiyunConsider the object blocking the cross-directory rename.  One
116*4882a593Smuzhiyunof its descendents is locked by cross-directory rename (otherwise we
117*4882a593Smuzhiyunwould again have an infinite set of contended objects).  But that
118*4882a593Smuzhiyunmeans that cross-directory rename is taking locks out of order.  Due
119*4882a593Smuzhiyunto (2) the order hadn't changed since we had acquired filesystem lock.
120*4882a593SmuzhiyunBut locking rules for cross-directory rename guarantee that we do not
121*4882a593Smuzhiyuntry to acquire lock on descendent before the lock on ancestor.
122*4882a593SmuzhiyunContradiction.  I.e.  deadlock is impossible.  Q.E.D.
123*4882a593Smuzhiyun
124*4882a593Smuzhiyun
125*4882a593SmuzhiyunThese operations are guaranteed to avoid loop creation.  Indeed,
126*4882a593Smuzhiyunthe only operation that could introduce loops is cross-directory rename.
127*4882a593SmuzhiyunSince the only new (parent, child) pair added by rename() is (new parent,
128*4882a593Smuzhiyunsource), such loop would have to contain these objects and the rest of it
129*4882a593Smuzhiyunwould have to exist before rename().  I.e. at the moment of loop creation
130*4882a593Smuzhiyunrename() responsible for that would be holding filesystem lock and new parent
131*4882a593Smuzhiyunwould have to be equal to or a descendent of source.  But that means that
132*4882a593Smuzhiyunnew parent had been equal to or a descendent of source since the moment when
133*4882a593Smuzhiyunwe had acquired filesystem lock and rename() would fail with -ELOOP in that
134*4882a593Smuzhiyuncase.
135*4882a593Smuzhiyun
136*4882a593SmuzhiyunWhile this locking scheme works for arbitrary DAGs, it relies on
137*4882a593Smuzhiyunability to check that directory is a descendent of another object.  Current
138*4882a593Smuzhiyunimplementation assumes that directory graph is a tree.  This assumption is
139*4882a593Smuzhiyunalso preserved by all operations (cross-directory rename on a tree that would
140*4882a593Smuzhiyunnot introduce a cycle will leave it a tree and link() fails for directories).
141*4882a593Smuzhiyun
142*4882a593SmuzhiyunNotice that "directory" in the above == "anything that might have
143*4882a593Smuzhiyunchildren", so if we are going to introduce hybrid objects we will need
144*4882a593Smuzhiyuneither to make sure that link(2) doesn't work for them or to make changes
145*4882a593Smuzhiyunin is_subdir() that would make it work even in presence of such beasts.
146