1*4882a593Smuzhiyun================= 2*4882a593SmuzhiyunDirectory Locking 3*4882a593Smuzhiyun================= 4*4882a593Smuzhiyun 5*4882a593Smuzhiyun 6*4882a593SmuzhiyunLocking scheme used for directory operations is based on two 7*4882a593Smuzhiyunkinds of locks - per-inode (->i_rwsem) and per-filesystem 8*4882a593Smuzhiyun(->s_vfs_rename_mutex). 9*4882a593Smuzhiyun 10*4882a593SmuzhiyunWhen taking the i_rwsem on multiple non-directory objects, we 11*4882a593Smuzhiyunalways acquire the locks in order by increasing address. We'll call 12*4882a593Smuzhiyunthat "inode pointer" order in the following. 13*4882a593Smuzhiyun 14*4882a593SmuzhiyunFor our purposes all operations fall in 5 classes: 15*4882a593Smuzhiyun 16*4882a593Smuzhiyun1) read access. Locking rules: caller locks directory we are accessing. 17*4882a593SmuzhiyunThe lock is taken shared. 18*4882a593Smuzhiyun 19*4882a593Smuzhiyun2) object creation. Locking rules: same as above, but the lock is taken 20*4882a593Smuzhiyunexclusive. 21*4882a593Smuzhiyun 22*4882a593Smuzhiyun3) object removal. Locking rules: caller locks parent, finds victim, 23*4882a593Smuzhiyunlocks victim and calls the method. Locks are exclusive. 24*4882a593Smuzhiyun 25*4882a593Smuzhiyun4) rename() that is _not_ cross-directory. Locking rules: caller locks 26*4882a593Smuzhiyunthe parent and finds source and target. In case of exchange (with 27*4882a593SmuzhiyunRENAME_EXCHANGE in flags argument) lock both. In any case, 28*4882a593Smuzhiyunif the target already exists, lock it. If the source is a non-directory, 29*4882a593Smuzhiyunlock it. If we need to lock both, lock them in inode pointer order. 30*4882a593SmuzhiyunThen call the method. All locks are exclusive. 31*4882a593SmuzhiyunNB: we might get away with locking the source (and target in exchange 32*4882a593Smuzhiyuncase) shared. 33*4882a593Smuzhiyun 34*4882a593Smuzhiyun5) link creation. Locking rules: 35*4882a593Smuzhiyun 36*4882a593Smuzhiyun * lock parent 37*4882a593Smuzhiyun * check that source is not a directory 38*4882a593Smuzhiyun * lock source 39*4882a593Smuzhiyun * call the method. 40*4882a593Smuzhiyun 41*4882a593SmuzhiyunAll locks are exclusive. 42*4882a593Smuzhiyun 43*4882a593Smuzhiyun6) cross-directory rename. The trickiest in the whole bunch. Locking 44*4882a593Smuzhiyunrules: 45*4882a593Smuzhiyun 46*4882a593Smuzhiyun * lock the filesystem 47*4882a593Smuzhiyun * lock parents in "ancestors first" order. 48*4882a593Smuzhiyun * find source and target. 49*4882a593Smuzhiyun * if old parent is equal to or is a descendent of target 50*4882a593Smuzhiyun fail with -ENOTEMPTY 51*4882a593Smuzhiyun * if new parent is equal to or is a descendent of source 52*4882a593Smuzhiyun fail with -ELOOP 53*4882a593Smuzhiyun * If it's an exchange, lock both the source and the target. 54*4882a593Smuzhiyun * If the target exists, lock it. If the source is a non-directory, 55*4882a593Smuzhiyun lock it. If we need to lock both, do so in inode pointer order. 56*4882a593Smuzhiyun * call the method. 57*4882a593Smuzhiyun 58*4882a593SmuzhiyunAll ->i_rwsem are taken exclusive. Again, we might get away with locking 59*4882a593Smuzhiyunthe source (and target in exchange case) shared. 60*4882a593Smuzhiyun 61*4882a593SmuzhiyunThe rules above obviously guarantee that all directories that are going to be 62*4882a593Smuzhiyunread, modified or removed by method will be locked by caller. 63*4882a593Smuzhiyun 64*4882a593Smuzhiyun 65*4882a593SmuzhiyunIf no directory is its own ancestor, the scheme above is deadlock-free. 66*4882a593Smuzhiyun 67*4882a593SmuzhiyunProof: 68*4882a593Smuzhiyun 69*4882a593Smuzhiyun First of all, at any moment we have a partial ordering of the 70*4882a593Smuzhiyun objects - A < B iff A is an ancestor of B. 71*4882a593Smuzhiyun 72*4882a593Smuzhiyun That ordering can change. However, the following is true: 73*4882a593Smuzhiyun 74*4882a593Smuzhiyun(1) if object removal or non-cross-directory rename holds lock on A and 75*4882a593Smuzhiyun attempts to acquire lock on B, A will remain the parent of B until we 76*4882a593Smuzhiyun acquire the lock on B. (Proof: only cross-directory rename can change 77*4882a593Smuzhiyun the parent of object and it would have to lock the parent). 78*4882a593Smuzhiyun 79*4882a593Smuzhiyun(2) if cross-directory rename holds the lock on filesystem, order will not 80*4882a593Smuzhiyun change until rename acquires all locks. (Proof: other cross-directory 81*4882a593Smuzhiyun renames will be blocked on filesystem lock and we don't start changing 82*4882a593Smuzhiyun the order until we had acquired all locks). 83*4882a593Smuzhiyun 84*4882a593Smuzhiyun(3) locks on non-directory objects are acquired only after locks on 85*4882a593Smuzhiyun directory objects, and are acquired in inode pointer order. 86*4882a593Smuzhiyun (Proof: all operations but renames take lock on at most one 87*4882a593Smuzhiyun non-directory object, except renames, which take locks on source and 88*4882a593Smuzhiyun target in inode pointer order in the case they are not directories.) 89*4882a593Smuzhiyun 90*4882a593SmuzhiyunNow consider the minimal deadlock. Each process is blocked on 91*4882a593Smuzhiyunattempt to acquire some lock and already holds at least one lock. Let's 92*4882a593Smuzhiyunconsider the set of contended locks. First of all, filesystem lock is 93*4882a593Smuzhiyunnot contended, since any process blocked on it is not holding any locks. 94*4882a593SmuzhiyunThus all processes are blocked on ->i_rwsem. 95*4882a593Smuzhiyun 96*4882a593SmuzhiyunBy (3), any process holding a non-directory lock can only be 97*4882a593Smuzhiyunwaiting on another non-directory lock with a larger address. Therefore 98*4882a593Smuzhiyunthe process holding the "largest" such lock can always make progress, and 99*4882a593Smuzhiyunnon-directory objects are not included in the set of contended locks. 100*4882a593Smuzhiyun 101*4882a593SmuzhiyunThus link creation can't be a part of deadlock - it can't be 102*4882a593Smuzhiyunblocked on source and it means that it doesn't hold any locks. 103*4882a593Smuzhiyun 104*4882a593SmuzhiyunAny contended object is either held by cross-directory rename or 105*4882a593Smuzhiyunhas a child that is also contended. Indeed, suppose that it is held by 106*4882a593Smuzhiyunoperation other than cross-directory rename. Then the lock this operation 107*4882a593Smuzhiyunis blocked on belongs to child of that object due to (1). 108*4882a593Smuzhiyun 109*4882a593SmuzhiyunIt means that one of the operations is cross-directory rename. 110*4882a593SmuzhiyunOtherwise the set of contended objects would be infinite - each of them 111*4882a593Smuzhiyunwould have a contended child and we had assumed that no object is its 112*4882a593Smuzhiyunown descendent. Moreover, there is exactly one cross-directory rename 113*4882a593Smuzhiyun(see above). 114*4882a593Smuzhiyun 115*4882a593SmuzhiyunConsider the object blocking the cross-directory rename. One 116*4882a593Smuzhiyunof its descendents is locked by cross-directory rename (otherwise we 117*4882a593Smuzhiyunwould again have an infinite set of contended objects). But that 118*4882a593Smuzhiyunmeans that cross-directory rename is taking locks out of order. Due 119*4882a593Smuzhiyunto (2) the order hadn't changed since we had acquired filesystem lock. 120*4882a593SmuzhiyunBut locking rules for cross-directory rename guarantee that we do not 121*4882a593Smuzhiyuntry to acquire lock on descendent before the lock on ancestor. 122*4882a593SmuzhiyunContradiction. I.e. deadlock is impossible. Q.E.D. 123*4882a593Smuzhiyun 124*4882a593Smuzhiyun 125*4882a593SmuzhiyunThese operations are guaranteed to avoid loop creation. Indeed, 126*4882a593Smuzhiyunthe only operation that could introduce loops is cross-directory rename. 127*4882a593SmuzhiyunSince the only new (parent, child) pair added by rename() is (new parent, 128*4882a593Smuzhiyunsource), such loop would have to contain these objects and the rest of it 129*4882a593Smuzhiyunwould have to exist before rename(). I.e. at the moment of loop creation 130*4882a593Smuzhiyunrename() responsible for that would be holding filesystem lock and new parent 131*4882a593Smuzhiyunwould have to be equal to or a descendent of source. But that means that 132*4882a593Smuzhiyunnew parent had been equal to or a descendent of source since the moment when 133*4882a593Smuzhiyunwe had acquired filesystem lock and rename() would fail with -ELOOP in that 134*4882a593Smuzhiyuncase. 135*4882a593Smuzhiyun 136*4882a593SmuzhiyunWhile this locking scheme works for arbitrary DAGs, it relies on 137*4882a593Smuzhiyunability to check that directory is a descendent of another object. Current 138*4882a593Smuzhiyunimplementation assumes that directory graph is a tree. This assumption is 139*4882a593Smuzhiyunalso preserved by all operations (cross-directory rename on a tree that would 140*4882a593Smuzhiyunnot introduce a cycle will leave it a tree and link() fails for directories). 141*4882a593Smuzhiyun 142*4882a593SmuzhiyunNotice that "directory" in the above == "anything that might have 143*4882a593Smuzhiyunchildren", so if we are going to introduce hybrid objects we will need 144*4882a593Smuzhiyuneither to make sure that link(2) doesn't work for them or to make changes 145*4882a593Smuzhiyunin is_subdir() that would make it work even in presence of such beasts. 146