xref: /OK3568_Linux_fs/kernel/Documentation/filesystems/path-lookup.txt (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593SmuzhiyunPath walking and name lookup locking
2*4882a593Smuzhiyun====================================
3*4882a593Smuzhiyun
4*4882a593SmuzhiyunPath resolution is the finding a dentry corresponding to a path name string, by
5*4882a593Smuzhiyunperforming a path walk. Typically, for every open(), stat() etc., the path name
6*4882a593Smuzhiyunwill be resolved. Paths are resolved by walking the namespace tree, starting
7*4882a593Smuzhiyunwith the first component of the pathname (eg. root or cwd) with a known dentry,
8*4882a593Smuzhiyunthen finding the child of that dentry, which is named the next component in the
9*4882a593Smuzhiyunpath string. Then repeating the lookup from the child dentry and finding its
10*4882a593Smuzhiyunchild with the next element, and so on.
11*4882a593Smuzhiyun
12*4882a593SmuzhiyunSince it is a frequent operation for workloads like multiuser environments and
13*4882a593Smuzhiyunweb servers, it is important to optimize this code.
14*4882a593Smuzhiyun
15*4882a593SmuzhiyunPath walking synchronisation history:
16*4882a593SmuzhiyunPrior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
17*4882a593Smuzhiyunthus in every component during path look-up. Since 2.5.10 onwards, fast-walk
18*4882a593Smuzhiyunalgorithm changed this by holding the dcache_lock at the beginning and walking
19*4882a593Smuzhiyunas many cached path component dentries as possible. This significantly
20*4882a593Smuzhiyundecreases the number of acquisition of dcache_lock. However it also increases
21*4882a593Smuzhiyunthe lock hold time significantly and affects performance in large SMP machines.
22*4882a593SmuzhiyunSince 2.5.62 kernel, dcache has been using a new locking model that uses RCU to
23*4882a593Smuzhiyunmake dcache look-up lock-free.
24*4882a593Smuzhiyun
25*4882a593SmuzhiyunAll the above algorithms required taking a lock and reference count on the
26*4882a593Smuzhiyundentry that was looked up, so that may be used as the basis for walking the
27*4882a593Smuzhiyunnext path element. This is inefficient and unscalable. It is inefficient
28*4882a593Smuzhiyunbecause of the locks and atomic operations required for every dentry element
29*4882a593Smuzhiyunslows things down. It is not scalable because many parallel applications that
30*4882a593Smuzhiyunare path-walk intensive tend to do path lookups starting from a common dentry
31*4882a593Smuzhiyun(usually, the root "/" or current working directory). So contention on these
32*4882a593Smuzhiyuncommon path elements causes lock and cacheline queueing.
33*4882a593Smuzhiyun
34*4882a593SmuzhiyunSince 2.6.38, RCU is used to make a significant part of the entire path walk
35*4882a593Smuzhiyun(including dcache look-up) completely "store-free" (so, no locks, atomics, or
36*4882a593Smuzhiyuneven stores into cachelines of common dentries). This is known as "rcu-walk"
37*4882a593Smuzhiyunpath walking.
38*4882a593Smuzhiyun
39*4882a593SmuzhiyunPath walking overview
40*4882a593Smuzhiyun=====================
41*4882a593Smuzhiyun
42*4882a593SmuzhiyunA name string specifies a start (root directory, cwd, fd-relative) and a
43*4882a593Smuzhiyunsequence of elements (directory entry names), which together refer to a path in
44*4882a593Smuzhiyunthe namespace. A path is represented as a (dentry, vfsmount) tuple. The name
45*4882a593Smuzhiyunelements are sub-strings, separated by '/'.
46*4882a593Smuzhiyun
47*4882a593SmuzhiyunName lookups will want to find a particular path that a name string refers to
48*4882a593Smuzhiyun(usually the final element, or parent of final element). This is done by taking
49*4882a593Smuzhiyunthe path given by the name's starting point (which we know in advance -- eg.
50*4882a593Smuzhiyuncurrent->fs->cwd or current->fs->root) as the first parent of the lookup. Then
51*4882a593Smuzhiyuniteratively for each subsequent name element, look up the child of the current
52*4882a593Smuzhiyunparent with the given name and if it is not the desired entry, make it the
53*4882a593Smuzhiyunparent for the next lookup.
54*4882a593Smuzhiyun
55*4882a593SmuzhiyunA parent, of course, must be a directory, and we must have appropriate
56*4882a593Smuzhiyunpermissions on the parent inode to be able to walk into it.
57*4882a593Smuzhiyun
58*4882a593SmuzhiyunTurning the child into a parent for the next lookup requires more checks and
59*4882a593Smuzhiyunprocedures. Symlinks essentially substitute the symlink name for the target
60*4882a593Smuzhiyunname in the name string, and require some recursive path walking.  Mount points
61*4882a593Smuzhiyunmust be followed into (thus changing the vfsmount that subsequent path elements
62*4882a593Smuzhiyunrefer to), switching from the mount point path to the root of the particular
63*4882a593Smuzhiyunmounted vfsmount. These behaviours are variously modified depending on the
64*4882a593Smuzhiyunexact path walking flags.
65*4882a593Smuzhiyun
66*4882a593SmuzhiyunPath walking then must, broadly, do several particular things:
67*4882a593Smuzhiyun- find the start point of the walk;
68*4882a593Smuzhiyun- perform permissions and validity checks on inodes;
69*4882a593Smuzhiyun- perform dcache hash name lookups on (parent, name element) tuples;
70*4882a593Smuzhiyun- traverse mount points;
71*4882a593Smuzhiyun- traverse symlinks;
72*4882a593Smuzhiyun- lookup and create missing parts of the path on demand.
73*4882a593Smuzhiyun
74*4882a593SmuzhiyunSafe store-free look-up of dcache hash table
75*4882a593Smuzhiyun============================================
76*4882a593Smuzhiyun
77*4882a593SmuzhiyunDcache name lookup
78*4882a593Smuzhiyun------------------
79*4882a593SmuzhiyunIn order to lookup a dcache (parent, name) tuple, we take a hash on the tuple
80*4882a593Smuzhiyunand use that to select a bucket in the dcache-hash table. The list of entries
81*4882a593Smuzhiyunin that bucket is then walked, and we do a full comparison of each entry
82*4882a593Smuzhiyunagainst our (parent, name) tuple.
83*4882a593Smuzhiyun
84*4882a593SmuzhiyunThe hash lists are RCU protected, so list walking is not serialised with
85*4882a593Smuzhiyunconcurrent updates (insertion, deletion from the hash). This is a standard RCU
86*4882a593Smuzhiyunlist application with the exception of renames, which will be covered below.
87*4882a593Smuzhiyun
88*4882a593SmuzhiyunParent and name members of a dentry, as well as its membership in the dcache
89*4882a593Smuzhiyunhash, and its inode are protected by the per-dentry d_lock spinlock. A
90*4882a593Smuzhiyunreference is taken on the dentry (while the fields are verified under d_lock),
91*4882a593Smuzhiyunand this stabilises its d_inode pointer and actual inode. This gives a stable
92*4882a593Smuzhiyunpoint to perform the next step of our path walk against.
93*4882a593Smuzhiyun
94*4882a593SmuzhiyunThese members are also protected by d_seq seqlock, although this offers
95*4882a593Smuzhiyunread-only protection and no durability of results, so care must be taken when
96*4882a593Smuzhiyunusing d_seq for synchronisation (see seqcount based lookups, below).
97*4882a593Smuzhiyun
98*4882a593SmuzhiyunRenames
99*4882a593Smuzhiyun-------
100*4882a593SmuzhiyunBack to the rename case. In usual RCU protected lists, the only operations that
101*4882a593Smuzhiyunwill happen to an object is insertion, and then eventually removal from the
102*4882a593Smuzhiyunlist. The object will not be reused until an RCU grace period is complete.
103*4882a593SmuzhiyunThis ensures the RCU list traversal primitives can run over the object without
104*4882a593Smuzhiyunproblems (see RCU documentation for how this works).
105*4882a593Smuzhiyun
106*4882a593SmuzhiyunHowever when a dentry is renamed, its hash value can change, requiring it to be
107*4882a593Smuzhiyunmoved to a new hash list. Allocating and inserting a new alias would be
108*4882a593Smuzhiyunexpensive and also problematic for directory dentries. Latency would be far to
109*4882a593Smuzhiyunhigh to wait for a grace period after removing the dentry and before inserting
110*4882a593Smuzhiyunit in the new hash bucket. So what is done is to insert the dentry into the
111*4882a593Smuzhiyunnew list immediately.
112*4882a593Smuzhiyun
113*4882a593SmuzhiyunHowever, when the dentry's list pointers are updated to point to objects in the
114*4882a593Smuzhiyunnew list before waiting for a grace period, this can result in a concurrent RCU
115*4882a593Smuzhiyunlookup of the old list veering off into the new (incorrect) list and missing
116*4882a593Smuzhiyunthe remaining dentries on the list.
117*4882a593Smuzhiyun
118*4882a593SmuzhiyunThere is no fundamental problem with walking down the wrong list, because the
119*4882a593Smuzhiyundentry comparisons will never match. However it is fatal to miss a matching
120*4882a593Smuzhiyundentry. So a seqlock is used to detect when a rename has occurred, and so the
121*4882a593Smuzhiyunlookup can be retried.
122*4882a593Smuzhiyun
123*4882a593Smuzhiyun         1      2      3
124*4882a593Smuzhiyun        +---+  +---+  +---+
125*4882a593Smuzhiyunhlist-->| N-+->| N-+->| N-+->
126*4882a593Smuzhiyunhead <--+-P |<-+-P |<-+-P |
127*4882a593Smuzhiyun        +---+  +---+  +---+
128*4882a593Smuzhiyun
129*4882a593SmuzhiyunRename of dentry 2 may require it deleted from the above list, and inserted
130*4882a593Smuzhiyuninto a new list. Deleting 2 gives the following list.
131*4882a593Smuzhiyun
132*4882a593Smuzhiyun         1             3
133*4882a593Smuzhiyun        +---+         +---+     (don't worry, the longer pointers do not
134*4882a593Smuzhiyunhlist-->| N-+-------->| N-+->    impose a measurable performance overhead
135*4882a593Smuzhiyunhead <--+-P |<--------+-P |      on modern CPUs)
136*4882a593Smuzhiyun        +---+         +---+
137*4882a593Smuzhiyun          ^      2      ^
138*4882a593Smuzhiyun          |    +---+    |
139*4882a593Smuzhiyun          |    | N-+----+
140*4882a593Smuzhiyun          +----+-P |
141*4882a593Smuzhiyun               +---+
142*4882a593Smuzhiyun
143*4882a593SmuzhiyunThis is a standard RCU-list deletion, which leaves the deleted object's
144*4882a593Smuzhiyunpointers intact, so a concurrent list walker that is currently looking at
145*4882a593Smuzhiyunobject 2 will correctly continue to object 3 when it is time to traverse the
146*4882a593Smuzhiyunnext object.
147*4882a593Smuzhiyun
148*4882a593SmuzhiyunHowever, when inserting object 2 onto a new list, we end up with this:
149*4882a593Smuzhiyun
150*4882a593Smuzhiyun         1             3
151*4882a593Smuzhiyun        +---+         +---+
152*4882a593Smuzhiyunhlist-->| N-+-------->| N-+->
153*4882a593Smuzhiyunhead <--+-P |<--------+-P |
154*4882a593Smuzhiyun        +---+         +---+
155*4882a593Smuzhiyun                 2
156*4882a593Smuzhiyun               +---+
157*4882a593Smuzhiyun               | N-+---->
158*4882a593Smuzhiyun          <----+-P |
159*4882a593Smuzhiyun               +---+
160*4882a593Smuzhiyun
161*4882a593SmuzhiyunBecause we didn't wait for a grace period, there may be a concurrent lookup
162*4882a593Smuzhiyunstill at 2. Now when it follows 2's 'next' pointer, it will walk off into
163*4882a593Smuzhiyunanother list without ever having checked object 3.
164*4882a593Smuzhiyun
165*4882a593SmuzhiyunA related, but distinctly different, issue is that of rename atomicity versus
166*4882a593Smuzhiyunlookup operations. If a file is renamed from 'A' to 'B', a lookup must only
167*4882a593Smuzhiyunfind either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
168*4882a593Smuzhiyunof 'B' must succeed (note the reverse is not true).
169*4882a593Smuzhiyun
170*4882a593SmuzhiyunBetween deleting the dentry from the old hash list, and inserting it on the new
171*4882a593Smuzhiyunhash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
172*4882a593Smuzhiyunrename seqlock is also used to cover this race in much the same way, by
173*4882a593Smuzhiyunretrying a negative lookup result if a rename was in progress.
174*4882a593Smuzhiyun
175*4882a593SmuzhiyunSeqcount based lookups
176*4882a593Smuzhiyun----------------------
177*4882a593SmuzhiyunIn refcount based dcache lookups, d_lock is used to serialise access to
178*4882a593Smuzhiyunthe dentry, stabilising it while comparing its name and parent and then
179*4882a593Smuzhiyuntaking a reference count (the reference count then gives a stable place to
180*4882a593Smuzhiyunstart the next part of the path walk from).
181*4882a593Smuzhiyun
182*4882a593SmuzhiyunAs explained above, we would like to do path walking without taking locks or
183*4882a593Smuzhiyunreference counts on intermediate dentries along the path. To do this, a per
184*4882a593Smuzhiyundentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry
185*4882a593Smuzhiyunlooks like (its name, parent, and inode). That snapshot is then used to start
186*4882a593Smuzhiyunthe next part of the path walk. When loading the coherent snapshot under d_seq,
187*4882a593Smuzhiyuncare must be taken to load the members up-front, and use those pointers rather
188*4882a593Smuzhiyunthan reloading from the dentry later on (otherwise we'd have interesting things
189*4882a593Smuzhiyunlike d_inode going NULL underneath us, if the name was unlinked).
190*4882a593Smuzhiyun
191*4882a593SmuzhiyunAlso important is to avoid performing any destructive operations (pretty much:
192*4882a593Smuzhiyunno non-atomic stores to shared data), and to recheck the seqcount when we are
193*4882a593Smuzhiyun"done" with the operation. Retry or abort if the seqcount does not match.
194*4882a593SmuzhiyunAvoiding destructive or changing operations means we can easily unwind from
195*4882a593Smuzhiyunfailure.
196*4882a593Smuzhiyun
197*4882a593SmuzhiyunWhat this means is that a caller, provided they are holding RCU lock to
198*4882a593Smuzhiyunprotect the dentry object from disappearing, can perform a seqcount based
199*4882a593Smuzhiyunlookup which does not increment the refcount on the dentry or write to
200*4882a593Smuzhiyunit in any way. This returned dentry can be used for subsequent operations,
201*4882a593Smuzhiyunprovided that d_seq is rechecked after that operation is complete.
202*4882a593Smuzhiyun
203*4882a593SmuzhiyunInodes are also rcu freed, so the seqcount lookup dentry's inode may also be
204*4882a593Smuzhiyunqueried for permissions.
205*4882a593Smuzhiyun
206*4882a593SmuzhiyunWith this two parts of the puzzle, we can do path lookups without taking
207*4882a593Smuzhiyunlocks or refcounts on dentry elements.
208*4882a593Smuzhiyun
209*4882a593SmuzhiyunRCU-walk path walking design
210*4882a593Smuzhiyun============================
211*4882a593Smuzhiyun
212*4882a593SmuzhiyunPath walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
213*4882a593Smuzhiyunis the traditional[*] way of performing dcache lookups using d_lock to
214*4882a593Smuzhiyunserialise concurrent modifications to the dentry and take a reference count on
215*4882a593Smuzhiyunit. ref-walk is simple and obvious, and may sleep, take locks, etc while path
216*4882a593Smuzhiyunwalking is operating on each dentry. rcu-walk uses seqcount based dentry
217*4882a593Smuzhiyunlookups, and can perform lookup of intermediate elements without any stores to
218*4882a593Smuzhiyunshared data in the dentry or inode. rcu-walk can not be applied to all cases,
219*4882a593Smuzhiyuneg. if the filesystem must sleep or perform non trivial operations, rcu-walk
220*4882a593Smuzhiyunmust be switched to ref-walk mode.
221*4882a593Smuzhiyun
222*4882a593Smuzhiyun[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
223*4882a593Smuzhiyun    path walk.
224*4882a593Smuzhiyun
225*4882a593SmuzhiyunWhere ref-walk uses a stable, refcounted ``parent'' to walk the remaining
226*4882a593Smuzhiyunpath string, rcu-walk uses a d_seq protected snapshot. When looking up a
227*4882a593Smuzhiyunchild of this parent snapshot, we open d_seq critical section on the child
228*4882a593Smuzhiyunbefore closing d_seq critical section on the parent. This gives an interlocking
229*4882a593Smuzhiyunladder of snapshots to walk down.
230*4882a593Smuzhiyun
231*4882a593Smuzhiyun
232*4882a593Smuzhiyun     proc 101
233*4882a593Smuzhiyun      /----------------\
234*4882a593Smuzhiyun     / comm:    "vi"    \
235*4882a593Smuzhiyun    /  fs.root: dentry0  \
236*4882a593Smuzhiyun    \  fs.cwd:  dentry2  /
237*4882a593Smuzhiyun     \                  /
238*4882a593Smuzhiyun      \----------------/
239*4882a593Smuzhiyun
240*4882a593SmuzhiyunSo when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will
241*4882a593Smuzhiyunstart from current->fs->root, which is a pinned dentry. Alternatively,
242*4882a593Smuzhiyun"./test.c" would start from cwd; both names refer to the same path in
243*4882a593Smuzhiyunthe context of proc101.
244*4882a593Smuzhiyun
245*4882a593Smuzhiyun     dentry 0
246*4882a593Smuzhiyun    +---------------------+   rcu-walk begins here, we note d_seq, check the
247*4882a593Smuzhiyun    | name:    "/"        |   inode's permission, and then look up the next
248*4882a593Smuzhiyun    | inode:   10         |   path element which is "home"...
249*4882a593Smuzhiyun    | children:"home", ...|
250*4882a593Smuzhiyun    +---------------------+
251*4882a593Smuzhiyun              |
252*4882a593Smuzhiyun     dentry 1 V
253*4882a593Smuzhiyun    +---------------------+   ... which brings us here. We find dentry1 via
254*4882a593Smuzhiyun    | name:    "home"     |   hash lookup, then note d_seq and compare name
255*4882a593Smuzhiyun    | inode:   678        |   string and parent pointer. When we have a match,
256*4882a593Smuzhiyun    | children:"npiggin"  |   we now recheck the d_seq of dentry0. Then we
257*4882a593Smuzhiyun    +---------------------+   check inode and look up the next element.
258*4882a593Smuzhiyun              |
259*4882a593Smuzhiyun     dentry2  V
260*4882a593Smuzhiyun    +---------------------+   Note: if dentry0 is now modified, lookup is
261*4882a593Smuzhiyun    | name:    "npiggin"  |   not necessarily invalid, so we need only keep a
262*4882a593Smuzhiyun    | inode:   543        |   parent for d_seq verification, and grandparents
263*4882a593Smuzhiyun    | children:"a.c", ... |   can be forgotten.
264*4882a593Smuzhiyun    +---------------------+
265*4882a593Smuzhiyun              |
266*4882a593Smuzhiyun     dentry3  V
267*4882a593Smuzhiyun    +---------------------+   At this point we have our destination dentry.
268*4882a593Smuzhiyun    | name:    "a.c"      |   We now take its d_lock, verify d_seq of this
269*4882a593Smuzhiyun    | inode:   14221      |   dentry. If that checks out, we can increment
270*4882a593Smuzhiyun    | children:NULL       |   its refcount because we're holding d_lock.
271*4882a593Smuzhiyun    +---------------------+
272*4882a593Smuzhiyun
273*4882a593SmuzhiyunTaking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
274*4882a593Smuzhiyunre-checking its d_seq, and then incrementing its refcount is called
275*4882a593Smuzhiyun"dropping rcu" or dropping from rcu-walk into ref-walk mode.
276*4882a593Smuzhiyun
277*4882a593SmuzhiyunIt is, in some sense, a bit of a house of cards. If the seqcount check of the
278*4882a593Smuzhiyunparent snapshot fails, the house comes down, because we had closed the d_seq
279*4882a593Smuzhiyunsection on the grandparent, so we have nothing left to stand on. In that case,
280*4882a593Smuzhiyunthe path walk must be fully restarted (which we do in ref-walk mode, to avoid
281*4882a593Smuzhiyunlive locks). It is costly to have a full restart, but fortunately they are
282*4882a593Smuzhiyunquite rare.
283*4882a593Smuzhiyun
284*4882a593SmuzhiyunWhen we reach a point where sleeping is required, or a filesystem callout
285*4882a593Smuzhiyunrequires ref-walk, then instead of restarting the walk, we attempt to drop rcu
286*4882a593Smuzhiyunat the last known good dentry we have. Avoiding a full restart in ref-walk in
287*4882a593Smuzhiyunthese cases is fundamental for performance and scalability because blocking
288*4882a593Smuzhiyunoperations such as creates and unlinks are not uncommon.
289*4882a593Smuzhiyun
290*4882a593SmuzhiyunThe detailed design for rcu-walk is like this:
291*4882a593Smuzhiyun* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
292*4882a593Smuzhiyun* Take the RCU lock for the entire path walk, starting with the acquiring
293*4882a593Smuzhiyun  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
294*4882a593Smuzhiyun  not required for dentry persistence.
295*4882a593Smuzhiyun* synchronize_rcu is called when unregistering a filesystem, so we can
296*4882a593Smuzhiyun  access d_ops and i_ops during rcu-walk.
297*4882a593Smuzhiyun* Similarly take the vfsmount lock for the entire path walk. So now mnt
298*4882a593Smuzhiyun  refcounts are not required for persistence. Also we are free to perform mount
299*4882a593Smuzhiyun  lookups, and to assume dentry mount points and mount roots are stable up and
300*4882a593Smuzhiyun  down the path.
301*4882a593Smuzhiyun* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
302*4882a593Smuzhiyun  so we can load this tuple atomically, and also check whether any of its
303*4882a593Smuzhiyun  members have changed.
304*4882a593Smuzhiyun* Dentry lookups (based on parent, candidate string tuple) recheck the parent
305*4882a593Smuzhiyun  sequence after the child is found in case anything changed in the parent
306*4882a593Smuzhiyun  during the path walk.
307*4882a593Smuzhiyun* inode is also RCU protected so we can load d_inode and use the inode for
308*4882a593Smuzhiyun  limited things.
309*4882a593Smuzhiyun* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
310*4882a593Smuzhiyun* i_op can be loaded.
311*4882a593Smuzhiyun* When the destination dentry is reached, drop rcu there (ie. take d_lock,
312*4882a593Smuzhiyun  verify d_seq, increment refcount).
313*4882a593Smuzhiyun* If seqlock verification fails anywhere along the path, do a full restart
314*4882a593Smuzhiyun  of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
315*4882a593Smuzhiyun  a better errno) to signal an rcu-walk failure.
316*4882a593Smuzhiyun
317*4882a593SmuzhiyunThe cases where rcu-walk cannot continue are:
318*4882a593Smuzhiyun* NULL dentry (ie. any uncached path element)
319*4882a593Smuzhiyun* Following links
320*4882a593Smuzhiyun
321*4882a593SmuzhiyunIt may be possible eventually to make following links rcu-walk aware.
322*4882a593Smuzhiyun
323*4882a593SmuzhiyunUncached path elements will always require dropping to ref-walk mode, at the
324*4882a593Smuzhiyunvery least because i_mutex needs to be grabbed, and objects allocated.
325*4882a593Smuzhiyun
326*4882a593SmuzhiyunFinal note:
327*4882a593Smuzhiyun"store-free" path walking is not strictly store free. We take vfsmount lock
328*4882a593Smuzhiyunand refcounts (both of which can be made per-cpu), and we also store to the
329*4882a593Smuzhiyunstack (which is essentially CPU-local), and we also have to take locks and
330*4882a593Smuzhiyunrefcount on final dentry.
331*4882a593Smuzhiyun
332*4882a593SmuzhiyunThe point is that shared data, where practically possible, is not locked
333*4882a593Smuzhiyunor stored into. The result is massive improvements in performance and
334*4882a593Smuzhiyunscalability of path resolution.
335*4882a593Smuzhiyun
336*4882a593Smuzhiyun
337*4882a593SmuzhiyunInteresting statistics
338*4882a593Smuzhiyun======================
339*4882a593Smuzhiyun
340*4882a593SmuzhiyunThe following table gives rcu lookup statistics for a few simple workloads
341*4882a593Smuzhiyun(2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to
342*4882a593Smuzhiyundrop rcu that fail due to d_seq failure and requiring the entire path lookup
343*4882a593Smuzhiyunagain. Other cases are successful rcu-drops that are required before the final
344*4882a593Smuzhiyunelement, nodentry for missing dentry, revalidate for filesystem revalidate
345*4882a593Smuzhiyunroutine requiring rcu drop, permission for permission check requiring drop,
346*4882a593Smuzhiyunand link for symlink traversal requiring drop.
347*4882a593Smuzhiyun
348*4882a593Smuzhiyun     rcu-lookups     restart  nodentry          link  revalidate  permission
349*4882a593Smuzhiyunbootup     47121           0      4624          1010       10283        7852
350*4882a593Smuzhiyundbench  25386793           0   6778659(26.7%)     55         549        1156
351*4882a593Smuzhiyunkbuild   2696672          10     64442(2.3%)  108764(4.0%)     1        1590
352*4882a593Smuzhiyungit diff   39605           0        28             2           0         106
353*4882a593Smuzhiyunvfstest 24185492        4945    708725(2.9%) 1076136(4.4%)     0        2651
354*4882a593Smuzhiyun
355*4882a593SmuzhiyunWhat this shows is that failed rcu-walk lookups, ie. ones that are restarted
356*4882a593Smuzhiyunentirely with ref-walk, are quite rare. Even the "vfstest" case which
357*4882a593Smuzhiyunspecifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise
358*4882a593Smuzhiyunsuch races is not showing a huge amount of restarts.
359*4882a593Smuzhiyun
360*4882a593SmuzhiyunDropping from rcu-walk to ref-walk mean that we have encountered a dentry where
361*4882a593Smuzhiyunthe reference count needs to be taken for some reason. This is either because
362*4882a593Smuzhiyunwe have reached the target of the path walk, or because we have encountered a
363*4882a593Smuzhiyuncondition that can't be resolved in rcu-walk mode.  Ideally, we drop rcu-walk
364*4882a593Smuzhiyunonly when we have reached the target dentry, so the other statistics show where
365*4882a593Smuzhiyunthis does not happen.
366*4882a593Smuzhiyun
367*4882a593SmuzhiyunNote that a graceful drop from rcu-walk mode due to something such as the
368*4882a593Smuzhiyundentry not existing (which can be common) is not necessarily a failure of
369*4882a593Smuzhiyunrcu-walk scheme, because some elements of the path may have been walked in
370*4882a593Smuzhiyunrcu-walk mode. The further we get from common path elements (such as cwd or
371*4882a593Smuzhiyunroot), the less contended the dentry is likely to be. The closer we are to
372*4882a593Smuzhiyuncommon path elements, the more likely they will exist in dentry cache.
373*4882a593Smuzhiyun
374*4882a593Smuzhiyun
375*4882a593SmuzhiyunPapers and other documentation on dcache locking
376*4882a593Smuzhiyun================================================
377*4882a593Smuzhiyun
378*4882a593Smuzhiyun1. Scaling dcache with RCU (https://linuxjournal.com/article.php?sid=7124).
379*4882a593Smuzhiyun
380*4882a593Smuzhiyun2. http://lse.sourceforge.net/locking/dcache/dcache.html
381*4882a593Smuzhiyun
382*4882a593Smuzhiyun3. path-lookup.md in this directory.
383