1*4882a593SmuzhiyunPath walking and name lookup locking 2*4882a593Smuzhiyun==================================== 3*4882a593Smuzhiyun 4*4882a593SmuzhiyunPath resolution is the finding a dentry corresponding to a path name string, by 5*4882a593Smuzhiyunperforming a path walk. Typically, for every open(), stat() etc., the path name 6*4882a593Smuzhiyunwill be resolved. Paths are resolved by walking the namespace tree, starting 7*4882a593Smuzhiyunwith the first component of the pathname (eg. root or cwd) with a known dentry, 8*4882a593Smuzhiyunthen finding the child of that dentry, which is named the next component in the 9*4882a593Smuzhiyunpath string. Then repeating the lookup from the child dentry and finding its 10*4882a593Smuzhiyunchild with the next element, and so on. 11*4882a593Smuzhiyun 12*4882a593SmuzhiyunSince it is a frequent operation for workloads like multiuser environments and 13*4882a593Smuzhiyunweb servers, it is important to optimize this code. 14*4882a593Smuzhiyun 15*4882a593SmuzhiyunPath walking synchronisation history: 16*4882a593SmuzhiyunPrior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and 17*4882a593Smuzhiyunthus in every component during path look-up. Since 2.5.10 onwards, fast-walk 18*4882a593Smuzhiyunalgorithm changed this by holding the dcache_lock at the beginning and walking 19*4882a593Smuzhiyunas many cached path component dentries as possible. This significantly 20*4882a593Smuzhiyundecreases the number of acquisition of dcache_lock. However it also increases 21*4882a593Smuzhiyunthe lock hold time significantly and affects performance in large SMP machines. 22*4882a593SmuzhiyunSince 2.5.62 kernel, dcache has been using a new locking model that uses RCU to 23*4882a593Smuzhiyunmake dcache look-up lock-free. 24*4882a593Smuzhiyun 25*4882a593SmuzhiyunAll the above algorithms required taking a lock and reference count on the 26*4882a593Smuzhiyundentry that was looked up, so that may be used as the basis for walking the 27*4882a593Smuzhiyunnext path element. This is inefficient and unscalable. It is inefficient 28*4882a593Smuzhiyunbecause of the locks and atomic operations required for every dentry element 29*4882a593Smuzhiyunslows things down. It is not scalable because many parallel applications that 30*4882a593Smuzhiyunare path-walk intensive tend to do path lookups starting from a common dentry 31*4882a593Smuzhiyun(usually, the root "/" or current working directory). So contention on these 32*4882a593Smuzhiyuncommon path elements causes lock and cacheline queueing. 33*4882a593Smuzhiyun 34*4882a593SmuzhiyunSince 2.6.38, RCU is used to make a significant part of the entire path walk 35*4882a593Smuzhiyun(including dcache look-up) completely "store-free" (so, no locks, atomics, or 36*4882a593Smuzhiyuneven stores into cachelines of common dentries). This is known as "rcu-walk" 37*4882a593Smuzhiyunpath walking. 38*4882a593Smuzhiyun 39*4882a593SmuzhiyunPath walking overview 40*4882a593Smuzhiyun===================== 41*4882a593Smuzhiyun 42*4882a593SmuzhiyunA name string specifies a start (root directory, cwd, fd-relative) and a 43*4882a593Smuzhiyunsequence of elements (directory entry names), which together refer to a path in 44*4882a593Smuzhiyunthe namespace. A path is represented as a (dentry, vfsmount) tuple. The name 45*4882a593Smuzhiyunelements are sub-strings, separated by '/'. 46*4882a593Smuzhiyun 47*4882a593SmuzhiyunName lookups will want to find a particular path that a name string refers to 48*4882a593Smuzhiyun(usually the final element, or parent of final element). This is done by taking 49*4882a593Smuzhiyunthe path given by the name's starting point (which we know in advance -- eg. 50*4882a593Smuzhiyuncurrent->fs->cwd or current->fs->root) as the first parent of the lookup. Then 51*4882a593Smuzhiyuniteratively for each subsequent name element, look up the child of the current 52*4882a593Smuzhiyunparent with the given name and if it is not the desired entry, make it the 53*4882a593Smuzhiyunparent for the next lookup. 54*4882a593Smuzhiyun 55*4882a593SmuzhiyunA parent, of course, must be a directory, and we must have appropriate 56*4882a593Smuzhiyunpermissions on the parent inode to be able to walk into it. 57*4882a593Smuzhiyun 58*4882a593SmuzhiyunTurning the child into a parent for the next lookup requires more checks and 59*4882a593Smuzhiyunprocedures. Symlinks essentially substitute the symlink name for the target 60*4882a593Smuzhiyunname in the name string, and require some recursive path walking. Mount points 61*4882a593Smuzhiyunmust be followed into (thus changing the vfsmount that subsequent path elements 62*4882a593Smuzhiyunrefer to), switching from the mount point path to the root of the particular 63*4882a593Smuzhiyunmounted vfsmount. These behaviours are variously modified depending on the 64*4882a593Smuzhiyunexact path walking flags. 65*4882a593Smuzhiyun 66*4882a593SmuzhiyunPath walking then must, broadly, do several particular things: 67*4882a593Smuzhiyun- find the start point of the walk; 68*4882a593Smuzhiyun- perform permissions and validity checks on inodes; 69*4882a593Smuzhiyun- perform dcache hash name lookups on (parent, name element) tuples; 70*4882a593Smuzhiyun- traverse mount points; 71*4882a593Smuzhiyun- traverse symlinks; 72*4882a593Smuzhiyun- lookup and create missing parts of the path on demand. 73*4882a593Smuzhiyun 74*4882a593SmuzhiyunSafe store-free look-up of dcache hash table 75*4882a593Smuzhiyun============================================ 76*4882a593Smuzhiyun 77*4882a593SmuzhiyunDcache name lookup 78*4882a593Smuzhiyun------------------ 79*4882a593SmuzhiyunIn order to lookup a dcache (parent, name) tuple, we take a hash on the tuple 80*4882a593Smuzhiyunand use that to select a bucket in the dcache-hash table. The list of entries 81*4882a593Smuzhiyunin that bucket is then walked, and we do a full comparison of each entry 82*4882a593Smuzhiyunagainst our (parent, name) tuple. 83*4882a593Smuzhiyun 84*4882a593SmuzhiyunThe hash lists are RCU protected, so list walking is not serialised with 85*4882a593Smuzhiyunconcurrent updates (insertion, deletion from the hash). This is a standard RCU 86*4882a593Smuzhiyunlist application with the exception of renames, which will be covered below. 87*4882a593Smuzhiyun 88*4882a593SmuzhiyunParent and name members of a dentry, as well as its membership in the dcache 89*4882a593Smuzhiyunhash, and its inode are protected by the per-dentry d_lock spinlock. A 90*4882a593Smuzhiyunreference is taken on the dentry (while the fields are verified under d_lock), 91*4882a593Smuzhiyunand this stabilises its d_inode pointer and actual inode. This gives a stable 92*4882a593Smuzhiyunpoint to perform the next step of our path walk against. 93*4882a593Smuzhiyun 94*4882a593SmuzhiyunThese members are also protected by d_seq seqlock, although this offers 95*4882a593Smuzhiyunread-only protection and no durability of results, so care must be taken when 96*4882a593Smuzhiyunusing d_seq for synchronisation (see seqcount based lookups, below). 97*4882a593Smuzhiyun 98*4882a593SmuzhiyunRenames 99*4882a593Smuzhiyun------- 100*4882a593SmuzhiyunBack to the rename case. In usual RCU protected lists, the only operations that 101*4882a593Smuzhiyunwill happen to an object is insertion, and then eventually removal from the 102*4882a593Smuzhiyunlist. The object will not be reused until an RCU grace period is complete. 103*4882a593SmuzhiyunThis ensures the RCU list traversal primitives can run over the object without 104*4882a593Smuzhiyunproblems (see RCU documentation for how this works). 105*4882a593Smuzhiyun 106*4882a593SmuzhiyunHowever when a dentry is renamed, its hash value can change, requiring it to be 107*4882a593Smuzhiyunmoved to a new hash list. Allocating and inserting a new alias would be 108*4882a593Smuzhiyunexpensive and also problematic for directory dentries. Latency would be far to 109*4882a593Smuzhiyunhigh to wait for a grace period after removing the dentry and before inserting 110*4882a593Smuzhiyunit in the new hash bucket. So what is done is to insert the dentry into the 111*4882a593Smuzhiyunnew list immediately. 112*4882a593Smuzhiyun 113*4882a593SmuzhiyunHowever, when the dentry's list pointers are updated to point to objects in the 114*4882a593Smuzhiyunnew list before waiting for a grace period, this can result in a concurrent RCU 115*4882a593Smuzhiyunlookup of the old list veering off into the new (incorrect) list and missing 116*4882a593Smuzhiyunthe remaining dentries on the list. 117*4882a593Smuzhiyun 118*4882a593SmuzhiyunThere is no fundamental problem with walking down the wrong list, because the 119*4882a593Smuzhiyundentry comparisons will never match. However it is fatal to miss a matching 120*4882a593Smuzhiyundentry. So a seqlock is used to detect when a rename has occurred, and so the 121*4882a593Smuzhiyunlookup can be retried. 122*4882a593Smuzhiyun 123*4882a593Smuzhiyun 1 2 3 124*4882a593Smuzhiyun +---+ +---+ +---+ 125*4882a593Smuzhiyunhlist-->| N-+->| N-+->| N-+-> 126*4882a593Smuzhiyunhead <--+-P |<-+-P |<-+-P | 127*4882a593Smuzhiyun +---+ +---+ +---+ 128*4882a593Smuzhiyun 129*4882a593SmuzhiyunRename of dentry 2 may require it deleted from the above list, and inserted 130*4882a593Smuzhiyuninto a new list. Deleting 2 gives the following list. 131*4882a593Smuzhiyun 132*4882a593Smuzhiyun 1 3 133*4882a593Smuzhiyun +---+ +---+ (don't worry, the longer pointers do not 134*4882a593Smuzhiyunhlist-->| N-+-------->| N-+-> impose a measurable performance overhead 135*4882a593Smuzhiyunhead <--+-P |<--------+-P | on modern CPUs) 136*4882a593Smuzhiyun +---+ +---+ 137*4882a593Smuzhiyun ^ 2 ^ 138*4882a593Smuzhiyun | +---+ | 139*4882a593Smuzhiyun | | N-+----+ 140*4882a593Smuzhiyun +----+-P | 141*4882a593Smuzhiyun +---+ 142*4882a593Smuzhiyun 143*4882a593SmuzhiyunThis is a standard RCU-list deletion, which leaves the deleted object's 144*4882a593Smuzhiyunpointers intact, so a concurrent list walker that is currently looking at 145*4882a593Smuzhiyunobject 2 will correctly continue to object 3 when it is time to traverse the 146*4882a593Smuzhiyunnext object. 147*4882a593Smuzhiyun 148*4882a593SmuzhiyunHowever, when inserting object 2 onto a new list, we end up with this: 149*4882a593Smuzhiyun 150*4882a593Smuzhiyun 1 3 151*4882a593Smuzhiyun +---+ +---+ 152*4882a593Smuzhiyunhlist-->| N-+-------->| N-+-> 153*4882a593Smuzhiyunhead <--+-P |<--------+-P | 154*4882a593Smuzhiyun +---+ +---+ 155*4882a593Smuzhiyun 2 156*4882a593Smuzhiyun +---+ 157*4882a593Smuzhiyun | N-+----> 158*4882a593Smuzhiyun <----+-P | 159*4882a593Smuzhiyun +---+ 160*4882a593Smuzhiyun 161*4882a593SmuzhiyunBecause we didn't wait for a grace period, there may be a concurrent lookup 162*4882a593Smuzhiyunstill at 2. Now when it follows 2's 'next' pointer, it will walk off into 163*4882a593Smuzhiyunanother list without ever having checked object 3. 164*4882a593Smuzhiyun 165*4882a593SmuzhiyunA related, but distinctly different, issue is that of rename atomicity versus 166*4882a593Smuzhiyunlookup operations. If a file is renamed from 'A' to 'B', a lookup must only 167*4882a593Smuzhiyunfind either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup 168*4882a593Smuzhiyunof 'B' must succeed (note the reverse is not true). 169*4882a593Smuzhiyun 170*4882a593SmuzhiyunBetween deleting the dentry from the old hash list, and inserting it on the new 171*4882a593Smuzhiyunhash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same 172*4882a593Smuzhiyunrename seqlock is also used to cover this race in much the same way, by 173*4882a593Smuzhiyunretrying a negative lookup result if a rename was in progress. 174*4882a593Smuzhiyun 175*4882a593SmuzhiyunSeqcount based lookups 176*4882a593Smuzhiyun---------------------- 177*4882a593SmuzhiyunIn refcount based dcache lookups, d_lock is used to serialise access to 178*4882a593Smuzhiyunthe dentry, stabilising it while comparing its name and parent and then 179*4882a593Smuzhiyuntaking a reference count (the reference count then gives a stable place to 180*4882a593Smuzhiyunstart the next part of the path walk from). 181*4882a593Smuzhiyun 182*4882a593SmuzhiyunAs explained above, we would like to do path walking without taking locks or 183*4882a593Smuzhiyunreference counts on intermediate dentries along the path. To do this, a per 184*4882a593Smuzhiyundentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry 185*4882a593Smuzhiyunlooks like (its name, parent, and inode). That snapshot is then used to start 186*4882a593Smuzhiyunthe next part of the path walk. When loading the coherent snapshot under d_seq, 187*4882a593Smuzhiyuncare must be taken to load the members up-front, and use those pointers rather 188*4882a593Smuzhiyunthan reloading from the dentry later on (otherwise we'd have interesting things 189*4882a593Smuzhiyunlike d_inode going NULL underneath us, if the name was unlinked). 190*4882a593Smuzhiyun 191*4882a593SmuzhiyunAlso important is to avoid performing any destructive operations (pretty much: 192*4882a593Smuzhiyunno non-atomic stores to shared data), and to recheck the seqcount when we are 193*4882a593Smuzhiyun"done" with the operation. Retry or abort if the seqcount does not match. 194*4882a593SmuzhiyunAvoiding destructive or changing operations means we can easily unwind from 195*4882a593Smuzhiyunfailure. 196*4882a593Smuzhiyun 197*4882a593SmuzhiyunWhat this means is that a caller, provided they are holding RCU lock to 198*4882a593Smuzhiyunprotect the dentry object from disappearing, can perform a seqcount based 199*4882a593Smuzhiyunlookup which does not increment the refcount on the dentry or write to 200*4882a593Smuzhiyunit in any way. This returned dentry can be used for subsequent operations, 201*4882a593Smuzhiyunprovided that d_seq is rechecked after that operation is complete. 202*4882a593Smuzhiyun 203*4882a593SmuzhiyunInodes are also rcu freed, so the seqcount lookup dentry's inode may also be 204*4882a593Smuzhiyunqueried for permissions. 205*4882a593Smuzhiyun 206*4882a593SmuzhiyunWith this two parts of the puzzle, we can do path lookups without taking 207*4882a593Smuzhiyunlocks or refcounts on dentry elements. 208*4882a593Smuzhiyun 209*4882a593SmuzhiyunRCU-walk path walking design 210*4882a593Smuzhiyun============================ 211*4882a593Smuzhiyun 212*4882a593SmuzhiyunPath walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk 213*4882a593Smuzhiyunis the traditional[*] way of performing dcache lookups using d_lock to 214*4882a593Smuzhiyunserialise concurrent modifications to the dentry and take a reference count on 215*4882a593Smuzhiyunit. ref-walk is simple and obvious, and may sleep, take locks, etc while path 216*4882a593Smuzhiyunwalking is operating on each dentry. rcu-walk uses seqcount based dentry 217*4882a593Smuzhiyunlookups, and can perform lookup of intermediate elements without any stores to 218*4882a593Smuzhiyunshared data in the dentry or inode. rcu-walk can not be applied to all cases, 219*4882a593Smuzhiyuneg. if the filesystem must sleep or perform non trivial operations, rcu-walk 220*4882a593Smuzhiyunmust be switched to ref-walk mode. 221*4882a593Smuzhiyun 222*4882a593Smuzhiyun[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full 223*4882a593Smuzhiyun path walk. 224*4882a593Smuzhiyun 225*4882a593SmuzhiyunWhere ref-walk uses a stable, refcounted ``parent'' to walk the remaining 226*4882a593Smuzhiyunpath string, rcu-walk uses a d_seq protected snapshot. When looking up a 227*4882a593Smuzhiyunchild of this parent snapshot, we open d_seq critical section on the child 228*4882a593Smuzhiyunbefore closing d_seq critical section on the parent. This gives an interlocking 229*4882a593Smuzhiyunladder of snapshots to walk down. 230*4882a593Smuzhiyun 231*4882a593Smuzhiyun 232*4882a593Smuzhiyun proc 101 233*4882a593Smuzhiyun /----------------\ 234*4882a593Smuzhiyun / comm: "vi" \ 235*4882a593Smuzhiyun / fs.root: dentry0 \ 236*4882a593Smuzhiyun \ fs.cwd: dentry2 / 237*4882a593Smuzhiyun \ / 238*4882a593Smuzhiyun \----------------/ 239*4882a593Smuzhiyun 240*4882a593SmuzhiyunSo when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will 241*4882a593Smuzhiyunstart from current->fs->root, which is a pinned dentry. Alternatively, 242*4882a593Smuzhiyun"./test.c" would start from cwd; both names refer to the same path in 243*4882a593Smuzhiyunthe context of proc101. 244*4882a593Smuzhiyun 245*4882a593Smuzhiyun dentry 0 246*4882a593Smuzhiyun +---------------------+ rcu-walk begins here, we note d_seq, check the 247*4882a593Smuzhiyun | name: "/" | inode's permission, and then look up the next 248*4882a593Smuzhiyun | inode: 10 | path element which is "home"... 249*4882a593Smuzhiyun | children:"home", ...| 250*4882a593Smuzhiyun +---------------------+ 251*4882a593Smuzhiyun | 252*4882a593Smuzhiyun dentry 1 V 253*4882a593Smuzhiyun +---------------------+ ... which brings us here. We find dentry1 via 254*4882a593Smuzhiyun | name: "home" | hash lookup, then note d_seq and compare name 255*4882a593Smuzhiyun | inode: 678 | string and parent pointer. When we have a match, 256*4882a593Smuzhiyun | children:"npiggin" | we now recheck the d_seq of dentry0. Then we 257*4882a593Smuzhiyun +---------------------+ check inode and look up the next element. 258*4882a593Smuzhiyun | 259*4882a593Smuzhiyun dentry2 V 260*4882a593Smuzhiyun +---------------------+ Note: if dentry0 is now modified, lookup is 261*4882a593Smuzhiyun | name: "npiggin" | not necessarily invalid, so we need only keep a 262*4882a593Smuzhiyun | inode: 543 | parent for d_seq verification, and grandparents 263*4882a593Smuzhiyun | children:"a.c", ... | can be forgotten. 264*4882a593Smuzhiyun +---------------------+ 265*4882a593Smuzhiyun | 266*4882a593Smuzhiyun dentry3 V 267*4882a593Smuzhiyun +---------------------+ At this point we have our destination dentry. 268*4882a593Smuzhiyun | name: "a.c" | We now take its d_lock, verify d_seq of this 269*4882a593Smuzhiyun | inode: 14221 | dentry. If that checks out, we can increment 270*4882a593Smuzhiyun | children:NULL | its refcount because we're holding d_lock. 271*4882a593Smuzhiyun +---------------------+ 272*4882a593Smuzhiyun 273*4882a593SmuzhiyunTaking a refcount on a dentry from rcu-walk mode, by taking its d_lock, 274*4882a593Smuzhiyunre-checking its d_seq, and then incrementing its refcount is called 275*4882a593Smuzhiyun"dropping rcu" or dropping from rcu-walk into ref-walk mode. 276*4882a593Smuzhiyun 277*4882a593SmuzhiyunIt is, in some sense, a bit of a house of cards. If the seqcount check of the 278*4882a593Smuzhiyunparent snapshot fails, the house comes down, because we had closed the d_seq 279*4882a593Smuzhiyunsection on the grandparent, so we have nothing left to stand on. In that case, 280*4882a593Smuzhiyunthe path walk must be fully restarted (which we do in ref-walk mode, to avoid 281*4882a593Smuzhiyunlive locks). It is costly to have a full restart, but fortunately they are 282*4882a593Smuzhiyunquite rare. 283*4882a593Smuzhiyun 284*4882a593SmuzhiyunWhen we reach a point where sleeping is required, or a filesystem callout 285*4882a593Smuzhiyunrequires ref-walk, then instead of restarting the walk, we attempt to drop rcu 286*4882a593Smuzhiyunat the last known good dentry we have. Avoiding a full restart in ref-walk in 287*4882a593Smuzhiyunthese cases is fundamental for performance and scalability because blocking 288*4882a593Smuzhiyunoperations such as creates and unlinks are not uncommon. 289*4882a593Smuzhiyun 290*4882a593SmuzhiyunThe detailed design for rcu-walk is like this: 291*4882a593Smuzhiyun* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. 292*4882a593Smuzhiyun* Take the RCU lock for the entire path walk, starting with the acquiring 293*4882a593Smuzhiyun of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are 294*4882a593Smuzhiyun not required for dentry persistence. 295*4882a593Smuzhiyun* synchronize_rcu is called when unregistering a filesystem, so we can 296*4882a593Smuzhiyun access d_ops and i_ops during rcu-walk. 297*4882a593Smuzhiyun* Similarly take the vfsmount lock for the entire path walk. So now mnt 298*4882a593Smuzhiyun refcounts are not required for persistence. Also we are free to perform mount 299*4882a593Smuzhiyun lookups, and to assume dentry mount points and mount roots are stable up and 300*4882a593Smuzhiyun down the path. 301*4882a593Smuzhiyun* Have a per-dentry seqlock to protect the dentry name, parent, and inode, 302*4882a593Smuzhiyun so we can load this tuple atomically, and also check whether any of its 303*4882a593Smuzhiyun members have changed. 304*4882a593Smuzhiyun* Dentry lookups (based on parent, candidate string tuple) recheck the parent 305*4882a593Smuzhiyun sequence after the child is found in case anything changed in the parent 306*4882a593Smuzhiyun during the path walk. 307*4882a593Smuzhiyun* inode is also RCU protected so we can load d_inode and use the inode for 308*4882a593Smuzhiyun limited things. 309*4882a593Smuzhiyun* i_mode, i_uid, i_gid can be tested for exec permissions during path walk. 310*4882a593Smuzhiyun* i_op can be loaded. 311*4882a593Smuzhiyun* When the destination dentry is reached, drop rcu there (ie. take d_lock, 312*4882a593Smuzhiyun verify d_seq, increment refcount). 313*4882a593Smuzhiyun* If seqlock verification fails anywhere along the path, do a full restart 314*4882a593Smuzhiyun of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of 315*4882a593Smuzhiyun a better errno) to signal an rcu-walk failure. 316*4882a593Smuzhiyun 317*4882a593SmuzhiyunThe cases where rcu-walk cannot continue are: 318*4882a593Smuzhiyun* NULL dentry (ie. any uncached path element) 319*4882a593Smuzhiyun* Following links 320*4882a593Smuzhiyun 321*4882a593SmuzhiyunIt may be possible eventually to make following links rcu-walk aware. 322*4882a593Smuzhiyun 323*4882a593SmuzhiyunUncached path elements will always require dropping to ref-walk mode, at the 324*4882a593Smuzhiyunvery least because i_mutex needs to be grabbed, and objects allocated. 325*4882a593Smuzhiyun 326*4882a593SmuzhiyunFinal note: 327*4882a593Smuzhiyun"store-free" path walking is not strictly store free. We take vfsmount lock 328*4882a593Smuzhiyunand refcounts (both of which can be made per-cpu), and we also store to the 329*4882a593Smuzhiyunstack (which is essentially CPU-local), and we also have to take locks and 330*4882a593Smuzhiyunrefcount on final dentry. 331*4882a593Smuzhiyun 332*4882a593SmuzhiyunThe point is that shared data, where practically possible, is not locked 333*4882a593Smuzhiyunor stored into. The result is massive improvements in performance and 334*4882a593Smuzhiyunscalability of path resolution. 335*4882a593Smuzhiyun 336*4882a593Smuzhiyun 337*4882a593SmuzhiyunInteresting statistics 338*4882a593Smuzhiyun====================== 339*4882a593Smuzhiyun 340*4882a593SmuzhiyunThe following table gives rcu lookup statistics for a few simple workloads 341*4882a593Smuzhiyun(2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to 342*4882a593Smuzhiyundrop rcu that fail due to d_seq failure and requiring the entire path lookup 343*4882a593Smuzhiyunagain. Other cases are successful rcu-drops that are required before the final 344*4882a593Smuzhiyunelement, nodentry for missing dentry, revalidate for filesystem revalidate 345*4882a593Smuzhiyunroutine requiring rcu drop, permission for permission check requiring drop, 346*4882a593Smuzhiyunand link for symlink traversal requiring drop. 347*4882a593Smuzhiyun 348*4882a593Smuzhiyun rcu-lookups restart nodentry link revalidate permission 349*4882a593Smuzhiyunbootup 47121 0 4624 1010 10283 7852 350*4882a593Smuzhiyundbench 25386793 0 6778659(26.7%) 55 549 1156 351*4882a593Smuzhiyunkbuild 2696672 10 64442(2.3%) 108764(4.0%) 1 1590 352*4882a593Smuzhiyungit diff 39605 0 28 2 0 106 353*4882a593Smuzhiyunvfstest 24185492 4945 708725(2.9%) 1076136(4.4%) 0 2651 354*4882a593Smuzhiyun 355*4882a593SmuzhiyunWhat this shows is that failed rcu-walk lookups, ie. ones that are restarted 356*4882a593Smuzhiyunentirely with ref-walk, are quite rare. Even the "vfstest" case which 357*4882a593Smuzhiyunspecifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise 358*4882a593Smuzhiyunsuch races is not showing a huge amount of restarts. 359*4882a593Smuzhiyun 360*4882a593SmuzhiyunDropping from rcu-walk to ref-walk mean that we have encountered a dentry where 361*4882a593Smuzhiyunthe reference count needs to be taken for some reason. This is either because 362*4882a593Smuzhiyunwe have reached the target of the path walk, or because we have encountered a 363*4882a593Smuzhiyuncondition that can't be resolved in rcu-walk mode. Ideally, we drop rcu-walk 364*4882a593Smuzhiyunonly when we have reached the target dentry, so the other statistics show where 365*4882a593Smuzhiyunthis does not happen. 366*4882a593Smuzhiyun 367*4882a593SmuzhiyunNote that a graceful drop from rcu-walk mode due to something such as the 368*4882a593Smuzhiyundentry not existing (which can be common) is not necessarily a failure of 369*4882a593Smuzhiyunrcu-walk scheme, because some elements of the path may have been walked in 370*4882a593Smuzhiyunrcu-walk mode. The further we get from common path elements (such as cwd or 371*4882a593Smuzhiyunroot), the less contended the dentry is likely to be. The closer we are to 372*4882a593Smuzhiyuncommon path elements, the more likely they will exist in dentry cache. 373*4882a593Smuzhiyun 374*4882a593Smuzhiyun 375*4882a593SmuzhiyunPapers and other documentation on dcache locking 376*4882a593Smuzhiyun================================================ 377*4882a593Smuzhiyun 378*4882a593Smuzhiyun1. Scaling dcache with RCU (https://linuxjournal.com/article.php?sid=7124). 379*4882a593Smuzhiyun 380*4882a593Smuzhiyun2. http://lse.sourceforge.net/locking/dcache/dcache.html 381*4882a593Smuzhiyun 382*4882a593Smuzhiyun3. path-lookup.md in this directory. 383