1*4882a593Smuzhiyun=============== 2*4882a593SmuzhiyunPathname lookup 3*4882a593Smuzhiyun=============== 4*4882a593Smuzhiyun 5*4882a593SmuzhiyunThis write-up is based on three articles published at lwn.net: 6*4882a593Smuzhiyun 7*4882a593Smuzhiyun- <https://lwn.net/Articles/649115/> Pathname lookup in Linux 8*4882a593Smuzhiyun- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux 9*4882a593Smuzhiyun- <https://lwn.net/Articles/650786/> A walk among the symlinks 10*4882a593Smuzhiyun 11*4882a593SmuzhiyunWritten by Neil Brown with help from Al Viro and Jon Corbet. 12*4882a593SmuzhiyunIt has subsequently been updated to reflect changes in the kernel 13*4882a593Smuzhiyunincluding: 14*4882a593Smuzhiyun 15*4882a593Smuzhiyun- per-directory parallel name lookup. 16*4882a593Smuzhiyun- ``openat2()`` resolution restriction flags. 17*4882a593Smuzhiyun 18*4882a593SmuzhiyunIntroduction to pathname lookup 19*4882a593Smuzhiyun=============================== 20*4882a593Smuzhiyun 21*4882a593SmuzhiyunThe most obvious aspect of pathname lookup, which very little 22*4882a593Smuzhiyunexploration is needed to discover, is that it is complex. There are 23*4882a593Smuzhiyunmany rules, special cases, and implementation alternatives that all 24*4882a593Smuzhiyuncombine to confuse the unwary reader. Computer science has long been 25*4882a593Smuzhiyunacquainted with such complexity and has tools to help manage it. One 26*4882a593Smuzhiyuntool that we will make extensive use of is "divide and conquer". For 27*4882a593Smuzhiyunthe early parts of the analysis we will divide off symlinks - leaving 28*4882a593Smuzhiyunthem until the final part. Well before we get to symlinks we have 29*4882a593Smuzhiyunanother major division based on the VFS's approach to locking which 30*4882a593Smuzhiyunwill allow us to review "REF-walk" and "RCU-walk" separately. But we 31*4882a593Smuzhiyunare getting ahead of ourselves. There are some important low level 32*4882a593Smuzhiyundistinctions we need to clarify first. 33*4882a593Smuzhiyun 34*4882a593SmuzhiyunThere are two sorts of ... 35*4882a593Smuzhiyun-------------------------- 36*4882a593Smuzhiyun 37*4882a593Smuzhiyun.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html 38*4882a593Smuzhiyun 39*4882a593SmuzhiyunPathnames (sometimes "file names"), used to identify objects in the 40*4882a593Smuzhiyunfilesystem, will be familiar to most readers. They contain two sorts 41*4882a593Smuzhiyunof elements: "slashes" that are sequences of one or more "``/``" 42*4882a593Smuzhiyuncharacters, and "components" that are sequences of one or more 43*4882a593Smuzhiyunnon-"``/``" characters. These form two kinds of paths. Those that 44*4882a593Smuzhiyunstart with slashes are "absolute" and start from the filesystem root. 45*4882a593SmuzhiyunThe others are "relative" and start from the current directory, or 46*4882a593Smuzhiyunfrom some other location specified by a file descriptor given to 47*4882a593Smuzhiyun"``*at()``" system calls such as `openat() <openat_>`_. 48*4882a593Smuzhiyun 49*4882a593Smuzhiyun.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html 50*4882a593Smuzhiyun 51*4882a593SmuzhiyunIt is tempting to describe the second kind as starting with a 52*4882a593Smuzhiyuncomponent, but that isn't always accurate: a pathname can lack both 53*4882a593Smuzhiyunslashes and components, it can be empty, in other words. This is 54*4882a593Smuzhiyungenerally forbidden in POSIX, but some of those "``*at()``" system calls 55*4882a593Smuzhiyunin Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For 56*4882a593Smuzhiyunexample, if you have an open file descriptor on an executable file you 57*4882a593Smuzhiyuncan execute it by calling `execveat() <execveat_>`_ passing 58*4882a593Smuzhiyunthe file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag. 59*4882a593Smuzhiyun 60*4882a593SmuzhiyunThese paths can be divided into two sections: the final component and 61*4882a593Smuzhiyuneverything else. The "everything else" is the easy bit. In all cases 62*4882a593Smuzhiyunit must identify a directory that already exists, otherwise an error 63*4882a593Smuzhiyunsuch as ``ENOENT`` or ``ENOTDIR`` will be reported. 64*4882a593Smuzhiyun 65*4882a593SmuzhiyunThe final component is not so simple. Not only do different system 66*4882a593Smuzhiyuncalls interpret it quite differently (e.g. some create it, some do 67*4882a593Smuzhiyunnot), but it might not even exist: neither the empty pathname nor the 68*4882a593Smuzhiyunpathname that is just slashes have a final component. If it does 69*4882a593Smuzhiyunexist, it could be "``.``" or "``..``" which are handled quite differently 70*4882a593Smuzhiyunfrom other components. 71*4882a593Smuzhiyun 72*4882a593Smuzhiyun.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12 73*4882a593Smuzhiyun 74*4882a593SmuzhiyunIf a pathname ends with a slash, such as "``/tmp/foo/``" it might be 75*4882a593Smuzhiyuntempting to consider that to have an empty final component. In many 76*4882a593Smuzhiyunways that would lead to correct results, but not always. In 77*4882a593Smuzhiyunparticular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named 78*4882a593Smuzhiyunby the final component, and they are required to work with pathnames 79*4882a593Smuzhiyunending in "``/``". According to POSIX_: 80*4882a593Smuzhiyun 81*4882a593Smuzhiyun A pathname that contains at least one non-<slash> character and 82*4882a593Smuzhiyun that ends with one or more trailing <slash> characters shall not 83*4882a593Smuzhiyun be resolved successfully unless the last pathname component before 84*4882a593Smuzhiyun the trailing <slash> characters names an existing directory or a 85*4882a593Smuzhiyun directory entry that is to be created for a directory immediately 86*4882a593Smuzhiyun after the pathname is resolved. 87*4882a593Smuzhiyun 88*4882a593SmuzhiyunThe Linux pathname walking code (mostly in ``fs/namei.c``) deals with 89*4882a593Smuzhiyunall of these issues: breaking the path into components, handling the 90*4882a593Smuzhiyun"everything else" quite separately from the final component, and 91*4882a593Smuzhiyunchecking that the trailing slash is not used where it isn't 92*4882a593Smuzhiyunpermitted. It also addresses the important issue of concurrent 93*4882a593Smuzhiyunaccess. 94*4882a593Smuzhiyun 95*4882a593SmuzhiyunWhile one process is looking up a pathname, another might be making 96*4882a593Smuzhiyunchanges that affect that lookup. One fairly extreme case is that if 97*4882a593Smuzhiyun"a/b" were renamed to "a/c/b" while another process were looking up 98*4882a593Smuzhiyun"a/b/..", that process might successfully resolve on "a/c". 99*4882a593SmuzhiyunMost races are much more subtle, and a big part of the task of 100*4882a593Smuzhiyunpathname lookup is to prevent them from having damaging effects. Many 101*4882a593Smuzhiyunof the possible races are seen most clearly in the context of the 102*4882a593Smuzhiyun"dcache" and an understanding of that is central to understanding 103*4882a593Smuzhiyunpathname lookup. 104*4882a593Smuzhiyun 105*4882a593SmuzhiyunMore than just a cache 106*4882a593Smuzhiyun---------------------- 107*4882a593Smuzhiyun 108*4882a593SmuzhiyunThe "dcache" caches information about names in each filesystem to 109*4882a593Smuzhiyunmake them quickly available for lookup. Each entry (known as a 110*4882a593Smuzhiyun"dentry") contains three significant fields: a component name, a 111*4882a593Smuzhiyunpointer to a parent dentry, and a pointer to the "inode" which 112*4882a593Smuzhiyuncontains further information about the object in that parent with 113*4882a593Smuzhiyunthe given name. The inode pointer can be ``NULL`` indicating that the 114*4882a593Smuzhiyunname doesn't exist in the parent. While there can be linkage in the 115*4882a593Smuzhiyundentry of a directory to the dentries of the children, that linkage is 116*4882a593Smuzhiyunnot used for pathname lookup, and so will not be considered here. 117*4882a593Smuzhiyun 118*4882a593SmuzhiyunThe dcache has a number of uses apart from accelerating lookup. One 119*4882a593Smuzhiyunthat will be particularly relevant is that it is closely integrated 120*4882a593Smuzhiyunwith the mount table that records which filesystem is mounted where. 121*4882a593SmuzhiyunWhat the mount table actually stores is which dentry is mounted on top 122*4882a593Smuzhiyunof which other dentry. 123*4882a593Smuzhiyun 124*4882a593SmuzhiyunWhen considering the dcache, we have another of our "two types" 125*4882a593Smuzhiyundistinctions: there are two types of filesystems. 126*4882a593Smuzhiyun 127*4882a593SmuzhiyunSome filesystems ensure that the information in the dcache is always 128*4882a593Smuzhiyuncompletely accurate (though not necessarily complete). This can allow 129*4882a593Smuzhiyunthe VFS to determine if a particular file does or doesn't exist 130*4882a593Smuzhiyunwithout checking with the filesystem, and means that the VFS can 131*4882a593Smuzhiyunprotect the filesystem against certain races and other problems. 132*4882a593SmuzhiyunThese are typically "local" filesystems such as ext3, XFS, and Btrfs. 133*4882a593Smuzhiyun 134*4882a593SmuzhiyunOther filesystems don't provide that guarantee because they cannot. 135*4882a593SmuzhiyunThese are typically filesystems that are shared across a network, 136*4882a593Smuzhiyunwhether remote filesystems like NFS and 9P, or cluster filesystems 137*4882a593Smuzhiyunlike ocfs2 or cephfs. These filesystems allow the VFS to revalidate 138*4882a593Smuzhiyuncached information, and must provide their own protection against 139*4882a593Smuzhiyunawkward races. The VFS can detect these filesystems by the 140*4882a593Smuzhiyun``DCACHE_OP_REVALIDATE`` flag being set in the dentry. 141*4882a593Smuzhiyun 142*4882a593SmuzhiyunREF-walk: simple concurrency management with refcounts and spinlocks 143*4882a593Smuzhiyun-------------------------------------------------------------------- 144*4882a593Smuzhiyun 145*4882a593SmuzhiyunWith all of those divisions carefully classified, we can now start 146*4882a593Smuzhiyunlooking at the actual process of walking along a path. In particular 147*4882a593Smuzhiyunwe will start with the handling of the "everything else" part of a 148*4882a593Smuzhiyunpathname, and focus on the "REF-walk" approach to concurrency 149*4882a593Smuzhiyunmanagement. This code is found in the ``link_path_walk()`` function, if 150*4882a593Smuzhiyunyou ignore all the places that only run when "``LOOKUP_RCU``" 151*4882a593Smuzhiyun(indicating the use of RCU-walk) is set. 152*4882a593Smuzhiyun 153*4882a593Smuzhiyun.. _Meet the Lockers: https://lwn.net/Articles/453685/ 154*4882a593Smuzhiyun 155*4882a593SmuzhiyunREF-walk is fairly heavy-handed with locks and reference counts. Not 156*4882a593Smuzhiyunas heavy-handed as in the old "big kernel lock" days, but certainly not 157*4882a593Smuzhiyunafraid of taking a lock when one is needed. It uses a variety of 158*4882a593Smuzhiyundifferent concurrency controls. A background understanding of the 159*4882a593Smuzhiyunvarious primitives is assumed, or can be gleaned from elsewhere such 160*4882a593Smuzhiyunas in `Meet the Lockers`_. 161*4882a593Smuzhiyun 162*4882a593SmuzhiyunThe locking mechanisms used by REF-walk include: 163*4882a593Smuzhiyun 164*4882a593Smuzhiyundentry->d_lockref 165*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~ 166*4882a593Smuzhiyun 167*4882a593SmuzhiyunThis uses the lockref primitive to provide both a spinlock and a 168*4882a593Smuzhiyunreference count. The special-sauce of this primitive is that the 169*4882a593Smuzhiyunconceptual sequence "lock; inc_ref; unlock;" can often be performed 170*4882a593Smuzhiyunwith a single atomic memory operation. 171*4882a593Smuzhiyun 172*4882a593SmuzhiyunHolding a reference on a dentry ensures that the dentry won't suddenly 173*4882a593Smuzhiyunbe freed and used for something else, so the values in various fields 174*4882a593Smuzhiyunwill behave as expected. It also protects the ``->d_inode`` reference 175*4882a593Smuzhiyunto the inode to some extent. 176*4882a593Smuzhiyun 177*4882a593SmuzhiyunThe association between a dentry and its inode is fairly permanent. 178*4882a593SmuzhiyunFor example, when a file is renamed, the dentry and inode move 179*4882a593Smuzhiyuntogether to the new location. When a file is created the dentry will 180*4882a593Smuzhiyuninitially be negative (i.e. ``d_inode`` is ``NULL``), and will be assigned 181*4882a593Smuzhiyunto the new inode as part of the act of creation. 182*4882a593Smuzhiyun 183*4882a593SmuzhiyunWhen a file is deleted, this can be reflected in the cache either by 184*4882a593Smuzhiyunsetting ``d_inode`` to ``NULL``, or by removing it from the hash table 185*4882a593Smuzhiyun(described shortly) used to look up the name in the parent directory. 186*4882a593SmuzhiyunIf the dentry is still in use the second option is used as it is 187*4882a593Smuzhiyunperfectly legal to keep using an open file after it has been deleted 188*4882a593Smuzhiyunand having the dentry around helps. If the dentry is not otherwise in 189*4882a593Smuzhiyunuse (i.e. if the refcount in ``d_lockref`` is one), only then will 190*4882a593Smuzhiyun``d_inode`` be set to ``NULL``. Doing it this way is more efficient for a 191*4882a593Smuzhiyunvery common case. 192*4882a593Smuzhiyun 193*4882a593SmuzhiyunSo as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode`` 194*4882a593Smuzhiyunvalue will never be changed. 195*4882a593Smuzhiyun 196*4882a593Smuzhiyundentry->d_lock 197*4882a593Smuzhiyun~~~~~~~~~~~~~~ 198*4882a593Smuzhiyun 199*4882a593Smuzhiyun``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above. 200*4882a593SmuzhiyunFor our purposes, holding this lock protects against the dentry being 201*4882a593Smuzhiyunrenamed or unlinked. In particular, its parent (``d_parent``), and its 202*4882a593Smuzhiyunname (``d_name``) cannot be changed, and it cannot be removed from the 203*4882a593Smuzhiyundentry hash table. 204*4882a593Smuzhiyun 205*4882a593SmuzhiyunWhen looking for a name in a directory, REF-walk takes ``d_lock`` on 206*4882a593Smuzhiyuneach candidate dentry that it finds in the hash table and then checks 207*4882a593Smuzhiyunthat the parent and name are correct. So it doesn't lock the parent 208*4882a593Smuzhiyunwhile searching in the cache; it only locks children. 209*4882a593Smuzhiyun 210*4882a593SmuzhiyunWhen looking for the parent for a given name (to handle "``..``"), 211*4882a593SmuzhiyunREF-walk can take ``d_lock`` to get a stable reference to ``d_parent``, 212*4882a593Smuzhiyunbut it first tries a more lightweight approach. As seen in 213*4882a593Smuzhiyun``dget_parent()``, if a reference can be claimed on the parent, and if 214*4882a593Smuzhiyunsubsequently ``d_parent`` can be seen to have not changed, then there is 215*4882a593Smuzhiyunno need to actually take the lock on the child. 216*4882a593Smuzhiyun 217*4882a593Smuzhiyunrename_lock 218*4882a593Smuzhiyun~~~~~~~~~~~ 219*4882a593Smuzhiyun 220*4882a593SmuzhiyunLooking up a given name in a given directory involves computing a hash 221*4882a593Smuzhiyunfrom the two values (the name and the dentry of the directory), 222*4882a593Smuzhiyunaccessing that slot in a hash table, and searching the linked list 223*4882a593Smuzhiyunthat is found there. 224*4882a593Smuzhiyun 225*4882a593SmuzhiyunWhen a dentry is renamed, the name and the parent dentry can both 226*4882a593Smuzhiyunchange so the hash will almost certainly change too. This would move the 227*4882a593Smuzhiyundentry to a different chain in the hash table. If a filename search 228*4882a593Smuzhiyunhappened to be looking at a dentry that was moved in this way, 229*4882a593Smuzhiyunit might end up continuing the search down the wrong chain, 230*4882a593Smuzhiyunand so miss out on part of the correct chain. 231*4882a593Smuzhiyun 232*4882a593SmuzhiyunThe name-lookup process (``d_lookup()``) does *not* try to prevent this 233*4882a593Smuzhiyunfrom happening, but only to detect when it happens. 234*4882a593Smuzhiyun``rename_lock`` is a seqlock that is updated whenever any dentry is 235*4882a593Smuzhiyunrenamed. If ``d_lookup`` finds that a rename happened while it 236*4882a593Smuzhiyununsuccessfully scanned a chain in the hash table, it simply tries 237*4882a593Smuzhiyunagain. 238*4882a593Smuzhiyun 239*4882a593Smuzhiyun``rename_lock`` is also used to detect and defend against potential attacks 240*4882a593Smuzhiyunagainst ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where 241*4882a593Smuzhiyunthe parent directory is moved outside the root, bypassing the ``path_equal()`` 242*4882a593Smuzhiyuncheck). If ``rename_lock`` is updated during the lookup and the path encounters 243*4882a593Smuzhiyuna "..", a potential attack occurred and ``handle_dots()`` will bail out with 244*4882a593Smuzhiyun``-EAGAIN``. 245*4882a593Smuzhiyun 246*4882a593Smuzhiyuninode->i_rwsem 247*4882a593Smuzhiyun~~~~~~~~~~~~~~ 248*4882a593Smuzhiyun 249*4882a593Smuzhiyun``i_rwsem`` is a read/write semaphore that serializes all changes to a particular 250*4882a593Smuzhiyundirectory. This ensures that, for example, an ``unlink()`` and a ``rename()`` 251*4882a593Smuzhiyuncannot both happen at the same time. It also keeps the directory 252*4882a593Smuzhiyunstable while the filesystem is asked to look up a name that is not 253*4882a593Smuzhiyuncurrently in the dcache or, optionally, when the list of entries in a 254*4882a593Smuzhiyundirectory is being retrieved with ``readdir()``. 255*4882a593Smuzhiyun 256*4882a593SmuzhiyunThis has a complementary role to that of ``d_lock``: ``i_rwsem`` on a 257*4882a593Smuzhiyundirectory protects all of the names in that directory, while ``d_lock`` 258*4882a593Smuzhiyunon a name protects just one name in a directory. Most changes to the 259*4882a593Smuzhiyundcache hold ``i_rwsem`` on the relevant directory inode and briefly take 260*4882a593Smuzhiyun``d_lock`` on one or more the dentries while the change happens. One 261*4882a593Smuzhiyunexception is when idle dentries are removed from the dcache due to 262*4882a593Smuzhiyunmemory pressure. This uses ``d_lock``, but ``i_rwsem`` plays no role. 263*4882a593Smuzhiyun 264*4882a593SmuzhiyunThe semaphore affects pathname lookup in two distinct ways. Firstly it 265*4882a593Smuzhiyunprevents changes during lookup of a name in a directory. ``walk_component()`` uses 266*4882a593Smuzhiyun``lookup_fast()`` first which, in turn, checks to see if the name is in the cache, 267*4882a593Smuzhiyunusing only ``d_lock`` locking. If the name isn't found, then ``walk_component()`` 268*4882a593Smuzhiyunfalls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that 269*4882a593Smuzhiyunthe name isn't in the cache, and then calls in to the filesystem to get a 270*4882a593Smuzhiyundefinitive answer. A new dentry will be added to the cache regardless of 271*4882a593Smuzhiyunthe result. 272*4882a593Smuzhiyun 273*4882a593SmuzhiyunSecondly, when pathname lookup reaches the final component, it will 274*4882a593Smuzhiyunsometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so 275*4882a593Smuzhiyunthat the required exclusion can be achieved. How path lookup chooses 276*4882a593Smuzhiyunto take, or not take, ``i_rwsem`` is one of the 277*4882a593Smuzhiyunissues addressed in a subsequent section. 278*4882a593Smuzhiyun 279*4882a593SmuzhiyunIf two threads attempt to look up the same name at the same time - a 280*4882a593Smuzhiyunname that is not yet in the dcache - the shared lock on ``i_rwsem`` will 281*4882a593Smuzhiyunnot prevent them both adding new dentries with the same name. As this 282*4882a593Smuzhiyunwould result in confusion an extra level of interlocking is used, 283*4882a593Smuzhiyunbased around a secondary hash table (``in_lookup_hashtable``) and a 284*4882a593Smuzhiyunper-dentry flag bit (``DCACHE_PAR_LOOKUP``). 285*4882a593Smuzhiyun 286*4882a593SmuzhiyunTo add a new dentry to the cache while only holding a shared lock on 287*4882a593Smuzhiyun``i_rwsem``, a thread must call ``d_alloc_parallel()``. This allocates a 288*4882a593Smuzhiyundentry, stores the required name and parent in it, checks if there 289*4882a593Smuzhiyunis already a matching dentry in the primary or secondary hash 290*4882a593Smuzhiyuntables, and if not, stores the newly allocated dentry in the secondary 291*4882a593Smuzhiyunhash table, with ``DCACHE_PAR_LOOKUP`` set. 292*4882a593Smuzhiyun 293*4882a593SmuzhiyunIf a matching dentry was found in the primary hash table then that is 294*4882a593Smuzhiyunreturned and the caller can know that it lost a race with some other 295*4882a593Smuzhiyunthread adding the entry. If no matching dentry is found in either 296*4882a593Smuzhiyuncache, the newly allocated dentry is returned and the caller can 297*4882a593Smuzhiyundetect this from the presence of ``DCACHE_PAR_LOOKUP``. In this case it 298*4882a593Smuzhiyunknows that it has won any race and now is responsible for asking the 299*4882a593Smuzhiyunfilesystem to perform the lookup and find the matching inode. When 300*4882a593Smuzhiyunthe lookup is complete, it must call ``d_lookup_done()`` which clears 301*4882a593Smuzhiyunthe flag and does some other house keeping, including removing the 302*4882a593Smuzhiyundentry from the secondary hash table - it will normally have been 303*4882a593Smuzhiyunadded to the primary hash table already. Note that a ``struct 304*4882a593Smuzhiyunwaitqueue_head`` is passed to ``d_alloc_parallel()``, and 305*4882a593Smuzhiyun``d_lookup_done()`` must be called while this ``waitqueue_head`` is still 306*4882a593Smuzhiyunin scope. 307*4882a593Smuzhiyun 308*4882a593SmuzhiyunIf a matching dentry is found in the secondary hash table, 309*4882a593Smuzhiyun``d_alloc_parallel()`` has a little more work to do. It first waits for 310*4882a593Smuzhiyun``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed 311*4882a593Smuzhiyunto the instance of ``d_alloc_parallel()`` that won the race and that 312*4882a593Smuzhiyunwill be woken by the call to ``d_lookup_done()``. It then checks to see 313*4882a593Smuzhiyunif the dentry has now been added to the primary hash table. If it 314*4882a593Smuzhiyunhas, the dentry is returned and the caller just sees that it lost any 315*4882a593Smuzhiyunrace. If it hasn't been added to the primary hash table, the most 316*4882a593Smuzhiyunlikely explanation is that some other dentry was added instead using 317*4882a593Smuzhiyun``d_splice_alias()``. In any case, ``d_alloc_parallel()`` repeats all the 318*4882a593Smuzhiyunlook ups from the start and will normally return something from the 319*4882a593Smuzhiyunprimary hash table. 320*4882a593Smuzhiyun 321*4882a593Smuzhiyunmnt->mnt_count 322*4882a593Smuzhiyun~~~~~~~~~~~~~~ 323*4882a593Smuzhiyun 324*4882a593Smuzhiyun``mnt_count`` is a per-CPU reference counter on "``mount``" structures. 325*4882a593SmuzhiyunPer-CPU here means that incrementing the count is cheap as it only 326*4882a593Smuzhiyunuses CPU-local memory, but checking if the count is zero is expensive as 327*4882a593Smuzhiyunit needs to check with every CPU. Taking a ``mnt_count`` reference 328*4882a593Smuzhiyunprevents the mount structure from disappearing as the result of regular 329*4882a593Smuzhiyununmount operations, but does not prevent a "lazy" unmount. So holding 330*4882a593Smuzhiyun``mnt_count`` doesn't ensure that the mount remains in the namespace and, 331*4882a593Smuzhiyunin particular, doesn't stabilize the link to the mounted-on dentry. It 332*4882a593Smuzhiyundoes, however, ensure that the ``mount`` data structure remains coherent, 333*4882a593Smuzhiyunand it provides a reference to the root dentry of the mounted 334*4882a593Smuzhiyunfilesystem. So a reference through ``->mnt_count`` provides a stable 335*4882a593Smuzhiyunreference to the mounted dentry, but not the mounted-on dentry. 336*4882a593Smuzhiyun 337*4882a593Smuzhiyunmount_lock 338*4882a593Smuzhiyun~~~~~~~~~~ 339*4882a593Smuzhiyun 340*4882a593Smuzhiyun``mount_lock`` is a global seqlock, a bit like ``rename_lock``. It can be used to 341*4882a593Smuzhiyuncheck if any change has been made to any mount points. 342*4882a593Smuzhiyun 343*4882a593SmuzhiyunWhile walking down the tree (away from the root) this lock is used when 344*4882a593Smuzhiyuncrossing a mount point to check that the crossing was safe. That is, 345*4882a593Smuzhiyunthe value in the seqlock is read, then the code finds the mount that 346*4882a593Smuzhiyunis mounted on the current directory, if there is one, and increments 347*4882a593Smuzhiyunthe ``mnt_count``. Finally the value in ``mount_lock`` is checked against 348*4882a593Smuzhiyunthe old value. If there is no change, then the crossing was safe. If there 349*4882a593Smuzhiyunwas a change, the ``mnt_count`` is decremented and the whole process is 350*4882a593Smuzhiyunretried. 351*4882a593Smuzhiyun 352*4882a593SmuzhiyunWhen walking up the tree (towards the root) by following a ".." link, 353*4882a593Smuzhiyuna little more care is needed. In this case the seqlock (which 354*4882a593Smuzhiyuncontains both a counter and a spinlock) is fully locked to prevent 355*4882a593Smuzhiyunany changes to any mount points while stepping up. This locking is 356*4882a593Smuzhiyunneeded to stabilize the link to the mounted-on dentry, which the 357*4882a593Smuzhiyunrefcount on the mount itself doesn't ensure. 358*4882a593Smuzhiyun 359*4882a593Smuzhiyun``mount_lock`` is also used to detect and defend against potential attacks 360*4882a593Smuzhiyunagainst ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where 361*4882a593Smuzhiyunthe parent directory is moved outside the root, bypassing the ``path_equal()`` 362*4882a593Smuzhiyuncheck). If ``mount_lock`` is updated during the lookup and the path encounters 363*4882a593Smuzhiyuna "..", a potential attack occurred and ``handle_dots()`` will bail out with 364*4882a593Smuzhiyun``-EAGAIN``. 365*4882a593Smuzhiyun 366*4882a593SmuzhiyunRCU 367*4882a593Smuzhiyun~~~ 368*4882a593Smuzhiyun 369*4882a593SmuzhiyunFinally the global (but extremely lightweight) RCU read lock is held 370*4882a593Smuzhiyunfrom time to time to ensure certain data structures don't get freed 371*4882a593Smuzhiyununexpectedly. 372*4882a593Smuzhiyun 373*4882a593SmuzhiyunIn particular it is held while scanning chains in the dcache hash 374*4882a593Smuzhiyuntable, and the mount point hash table. 375*4882a593Smuzhiyun 376*4882a593SmuzhiyunBringing it together with ``struct nameidata`` 377*4882a593Smuzhiyun---------------------------------------------- 378*4882a593Smuzhiyun 379*4882a593Smuzhiyun.. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s 380*4882a593Smuzhiyun 381*4882a593SmuzhiyunThroughout the process of walking a path, the current status is stored 382*4882a593Smuzhiyunin a ``struct nameidata``, "namei" being the traditional name - dating 383*4882a593Smuzhiyunall the way back to `First Edition Unix`_ - of the function that 384*4882a593Smuzhiyunconverts a "name" to an "inode". ``struct nameidata`` contains (among 385*4882a593Smuzhiyunother fields): 386*4882a593Smuzhiyun 387*4882a593Smuzhiyun``struct path path`` 388*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~ 389*4882a593Smuzhiyun 390*4882a593SmuzhiyunA ``path`` contains a ``struct vfsmount`` (which is 391*4882a593Smuzhiyunembedded in a ``struct mount``) and a ``struct dentry``. Together these 392*4882a593Smuzhiyunrecord the current status of the walk. They start out referring to the 393*4882a593Smuzhiyunstarting point (the current working directory, the root directory, or some other 394*4882a593Smuzhiyundirectory identified by a file descriptor), and are updated on each 395*4882a593Smuzhiyunstep. A reference through ``d_lockref`` and ``mnt_count`` is always 396*4882a593Smuzhiyunheld. 397*4882a593Smuzhiyun 398*4882a593Smuzhiyun``struct qstr last`` 399*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~ 400*4882a593Smuzhiyun 401*4882a593SmuzhiyunThis is a string together with a length (i.e. *not* ``nul`` terminated) 402*4882a593Smuzhiyunthat is the "next" component in the pathname. 403*4882a593Smuzhiyun 404*4882a593Smuzhiyun``int last_type`` 405*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~ 406*4882a593Smuzhiyun 407*4882a593SmuzhiyunThis is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT`` or ``LAST_DOTDOT``. 408*4882a593SmuzhiyunThe ``last`` field is only valid if the type is ``LAST_NORM``. 409*4882a593Smuzhiyun 410*4882a593Smuzhiyun``struct path root`` 411*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~ 412*4882a593Smuzhiyun 413*4882a593SmuzhiyunThis is used to hold a reference to the effective root of the 414*4882a593Smuzhiyunfilesystem. Often that reference won't be needed, so this field is 415*4882a593Smuzhiyunonly assigned the first time it is used, or when a non-standard root 416*4882a593Smuzhiyunis requested. Keeping a reference in the ``nameidata`` ensures that 417*4882a593Smuzhiyunonly one root is in effect for the entire path walk, even if it races 418*4882a593Smuzhiyunwith a ``chroot()`` system call. 419*4882a593Smuzhiyun 420*4882a593SmuzhiyunIt should be noted that in the case of ``LOOKUP_IN_ROOT`` or 421*4882a593Smuzhiyun``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor 422*4882a593Smuzhiyunpassed to ``openat2()`` (which exposes these ``LOOKUP_`` flags). 423*4882a593Smuzhiyun 424*4882a593SmuzhiyunThe root is needed when either of two conditions holds: (1) either the 425*4882a593Smuzhiyunpathname or a symbolic link starts with a "'/'", or (2) a "``..``" 426*4882a593Smuzhiyuncomponent is being handled, since "``..``" from the root must always stay 427*4882a593Smuzhiyunat the root. The value used is usually the current root directory of 428*4882a593Smuzhiyunthe calling process. An alternate root can be provided as when 429*4882a593Smuzhiyun``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call 430*4882a593Smuzhiyun``mount_subtree()``. In each case a pathname is being looked up in a very 431*4882a593Smuzhiyunspecific part of the filesystem, and the lookup must not be allowed to 432*4882a593Smuzhiyunescape that subtree. It works a bit like a local ``chroot()``. 433*4882a593Smuzhiyun 434*4882a593SmuzhiyunIgnoring the handling of symbolic links, we can now describe the 435*4882a593Smuzhiyun"``link_path_walk()``" function, which handles the lookup of everything 436*4882a593Smuzhiyunexcept the final component as: 437*4882a593Smuzhiyun 438*4882a593Smuzhiyun Given a path (``name``) and a nameidata structure (``nd``), check that the 439*4882a593Smuzhiyun current directory has execute permission and then advance ``name`` 440*4882a593Smuzhiyun over one component while updating ``last_type`` and ``last``. If that 441*4882a593Smuzhiyun was the final component, then return, otherwise call 442*4882a593Smuzhiyun ``walk_component()`` and repeat from the top. 443*4882a593Smuzhiyun 444*4882a593Smuzhiyun``walk_component()`` is even easier. If the component is ``LAST_DOTS``, 445*4882a593Smuzhiyunit calls ``handle_dots()`` which does the necessary locking as already 446*4882a593Smuzhiyundescribed. If it finds a ``LAST_NORM`` component it first calls 447*4882a593Smuzhiyun"``lookup_fast()``" which only looks in the dcache, but will ask the 448*4882a593Smuzhiyunfilesystem to revalidate the result if it is that sort of filesystem. 449*4882a593SmuzhiyunIf that doesn't get a good result, it calls "``lookup_slow()``" which 450*4882a593Smuzhiyuntakes ``i_rwsem``, rechecks the cache, and then asks the filesystem 451*4882a593Smuzhiyunto find a definitive answer. Each of these will call 452*4882a593Smuzhiyun``follow_managed()`` (as described below) to handle any mount points. 453*4882a593Smuzhiyun 454*4882a593SmuzhiyunIn the absence of symbolic links, ``walk_component()`` creates a new 455*4882a593Smuzhiyun``struct path`` containing a counted reference to the new dentry and a 456*4882a593Smuzhiyunreference to the new ``vfsmount`` which is only counted if it is 457*4882a593Smuzhiyundifferent from the previous ``vfsmount``. It then calls 458*4882a593Smuzhiyun``path_to_nameidata()`` to install the new ``struct path`` in the 459*4882a593Smuzhiyun``struct nameidata`` and drop the unneeded references. 460*4882a593Smuzhiyun 461*4882a593SmuzhiyunThis "hand-over-hand" sequencing of getting a reference to the new 462*4882a593Smuzhiyundentry before dropping the reference to the previous dentry may 463*4882a593Smuzhiyunseem obvious, but is worth pointing out so that we will recognize its 464*4882a593Smuzhiyunanalogue in the "RCU-walk" version. 465*4882a593Smuzhiyun 466*4882a593SmuzhiyunHandling the final component 467*4882a593Smuzhiyun---------------------------- 468*4882a593Smuzhiyun 469*4882a593Smuzhiyun``link_path_walk()`` only walks as far as setting ``nd->last`` and 470*4882a593Smuzhiyun``nd->last_type`` to refer to the final component of the path. It does 471*4882a593Smuzhiyunnot call ``walk_component()`` that last time. Handling that final 472*4882a593Smuzhiyuncomponent remains for the caller to sort out. Those callers are 473*4882a593Smuzhiyun``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and 474*4882a593Smuzhiyun``path_openat()`` each of which handles the differing requirements of 475*4882a593Smuzhiyundifferent system calls. 476*4882a593Smuzhiyun 477*4882a593Smuzhiyun``path_parentat()`` is clearly the simplest - it just wraps a little bit 478*4882a593Smuzhiyunof housekeeping around ``link_path_walk()`` and returns the parent 479*4882a593Smuzhiyundirectory and final component to the caller. The caller will be either 480*4882a593Smuzhiyunaiming to create a name (via ``filename_create()``) or remove or rename 481*4882a593Smuzhiyuna name (in which case ``user_path_parent()`` is used). They will use 482*4882a593Smuzhiyun``i_rwsem`` to exclude other changes while they validate and then 483*4882a593Smuzhiyunperform their operation. 484*4882a593Smuzhiyun 485*4882a593Smuzhiyun``path_lookupat()`` is nearly as simple - it is used when an existing 486*4882a593Smuzhiyunobject is wanted such as by ``stat()`` or ``chmod()``. It essentially just 487*4882a593Smuzhiyuncalls ``walk_component()`` on the final component through a call to 488*4882a593Smuzhiyun``lookup_last()``. ``path_lookupat()`` returns just the final dentry. 489*4882a593Smuzhiyun 490*4882a593Smuzhiyun``path_mountpoint()`` handles the special case of unmounting which must 491*4882a593Smuzhiyunnot try to revalidate the mounted filesystem. It effectively 492*4882a593Smuzhiyuncontains, through a call to ``mountpoint_last()``, an alternate 493*4882a593Smuzhiyunimplementation of ``lookup_slow()`` which skips that step. This is 494*4882a593Smuzhiyunimportant when unmounting a filesystem that is inaccessible, such as 495*4882a593Smuzhiyunone provided by a dead NFS server. 496*4882a593Smuzhiyun 497*4882a593SmuzhiyunFinally ``path_openat()`` is used for the ``open()`` system call; it 498*4882a593Smuzhiyuncontains, in support functions starting with "``do_last()``", all the 499*4882a593Smuzhiyuncomplexity needed to handle the different subtleties of O_CREAT (with 500*4882a593Smuzhiyunor without O_EXCL), final "``/``" characters, and trailing symbolic 501*4882a593Smuzhiyunlinks. We will revisit this in the final part of this series, which 502*4882a593Smuzhiyunfocuses on those symbolic links. "``do_last()``" will sometimes, but 503*4882a593Smuzhiyunnot always, take ``i_rwsem``, depending on what it finds. 504*4882a593Smuzhiyun 505*4882a593SmuzhiyunEach of these, or the functions which call them, need to be alert to 506*4882a593Smuzhiyunthe possibility that the final component is not ``LAST_NORM``. If the 507*4882a593Smuzhiyungoal of the lookup is to create something, then any value for 508*4882a593Smuzhiyun``last_type`` other than ``LAST_NORM`` will result in an error. For 509*4882a593Smuzhiyunexample if ``path_parentat()`` reports ``LAST_DOTDOT``, then the caller 510*4882a593Smuzhiyunwon't try to create that name. They also check for trailing slashes 511*4882a593Smuzhiyunby testing ``last.name[last.len]``. If there is any character beyond 512*4882a593Smuzhiyunthe final component, it must be a trailing slash. 513*4882a593Smuzhiyun 514*4882a593SmuzhiyunRevalidation and automounts 515*4882a593Smuzhiyun--------------------------- 516*4882a593Smuzhiyun 517*4882a593SmuzhiyunApart from symbolic links, there are only two parts of the "REF-walk" 518*4882a593Smuzhiyunprocess not yet covered. One is the handling of stale cache entries 519*4882a593Smuzhiyunand the other is automounts. 520*4882a593Smuzhiyun 521*4882a593SmuzhiyunOn filesystems that require it, the lookup routines will call the 522*4882a593Smuzhiyun``->d_revalidate()`` dentry method to ensure that the cached information 523*4882a593Smuzhiyunis current. This will often confirm validity or update a few details 524*4882a593Smuzhiyunfrom a server. In some cases it may find that there has been change 525*4882a593Smuzhiyunfurther up the path and that something that was thought to be valid 526*4882a593Smuzhiyunpreviously isn't really. When this happens the lookup of the whole 527*4882a593Smuzhiyunpath is aborted and retried with the "``LOOKUP_REVAL``" flag set. This 528*4882a593Smuzhiyunforces revalidation to be more thorough. We will see more details of 529*4882a593Smuzhiyunthis retry process in the next article. 530*4882a593Smuzhiyun 531*4882a593SmuzhiyunAutomount points are locations in the filesystem where an attempt to 532*4882a593Smuzhiyunlookup a name can trigger changes to how that lookup should be 533*4882a593Smuzhiyunhandled, in particular by mounting a filesystem there. These are 534*4882a593Smuzhiyuncovered in greater detail in autofs.txt in the Linux documentation 535*4882a593Smuzhiyuntree, but a few notes specifically related to path lookup are in order 536*4882a593Smuzhiyunhere. 537*4882a593Smuzhiyun 538*4882a593SmuzhiyunThe Linux VFS has a concept of "managed" dentries which is reflected 539*4882a593Smuzhiyunin function names such as "``follow_managed()``". There are three 540*4882a593Smuzhiyunpotentially interesting things about these dentries corresponding 541*4882a593Smuzhiyunto three different flags that might be set in ``dentry->d_flags``: 542*4882a593Smuzhiyun 543*4882a593Smuzhiyun``DCACHE_MANAGE_TRANSIT`` 544*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~ 545*4882a593Smuzhiyun 546*4882a593SmuzhiyunIf this flag has been set, then the filesystem has requested that the 547*4882a593Smuzhiyun``d_manage()`` dentry operation be called before handling any possible 548*4882a593Smuzhiyunmount point. This can perform two particular services: 549*4882a593Smuzhiyun 550*4882a593SmuzhiyunIt can block to avoid races. If an automount point is being 551*4882a593Smuzhiyununmounted, the ``d_manage()`` function will usually wait for that 552*4882a593Smuzhiyunprocess to complete before letting the new lookup proceed and possibly 553*4882a593Smuzhiyuntrigger a new automount. 554*4882a593Smuzhiyun 555*4882a593SmuzhiyunIt can selectively allow only some processes to transit through a 556*4882a593Smuzhiyunmount point. When a server process is managing automounts, it may 557*4882a593Smuzhiyunneed to access a directory without triggering normal automount 558*4882a593Smuzhiyunprocessing. That server process can identify itself to the ``autofs`` 559*4882a593Smuzhiyunfilesystem, which will then give it a special pass through 560*4882a593Smuzhiyun``d_manage()`` by returning ``-EISDIR``. 561*4882a593Smuzhiyun 562*4882a593Smuzhiyun``DCACHE_MOUNTED`` 563*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~ 564*4882a593Smuzhiyun 565*4882a593SmuzhiyunThis flag is set on every dentry that is mounted on. As Linux 566*4882a593Smuzhiyunsupports multiple filesystem namespaces, it is possible that the 567*4882a593Smuzhiyundentry may not be mounted on in *this* namespace, just in some 568*4882a593Smuzhiyunother. So this flag is seen as a hint, not a promise. 569*4882a593Smuzhiyun 570*4882a593SmuzhiyunIf this flag is set, and ``d_manage()`` didn't return ``-EISDIR``, 571*4882a593Smuzhiyun``lookup_mnt()`` is called to examine the mount hash table (honoring the 572*4882a593Smuzhiyun``mount_lock`` described earlier) and possibly return a new ``vfsmount`` 573*4882a593Smuzhiyunand a new ``dentry`` (both with counted references). 574*4882a593Smuzhiyun 575*4882a593Smuzhiyun``DCACHE_NEED_AUTOMOUNT`` 576*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~ 577*4882a593Smuzhiyun 578*4882a593SmuzhiyunIf ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't 579*4882a593Smuzhiyunfind a mount point, then this flag causes the ``d_automount()`` dentry 580*4882a593Smuzhiyunoperation to be called. 581*4882a593Smuzhiyun 582*4882a593SmuzhiyunThe ``d_automount()`` operation can be arbitrarily complex and may 583*4882a593Smuzhiyuncommunicate with server processes etc. but it should ultimately either 584*4882a593Smuzhiyunreport that there was an error, that there was nothing to mount, or 585*4882a593Smuzhiyunshould provide an updated ``struct path`` with new ``dentry`` and ``vfsmount``. 586*4882a593Smuzhiyun 587*4882a593SmuzhiyunIn the latter case, ``finish_automount()`` will be called to safely 588*4882a593Smuzhiyuninstall the new mount point into the mount table. 589*4882a593Smuzhiyun 590*4882a593SmuzhiyunThere is no new locking of import here and it is important that no 591*4882a593Smuzhiyunlocks (only counted references) are held over this processing due to 592*4882a593Smuzhiyunthe very real possibility of extended delays. 593*4882a593SmuzhiyunThis will become more important next time when we examine RCU-walk 594*4882a593Smuzhiyunwhich is particularly sensitive to delays. 595*4882a593Smuzhiyun 596*4882a593SmuzhiyunRCU-walk - faster pathname lookup in Linux 597*4882a593Smuzhiyun========================================== 598*4882a593Smuzhiyun 599*4882a593SmuzhiyunRCU-walk is another algorithm for performing pathname lookup in Linux. 600*4882a593SmuzhiyunIt is in many ways similar to REF-walk and the two share quite a bit 601*4882a593Smuzhiyunof code. The significant difference in RCU-walk is how it allows for 602*4882a593Smuzhiyunthe possibility of concurrent access. 603*4882a593Smuzhiyun 604*4882a593SmuzhiyunWe noted that REF-walk is complex because there are numerous details 605*4882a593Smuzhiyunand special cases. RCU-walk reduces this complexity by simply 606*4882a593Smuzhiyunrefusing to handle a number of cases -- it instead falls back to 607*4882a593SmuzhiyunREF-walk. The difficulty with RCU-walk comes from a different 608*4882a593Smuzhiyundirection: unfamiliarity. The locking rules when depending on RCU are 609*4882a593Smuzhiyunquite different from traditional locking, so we will spend a little extra 610*4882a593Smuzhiyuntime when we come to those. 611*4882a593Smuzhiyun 612*4882a593SmuzhiyunClear demarcation of roles 613*4882a593Smuzhiyun-------------------------- 614*4882a593Smuzhiyun 615*4882a593SmuzhiyunThe easiest way to manage concurrency is to forcibly stop any other 616*4882a593Smuzhiyunthread from changing the data structures that a given thread is 617*4882a593Smuzhiyunlooking at. In cases where no other thread would even think of 618*4882a593Smuzhiyunchanging the data and lots of different threads want to read at the 619*4882a593Smuzhiyunsame time, this can be very costly. Even when using locks that permit 620*4882a593Smuzhiyunmultiple concurrent readers, the simple act of updating the count of 621*4882a593Smuzhiyunthe number of current readers can impose an unwanted cost. So the 622*4882a593Smuzhiyungoal when reading a shared data structure that no other process is 623*4882a593Smuzhiyunchanging is to avoid writing anything to memory at all. Take no 624*4882a593Smuzhiyunlocks, increment no counts, leave no footprints. 625*4882a593Smuzhiyun 626*4882a593SmuzhiyunThe REF-walk mechanism already described certainly doesn't follow this 627*4882a593Smuzhiyunprinciple, but then it is really designed to work when there may well 628*4882a593Smuzhiyunbe other threads modifying the data. RCU-walk, in contrast, is 629*4882a593Smuzhiyundesigned for the common situation where there are lots of frequent 630*4882a593Smuzhiyunreaders and only occasional writers. This may not be common in all 631*4882a593Smuzhiyunparts of the filesystem tree, but in many parts it will be. For the 632*4882a593Smuzhiyunother parts it is important that RCU-walk can quickly fall back to 633*4882a593Smuzhiyunusing REF-walk. 634*4882a593Smuzhiyun 635*4882a593SmuzhiyunPathname lookup always starts in RCU-walk mode but only remains there 636*4882a593Smuzhiyunas long as what it is looking for is in the cache and is stable. It 637*4882a593Smuzhiyundances lightly down the cached filesystem image, leaving no footprints 638*4882a593Smuzhiyunand carefully watching where it is, to be sure it doesn't trip. If it 639*4882a593Smuzhiyunnotices that something has changed or is changing, or if something 640*4882a593Smuzhiyunisn't in the cache, then it tries to stop gracefully and switch to 641*4882a593SmuzhiyunREF-walk. 642*4882a593Smuzhiyun 643*4882a593SmuzhiyunThis stopping requires getting a counted reference on the current 644*4882a593Smuzhiyun``vfsmount`` and ``dentry``, and ensuring that these are still valid - 645*4882a593Smuzhiyunthat a path walk with REF-walk would have found the same entries. 646*4882a593SmuzhiyunThis is an invariant that RCU-walk must guarantee. It can only make 647*4882a593Smuzhiyundecisions, such as selecting the next step, that are decisions which 648*4882a593SmuzhiyunREF-walk could also have made if it were walking down the tree at the 649*4882a593Smuzhiyunsame time. If the graceful stop succeeds, the rest of the path is 650*4882a593Smuzhiyunprocessed with the reliable, if slightly sluggish, REF-walk. If 651*4882a593SmuzhiyunRCU-walk finds it cannot stop gracefully, it simply gives up and 652*4882a593Smuzhiyunrestarts from the top with REF-walk. 653*4882a593Smuzhiyun 654*4882a593SmuzhiyunThis pattern of "try RCU-walk, if that fails try REF-walk" can be 655*4882a593Smuzhiyunclearly seen in functions like ``filename_lookup()``, 656*4882a593Smuzhiyun``filename_parentat()``, ``filename_mountpoint()``, 657*4882a593Smuzhiyun``do_filp_open()``, and ``do_file_open_root()``. These five 658*4882a593Smuzhiyuncorrespond roughly to the four ``path_*()`` functions we met earlier, 659*4882a593Smuzhiyuneach of which calls ``link_path_walk()``. The ``path_*()`` functions are 660*4882a593Smuzhiyuncalled using different mode flags until a mode is found which works. 661*4882a593SmuzhiyunThey are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If 662*4882a593Smuzhiyunthat fails with the error ``ECHILD`` they are called again with no 663*4882a593Smuzhiyunspecial flag to request "REF-walk". If either of those report the 664*4882a593Smuzhiyunerror ``ESTALE`` a final attempt is made with ``LOOKUP_REVAL`` set (and no 665*4882a593Smuzhiyun``LOOKUP_RCU``) to ensure that entries found in the cache are forcibly 666*4882a593Smuzhiyunrevalidated - normally entries are only revalidated if the filesystem 667*4882a593Smuzhiyundetermines that they are too old to trust. 668*4882a593Smuzhiyun 669*4882a593SmuzhiyunThe ``LOOKUP_RCU`` attempt may drop that flag internally and switch to 670*4882a593SmuzhiyunREF-walk, but will never then try to switch back to RCU-walk. Places 671*4882a593Smuzhiyunthat trip up RCU-walk are much more likely to be near the leaves and 672*4882a593Smuzhiyunso it is very unlikely that there will be much, if any, benefit from 673*4882a593Smuzhiyunswitching back. 674*4882a593Smuzhiyun 675*4882a593SmuzhiyunRCU and seqlocks: fast and light 676*4882a593Smuzhiyun-------------------------------- 677*4882a593Smuzhiyun 678*4882a593SmuzhiyunRCU is, unsurprisingly, critical to RCU-walk mode. The 679*4882a593Smuzhiyun``rcu_read_lock()`` is held for the entire time that RCU-walk is walking 680*4882a593Smuzhiyundown a path. The particular guarantee it provides is that the key 681*4882a593Smuzhiyundata structures - dentries, inodes, super_blocks, and mounts - will 682*4882a593Smuzhiyunnot be freed while the lock is held. They might be unlinked or 683*4882a593Smuzhiyuninvalidated in one way or another, but the memory will not be 684*4882a593Smuzhiyunrepurposed so values in various fields will still be meaningful. This 685*4882a593Smuzhiyunis the only guarantee that RCU provides; everything else is done using 686*4882a593Smuzhiyunseqlocks. 687*4882a593Smuzhiyun 688*4882a593SmuzhiyunAs we saw above, REF-walk holds a counted reference to the current 689*4882a593Smuzhiyundentry and the current vfsmount, and does not release those references 690*4882a593Smuzhiyunbefore taking references to the "next" dentry or vfsmount. It also 691*4882a593Smuzhiyunsometimes takes the ``d_lock`` spinlock. These references and locks are 692*4882a593Smuzhiyuntaken to prevent certain changes from happening. RCU-walk must not 693*4882a593Smuzhiyuntake those references or locks and so cannot prevent such changes. 694*4882a593SmuzhiyunInstead, it checks to see if a change has been made, and aborts or 695*4882a593Smuzhiyunretries if it has. 696*4882a593Smuzhiyun 697*4882a593SmuzhiyunTo preserve the invariant mentioned above (that RCU-walk may only make 698*4882a593Smuzhiyundecisions that REF-walk could have made), it must make the checks at 699*4882a593Smuzhiyunor near the same places that REF-walk holds the references. So, when 700*4882a593SmuzhiyunREF-walk increments a reference count or takes a spinlock, RCU-walk 701*4882a593Smuzhiyunsamples the status of a seqlock using ``read_seqcount_begin()`` or a 702*4882a593Smuzhiyunsimilar function. When REF-walk decrements the count or drops the 703*4882a593Smuzhiyunlock, RCU-walk checks if the sampled status is still valid using 704*4882a593Smuzhiyun``read_seqcount_retry()`` or similar. 705*4882a593Smuzhiyun 706*4882a593SmuzhiyunHowever, there is a little bit more to seqlocks than that. If 707*4882a593SmuzhiyunRCU-walk accesses two different fields in a seqlock-protected 708*4882a593Smuzhiyunstructure, or accesses the same field twice, there is no a priori 709*4882a593Smuzhiyunguarantee of any consistency between those accesses. When consistency 710*4882a593Smuzhiyunis needed - which it usually is - RCU-walk must take a copy and then 711*4882a593Smuzhiyunuse ``read_seqcount_retry()`` to validate that copy. 712*4882a593Smuzhiyun 713*4882a593Smuzhiyun``read_seqcount_retry()`` not only checks the sequence number, but also 714*4882a593Smuzhiyunimposes a memory barrier so that no memory-read instruction from 715*4882a593Smuzhiyun*before* the call can be delayed until *after* the call, either by the 716*4882a593SmuzhiyunCPU or by the compiler. A simple example of this can be seen in 717*4882a593Smuzhiyun``slow_dentry_cmp()`` which, for filesystems which do not use simple 718*4882a593Smuzhiyunbyte-wise name equality, calls into the filesystem to compare a name 719*4882a593Smuzhiyunagainst a dentry. The length and name pointer are copied into local 720*4882a593Smuzhiyunvariables, then ``read_seqcount_retry()`` is called to confirm the two 721*4882a593Smuzhiyunare consistent, and only then is ``->d_compare()`` called. When 722*4882a593Smuzhiyunstandard filename comparison is used, ``dentry_cmp()`` is called 723*4882a593Smuzhiyuninstead. Notably it does *not* use ``read_seqcount_retry()``, but 724*4882a593Smuzhiyuninstead has a large comment explaining why the consistency guarantee 725*4882a593Smuzhiyunisn't necessary. A subsequent ``read_seqcount_retry()`` will be 726*4882a593Smuzhiyunsufficient to catch any problem that could occur at this point. 727*4882a593Smuzhiyun 728*4882a593SmuzhiyunWith that little refresher on seqlocks out of the way we can look at 729*4882a593Smuzhiyunthe bigger picture of how RCU-walk uses seqlocks. 730*4882a593Smuzhiyun 731*4882a593Smuzhiyun``mount_lock`` and ``nd->m_seq`` 732*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 733*4882a593Smuzhiyun 734*4882a593SmuzhiyunWe already met the ``mount_lock`` seqlock when REF-walk used it to 735*4882a593Smuzhiyunensure that crossing a mount point is performed safely. RCU-walk uses 736*4882a593Smuzhiyunit for that too, but for quite a bit more. 737*4882a593Smuzhiyun 738*4882a593SmuzhiyunInstead of taking a counted reference to each ``vfsmount`` as it 739*4882a593Smuzhiyundescends the tree, RCU-walk samples the state of ``mount_lock`` at the 740*4882a593Smuzhiyunstart of the walk and stores this initial sequence number in the 741*4882a593Smuzhiyun``struct nameidata`` in the ``m_seq`` field. This one lock and one 742*4882a593Smuzhiyunsequence number are used to validate all accesses to all ``vfsmounts``, 743*4882a593Smuzhiyunand all mount point crossings. As changes to the mount table are 744*4882a593Smuzhiyunrelatively rare, it is reasonable to fall back on REF-walk any time 745*4882a593Smuzhiyunthat any "mount" or "unmount" happens. 746*4882a593Smuzhiyun 747*4882a593Smuzhiyun``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk 748*4882a593Smuzhiyunsequence, whether switching to REF-walk for the rest of the path or 749*4882a593Smuzhiyunwhen the end of the path is reached. It is also checked when stepping 750*4882a593Smuzhiyundown over a mount point (in ``__follow_mount_rcu()``) or up (in 751*4882a593Smuzhiyun``follow_dotdot_rcu()``). If it is ever found to have changed, the 752*4882a593Smuzhiyunwhole RCU-walk sequence is aborted and the path is processed again by 753*4882a593SmuzhiyunREF-walk. 754*4882a593Smuzhiyun 755*4882a593SmuzhiyunIf RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure 756*4882a593Smuzhiyunthat, had REF-walk taken counted references on each vfsmount, the 757*4882a593Smuzhiyunresults would have been the same. This ensures the invariant holds, 758*4882a593Smuzhiyunat least for vfsmount structures. 759*4882a593Smuzhiyun 760*4882a593Smuzhiyun``dentry->d_seq`` and ``nd->seq`` 761*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 762*4882a593Smuzhiyun 763*4882a593SmuzhiyunIn place of taking a count or lock on ``d_reflock``, RCU-walk samples 764*4882a593Smuzhiyunthe per-dentry ``d_seq`` seqlock, and stores the sequence number in the 765*4882a593Smuzhiyun``seq`` field of the nameidata structure, so ``nd->seq`` should always be 766*4882a593Smuzhiyunthe current sequence number of ``nd->dentry``. This number needs to be 767*4882a593Smuzhiyunrevalidated after copying, and before using, the name, parent, or 768*4882a593Smuzhiyuninode of the dentry. 769*4882a593Smuzhiyun 770*4882a593SmuzhiyunThe handling of the name we have already looked at, and the parent is 771*4882a593Smuzhiyunonly accessed in ``follow_dotdot_rcu()`` which fairly trivially follows 772*4882a593Smuzhiyunthe required pattern, though it does so for three different cases. 773*4882a593Smuzhiyun 774*4882a593SmuzhiyunWhen not at a mount point, ``d_parent`` is followed and its ``d_seq`` is 775*4882a593Smuzhiyuncollected. When we are at a mount point, we instead follow the 776*4882a593Smuzhiyun``mnt->mnt_mountpoint`` link to get a new dentry and collect its 777*4882a593Smuzhiyun``d_seq``. Then, after finally finding a ``d_parent`` to follow, we must 778*4882a593Smuzhiyuncheck if we have landed on a mount point and, if so, must find that 779*4882a593Smuzhiyunmount point and follow the ``mnt->mnt_root`` link. This would imply a 780*4882a593Smuzhiyunsomewhat unusual, but certainly possible, circumstance where the 781*4882a593Smuzhiyunstarting point of the path lookup was in part of the filesystem that 782*4882a593Smuzhiyunwas mounted on, and so not visible from the root. 783*4882a593Smuzhiyun 784*4882a593SmuzhiyunThe inode pointer, stored in ``->d_inode``, is a little more 785*4882a593Smuzhiyuninteresting. The inode will always need to be accessed at least 786*4882a593Smuzhiyuntwice, once to determine if it is NULL and once to verify access 787*4882a593Smuzhiyunpermissions. Symlink handling requires a validated inode pointer too. 788*4882a593SmuzhiyunRather than revalidating on each access, a copy is made on the first 789*4882a593Smuzhiyunaccess and it is stored in the ``inode`` field of ``nameidata`` from where 790*4882a593Smuzhiyunit can be safely accessed without further validation. 791*4882a593Smuzhiyun 792*4882a593Smuzhiyun``lookup_fast()`` is the only lookup routine that is used in RCU-mode, 793*4882a593Smuzhiyun``lookup_slow()`` being too slow and requiring locks. It is in 794*4882a593Smuzhiyun``lookup_fast()`` that we find the important "hand over hand" tracking 795*4882a593Smuzhiyunof the current dentry. 796*4882a593Smuzhiyun 797*4882a593SmuzhiyunThe current ``dentry`` and current ``seq`` number are passed to 798*4882a593Smuzhiyun``__d_lookup_rcu()`` which, on success, returns a new ``dentry`` and a 799*4882a593Smuzhiyunnew ``seq`` number. ``lookup_fast()`` then copies the inode pointer and 800*4882a593Smuzhiyunrevalidates the new ``seq`` number. It then validates the old ``dentry`` 801*4882a593Smuzhiyunwith the old ``seq`` number one last time and only then continues. This 802*4882a593Smuzhiyunprocess of getting the ``seq`` number of the new dentry and then 803*4882a593Smuzhiyunchecking the ``seq`` number of the old exactly mirrors the process of 804*4882a593Smuzhiyungetting a counted reference to the new dentry before dropping that for 805*4882a593Smuzhiyunthe old dentry which we saw in REF-walk. 806*4882a593Smuzhiyun 807*4882a593SmuzhiyunNo ``inode->i_rwsem`` or even ``rename_lock`` 808*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 809*4882a593Smuzhiyun 810*4882a593SmuzhiyunA semaphore is a fairly heavyweight lock that can only be taken when it is 811*4882a593Smuzhiyunpermissible to sleep. As ``rcu_read_lock()`` forbids sleeping, 812*4882a593Smuzhiyun``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does 813*4882a593Smuzhiyuntake ``i_rwsem`` and modifies the directory in a way that RCU-walk needs 814*4882a593Smuzhiyunto notice, the result will be either that RCU-walk fails to find the 815*4882a593Smuzhiyundentry that it is looking for, or it will find a dentry which 816*4882a593Smuzhiyun``read_seqretry()`` won't validate. In either case it will drop down to 817*4882a593SmuzhiyunREF-walk mode which can take whatever locks are needed. 818*4882a593Smuzhiyun 819*4882a593SmuzhiyunThough ``rename_lock`` could be used by RCU-walk as it doesn't require 820*4882a593Smuzhiyunany sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to 821*4882a593Smuzhiyunprotect against the possibility of hash chains in the dcache changing 822*4882a593Smuzhiyunwhile they are being searched. This can result in failing to find 823*4882a593Smuzhiyunsomething that actually is there. When RCU-walk fails to find 824*4882a593Smuzhiyunsomething in the dentry cache, whether it is really there or not, it 825*4882a593Smuzhiyunalready drops down to REF-walk and tries again with appropriate 826*4882a593Smuzhiyunlocking. This neatly handles all cases, so adding extra checks on 827*4882a593Smuzhiyunrename_lock would bring no significant value. 828*4882a593Smuzhiyun 829*4882a593Smuzhiyun``unlazy walk()`` and ``complete_walk()`` 830*4882a593Smuzhiyun----------------------------------------- 831*4882a593Smuzhiyun 832*4882a593SmuzhiyunThat "dropping down to REF-walk" typically involves a call to 833*4882a593Smuzhiyun``unlazy_walk()``, so named because "RCU-walk" is also sometimes 834*4882a593Smuzhiyunreferred to as "lazy walk". ``unlazy_walk()`` is called when 835*4882a593Smuzhiyunfollowing the path down to the current vfsmount/dentry pair seems to 836*4882a593Smuzhiyunhave proceeded successfully, but the next step is problematic. This 837*4882a593Smuzhiyuncan happen if the next name cannot be found in the dcache, if 838*4882a593Smuzhiyunpermission checking or name revalidation couldn't be achieved while 839*4882a593Smuzhiyunthe ``rcu_read_lock()`` is held (which forbids sleeping), if an 840*4882a593Smuzhiyunautomount point is found, or in a couple of cases involving symlinks. 841*4882a593SmuzhiyunIt is also called from ``complete_walk()`` when the lookup has reached 842*4882a593Smuzhiyunthe final component, or the very end of the path, depending on which 843*4882a593Smuzhiyunparticular flavor of lookup is used. 844*4882a593Smuzhiyun 845*4882a593SmuzhiyunOther reasons for dropping out of RCU-walk that do not trigger a call 846*4882a593Smuzhiyunto ``unlazy_walk()`` are when some inconsistency is found that cannot be 847*4882a593Smuzhiyunhandled immediately, such as ``mount_lock`` or one of the ``d_seq`` 848*4882a593Smuzhiyunseqlocks reporting a change. In these cases the relevant function 849*4882a593Smuzhiyunwill return ``-ECHILD`` which will percolate up until it triggers a new 850*4882a593Smuzhiyunattempt from the top using REF-walk. 851*4882a593Smuzhiyun 852*4882a593SmuzhiyunFor those cases where ``unlazy_walk()`` is an option, it essentially 853*4882a593Smuzhiyuntakes a reference on each of the pointers that it holds (vfsmount, 854*4882a593Smuzhiyundentry, and possibly some symbolic links) and then verifies that the 855*4882a593Smuzhiyunrelevant seqlocks have not been changed. If there have been changes, 856*4882a593Smuzhiyunit, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk 857*4882a593Smuzhiyunhas been a success and the lookup process continues. 858*4882a593Smuzhiyun 859*4882a593SmuzhiyunTaking a reference on those pointers is not quite as simple as just 860*4882a593Smuzhiyunincrementing a counter. That works to take a second reference if you 861*4882a593Smuzhiyunalready have one (often indirectly through another object), but it 862*4882a593Smuzhiyunisn't sufficient if you don't actually have a counted reference at 863*4882a593Smuzhiyunall. For ``dentry->d_lockref``, it is safe to increment the reference 864*4882a593Smuzhiyuncounter to get a reference unless it has been explicitly marked as 865*4882a593Smuzhiyun"dead" which involves setting the counter to ``-128``. 866*4882a593Smuzhiyun``lockref_get_not_dead()`` achieves this. 867*4882a593Smuzhiyun 868*4882a593SmuzhiyunFor ``mnt->mnt_count`` it is safe to take a reference as long as 869*4882a593Smuzhiyun``mount_lock`` is then used to validate the reference. If that 870*4882a593Smuzhiyunvalidation fails, it may *not* be safe to just drop that reference in 871*4882a593Smuzhiyunthe standard way of calling ``mnt_put()`` - an unmount may have 872*4882a593Smuzhiyunprogressed too far. So the code in ``legitimize_mnt()``, when it 873*4882a593Smuzhiyunfinds that the reference it got might not be safe, checks the 874*4882a593Smuzhiyun``MNT_SYNC_UMOUNT`` flag to determine if a simple ``mnt_put()`` is 875*4882a593Smuzhiyuncorrect, or if it should just decrement the count and pretend none of 876*4882a593Smuzhiyunthis ever happened. 877*4882a593Smuzhiyun 878*4882a593SmuzhiyunTaking care in filesystems 879*4882a593Smuzhiyun-------------------------- 880*4882a593Smuzhiyun 881*4882a593SmuzhiyunRCU-walk depends almost entirely on cached information and often will 882*4882a593Smuzhiyunnot call into the filesystem at all. However there are two places, 883*4882a593Smuzhiyunbesides the already-mentioned component-name comparison, where the 884*4882a593Smuzhiyunfile system might be included in RCU-walk, and it must know to be 885*4882a593Smuzhiyuncareful. 886*4882a593Smuzhiyun 887*4882a593SmuzhiyunIf the filesystem has non-standard permission-checking requirements - 888*4882a593Smuzhiyunsuch as a networked filesystem which may need to check with the server 889*4882a593Smuzhiyun- the ``i_op->permission`` interface might be called during RCU-walk. 890*4882a593SmuzhiyunIn this case an extra "``MAY_NOT_BLOCK``" flag is passed so that it 891*4882a593Smuzhiyunknows not to sleep, but to return ``-ECHILD`` if it cannot complete 892*4882a593Smuzhiyunpromptly. ``i_op->permission`` is given the inode pointer, not the 893*4882a593Smuzhiyundentry, so it doesn't need to worry about further consistency checks. 894*4882a593SmuzhiyunHowever if it accesses any other filesystem data structures, it must 895*4882a593Smuzhiyunensure they are safe to be accessed with only the ``rcu_read_lock()`` 896*4882a593Smuzhiyunheld. This typically means they must be freed using ``kfree_rcu()`` or 897*4882a593Smuzhiyunsimilar. 898*4882a593Smuzhiyun 899*4882a593Smuzhiyun.. _READ_ONCE: https://lwn.net/Articles/624126/ 900*4882a593Smuzhiyun 901*4882a593SmuzhiyunIf the filesystem may need to revalidate dcache entries, then 902*4882a593Smuzhiyun``d_op->d_revalidate`` may be called in RCU-walk too. This interface 903*4882a593Smuzhiyun*is* passed the dentry but does not have access to the ``inode`` or the 904*4882a593Smuzhiyun``seq`` number from the ``nameidata``, so it needs to be extra careful 905*4882a593Smuzhiyunwhen accessing fields in the dentry. This "extra care" typically 906*4882a593Smuzhiyuninvolves using `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the 907*4882a593Smuzhiyunresult is not NULL before using it. This pattern can be seen in 908*4882a593Smuzhiyun``nfs_lookup_revalidate()``. 909*4882a593Smuzhiyun 910*4882a593SmuzhiyunA pair of patterns 911*4882a593Smuzhiyun------------------ 912*4882a593Smuzhiyun 913*4882a593SmuzhiyunIn various places in the details of REF-walk and RCU-walk, and also in 914*4882a593Smuzhiyunthe big picture, there are a couple of related patterns that are worth 915*4882a593Smuzhiyunbeing aware of. 916*4882a593Smuzhiyun 917*4882a593SmuzhiyunThe first is "try quickly and check, if that fails try slowly". We 918*4882a593Smuzhiyuncan see that in the high-level approach of first trying RCU-walk and 919*4882a593Smuzhiyunthen trying REF-walk, and in places where ``unlazy_walk()`` is used to 920*4882a593Smuzhiyunswitch to REF-walk for the rest of the path. We also saw it earlier 921*4882a593Smuzhiyunin ``dget_parent()`` when following a "``..``" link. It tries a quick way 922*4882a593Smuzhiyunto get a reference, then falls back to taking locks if needed. 923*4882a593Smuzhiyun 924*4882a593SmuzhiyunThe second pattern is "try quickly and check, if that fails try 925*4882a593Smuzhiyunagain - repeatedly". This is seen with the use of ``rename_lock`` and 926*4882a593Smuzhiyun``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern - 927*4882a593Smuzhiyunif anything goes wrong it is much safer to just abort and try a more 928*4882a593Smuzhiyunsedate approach. 929*4882a593Smuzhiyun 930*4882a593SmuzhiyunThe emphasis here is "try quickly and check". It should probably be 931*4882a593Smuzhiyun"try quickly *and carefully*, then check". The fact that checking is 932*4882a593Smuzhiyunneeded is a reminder that the system is dynamic and only a limited 933*4882a593Smuzhiyunnumber of things are safe at all. The most likely cause of errors in 934*4882a593Smuzhiyunthis whole process is assuming something is safe when in reality it 935*4882a593Smuzhiyunisn't. Careful consideration of what exactly guarantees the safety of 936*4882a593Smuzhiyuneach access is sometimes necessary. 937*4882a593Smuzhiyun 938*4882a593SmuzhiyunA walk among the symlinks 939*4882a593Smuzhiyun========================= 940*4882a593Smuzhiyun 941*4882a593SmuzhiyunThere are several basic issues that we will examine to understand the 942*4882a593Smuzhiyunhandling of symbolic links: the symlink stack, together with cache 943*4882a593Smuzhiyunlifetimes, will help us understand the overall recursive handling of 944*4882a593Smuzhiyunsymlinks and lead to the special care needed for the final component. 945*4882a593SmuzhiyunThen a consideration of access-time updates and summary of the various 946*4882a593Smuzhiyunflags controlling lookup will finish the story. 947*4882a593Smuzhiyun 948*4882a593SmuzhiyunThe symlink stack 949*4882a593Smuzhiyun----------------- 950*4882a593Smuzhiyun 951*4882a593SmuzhiyunThere are only two sorts of filesystem objects that can usefully 952*4882a593Smuzhiyunappear in a path prior to the final component: directories and symlinks. 953*4882a593SmuzhiyunHandling directories is quite straightforward: the new directory 954*4882a593Smuzhiyunsimply becomes the starting point at which to interpret the next 955*4882a593Smuzhiyuncomponent on the path. Handling symbolic links requires a bit more 956*4882a593Smuzhiyunwork. 957*4882a593Smuzhiyun 958*4882a593SmuzhiyunConceptually, symbolic links could be handled by editing the path. If 959*4882a593Smuzhiyuna component name refers to a symbolic link, then that component is 960*4882a593Smuzhiyunreplaced by the body of the link and, if that body starts with a '/', 961*4882a593Smuzhiyunthen all preceding parts of the path are discarded. This is what the 962*4882a593Smuzhiyun"``readlink -f``" command does, though it also edits out "``.``" and 963*4882a593Smuzhiyun"``..``" components. 964*4882a593Smuzhiyun 965*4882a593SmuzhiyunDirectly editing the path string is not really necessary when looking 966*4882a593Smuzhiyunup a path, and discarding early components is pointless as they aren't 967*4882a593Smuzhiyunlooked at anyway. Keeping track of all remaining components is 968*4882a593Smuzhiyunimportant, but they can of course be kept separately; there is no need 969*4882a593Smuzhiyunto concatenate them. As one symlink may easily refer to another, 970*4882a593Smuzhiyunwhich in turn can refer to a third, we may need to keep the remaining 971*4882a593Smuzhiyuncomponents of several paths, each to be processed when the preceding 972*4882a593Smuzhiyunones are completed. These path remnants are kept on a stack of 973*4882a593Smuzhiyunlimited size. 974*4882a593Smuzhiyun 975*4882a593SmuzhiyunThere are two reasons for placing limits on how many symlinks can 976*4882a593Smuzhiyunoccur in a single path lookup. The most obvious is to avoid loops. 977*4882a593SmuzhiyunIf a symlink referred to itself either directly or through 978*4882a593Smuzhiyunintermediaries, then following the symlink can never complete 979*4882a593Smuzhiyunsuccessfully - the error ``ELOOP`` must be returned. Loops can be 980*4882a593Smuzhiyundetected without imposing limits, but limits are the simplest solution 981*4882a593Smuzhiyunand, given the second reason for restriction, quite sufficient. 982*4882a593Smuzhiyun 983*4882a593Smuzhiyun.. _outlined recently: http://thread.gmane.org/gmane.linux.kernel/1934390/focus=1934550 984*4882a593Smuzhiyun 985*4882a593SmuzhiyunThe second reason was `outlined recently`_ by Linus: 986*4882a593Smuzhiyun 987*4882a593Smuzhiyun Because it's a latency and DoS issue too. We need to react well to 988*4882a593Smuzhiyun true loops, but also to "very deep" non-loops. It's not about memory 989*4882a593Smuzhiyun use, it's about users triggering unreasonable CPU resources. 990*4882a593Smuzhiyun 991*4882a593SmuzhiyunLinux imposes a limit on the length of any pathname: ``PATH_MAX``, which 992*4882a593Smuzhiyunis 4096. There are a number of reasons for this limit; not letting the 993*4882a593Smuzhiyunkernel spend too much time on just one path is one of them. With 994*4882a593Smuzhiyunsymbolic links you can effectively generate much longer paths so some 995*4882a593Smuzhiyunsort of limit is needed for the same reason. Linux imposes a limit of 996*4882a593Smuzhiyunat most 40 symlinks in any one path lookup. It previously imposed a 997*4882a593Smuzhiyunfurther limit of eight on the maximum depth of recursion, but that was 998*4882a593Smuzhiyunraised to 40 when a separate stack was implemented, so there is now 999*4882a593Smuzhiyunjust the one limit. 1000*4882a593Smuzhiyun 1001*4882a593SmuzhiyunThe ``nameidata`` structure that we met in an earlier article contains a 1002*4882a593Smuzhiyunsmall stack that can be used to store the remaining part of up to two 1003*4882a593Smuzhiyunsymlinks. In many cases this will be sufficient. If it isn't, a 1004*4882a593Smuzhiyunseparate stack is allocated with room for 40 symlinks. Pathname 1005*4882a593Smuzhiyunlookup will never exceed that stack as, once the 40th symlink is 1006*4882a593Smuzhiyundetected, an error is returned. 1007*4882a593Smuzhiyun 1008*4882a593SmuzhiyunIt might seem that the name remnants are all that needs to be stored on 1009*4882a593Smuzhiyunthis stack, but we need a bit more. To see that, we need to move on to 1010*4882a593Smuzhiyuncache lifetimes. 1011*4882a593Smuzhiyun 1012*4882a593SmuzhiyunStorage and lifetime of cached symlinks 1013*4882a593Smuzhiyun--------------------------------------- 1014*4882a593Smuzhiyun 1015*4882a593SmuzhiyunLike other filesystem resources, such as inodes and directory 1016*4882a593Smuzhiyunentries, symlinks are cached by Linux to avoid repeated costly access 1017*4882a593Smuzhiyunto external storage. It is particularly important for RCU-walk to be 1018*4882a593Smuzhiyunable to find and temporarily hold onto these cached entries, so that 1019*4882a593Smuzhiyunit doesn't need to drop down into REF-walk. 1020*4882a593Smuzhiyun 1021*4882a593Smuzhiyun.. _object-oriented design pattern: https://lwn.net/Articles/446317/ 1022*4882a593Smuzhiyun 1023*4882a593SmuzhiyunWhile each filesystem is free to make its own choice, symlinks are 1024*4882a593Smuzhiyuntypically stored in one of two places. Short symlinks are often 1025*4882a593Smuzhiyunstored directly in the inode. When a filesystem allocates a ``struct 1026*4882a593Smuzhiyuninode`` it typically allocates extra space to store private data (a 1027*4882a593Smuzhiyuncommon `object-oriented design pattern`_ in the kernel). This will 1028*4882a593Smuzhiyunsometimes include space for a symlink. The other common location is 1029*4882a593Smuzhiyunin the page cache, which normally stores the content of files. The 1030*4882a593Smuzhiyunpathname in a symlink can be seen as the content of that symlink and 1031*4882a593Smuzhiyuncan easily be stored in the page cache just like file content. 1032*4882a593Smuzhiyun 1033*4882a593SmuzhiyunWhen neither of these is suitable, the next most likely scenario is 1034*4882a593Smuzhiyunthat the filesystem will allocate some temporary memory and copy or 1035*4882a593Smuzhiyunconstruct the symlink content into that memory whenever it is needed. 1036*4882a593Smuzhiyun 1037*4882a593SmuzhiyunWhen the symlink is stored in the inode, it has the same lifetime as 1038*4882a593Smuzhiyunthe inode which, itself, is protected by RCU or by a counted reference 1039*4882a593Smuzhiyunon the dentry. This means that the mechanisms that pathname lookup 1040*4882a593Smuzhiyunuses to access the dcache and icache (inode cache) safely are quite 1041*4882a593Smuzhiyunsufficient for accessing some cached symlinks safely. In these cases, 1042*4882a593Smuzhiyunthe ``i_link`` pointer in the inode is set to point to wherever the 1043*4882a593Smuzhiyunsymlink is stored and it can be accessed directly whenever needed. 1044*4882a593Smuzhiyun 1045*4882a593SmuzhiyunWhen the symlink is stored in the page cache or elsewhere, the 1046*4882a593Smuzhiyunsituation is not so straightforward. A reference on a dentry or even 1047*4882a593Smuzhiyunon an inode does not imply any reference on cached pages of that 1048*4882a593Smuzhiyuninode, and even an ``rcu_read_lock()`` is not sufficient to ensure that 1049*4882a593Smuzhiyuna page will not disappear. So for these symlinks the pathname lookup 1050*4882a593Smuzhiyuncode needs to ask the filesystem to provide a stable reference and, 1051*4882a593Smuzhiyunsignificantly, needs to release that reference when it is finished 1052*4882a593Smuzhiyunwith it. 1053*4882a593Smuzhiyun 1054*4882a593SmuzhiyunTaking a reference to a cache page is often possible even in RCU-walk 1055*4882a593Smuzhiyunmode. It does require making changes to memory, which is best avoided, 1056*4882a593Smuzhiyunbut that isn't necessarily a big cost and it is better than dropping 1057*4882a593Smuzhiyunout of RCU-walk mode completely. Even filesystems that allocate 1058*4882a593Smuzhiyunspace to copy the symlink into can use ``GFP_ATOMIC`` to often successfully 1059*4882a593Smuzhiyunallocate memory without the need to drop out of RCU-walk. If a 1060*4882a593Smuzhiyunfilesystem cannot successfully get a reference in RCU-walk mode, it 1061*4882a593Smuzhiyunmust return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to 1062*4882a593SmuzhiyunREF-walk mode in which the filesystem is allowed to sleep. 1063*4882a593Smuzhiyun 1064*4882a593SmuzhiyunThe place for all this to happen is the ``i_op->follow_link()`` inode 1065*4882a593Smuzhiyunmethod. In the present mainline code this is never actually called in 1066*4882a593SmuzhiyunRCU-walk mode as the rewrite is not quite complete. It is likely that 1067*4882a593Smuzhiyunin a future release this method will be passed an ``inode`` pointer when 1068*4882a593Smuzhiyuncalled in RCU-walk mode so it both (1) knows to be careful, and (2) has the 1069*4882a593Smuzhiyunvalidated pointer. Much like the ``i_op->permission()`` method we 1070*4882a593Smuzhiyunlooked at previously, ``->follow_link()`` would need to be careful that 1071*4882a593Smuzhiyunall the data structures it references are safe to be accessed while 1072*4882a593Smuzhiyunholding no counted reference, only the RCU lock. Though getting a 1073*4882a593Smuzhiyunreference with ``->follow_link()`` is not yet done in RCU-walk mode, the 1074*4882a593Smuzhiyuncode is ready to release the reference when that does happen. 1075*4882a593Smuzhiyun 1076*4882a593SmuzhiyunThis need to drop the reference to a symlink adds significant 1077*4882a593Smuzhiyuncomplexity. It requires a reference to the inode so that the 1078*4882a593Smuzhiyun``i_op->put_link()`` inode operation can be called. In REF-walk, that 1079*4882a593Smuzhiyunreference is kept implicitly through a reference to the dentry, so 1080*4882a593Smuzhiyunkeeping the ``struct path`` of the symlink is easiest. For RCU-walk, 1081*4882a593Smuzhiyunthe pointer to the inode is kept separately. To allow switching from 1082*4882a593SmuzhiyunRCU-walk back to REF-walk in the middle of processing nested symlinks 1083*4882a593Smuzhiyunwe also need the seq number for the dentry so we can confirm that 1084*4882a593Smuzhiyunswitching back was safe. 1085*4882a593Smuzhiyun 1086*4882a593SmuzhiyunFinally, when providing a reference to a symlink, the filesystem also 1087*4882a593Smuzhiyunprovides an opaque "cookie" that must be passed to ``->put_link()`` so that it 1088*4882a593Smuzhiyunknows what to free. This might be the allocated memory area, or a 1089*4882a593Smuzhiyunpointer to the ``struct page`` in the page cache, or something else 1090*4882a593Smuzhiyuncompletely. Only the filesystem knows what it is. 1091*4882a593Smuzhiyun 1092*4882a593SmuzhiyunIn order for the reference to each symlink to be dropped when the walk completes, 1093*4882a593Smuzhiyunwhether in RCU-walk or REF-walk, the symlink stack needs to contain, 1094*4882a593Smuzhiyunalong with the path remnants: 1095*4882a593Smuzhiyun 1096*4882a593Smuzhiyun- the ``struct path`` to provide a reference to the inode in REF-walk 1097*4882a593Smuzhiyun- the ``struct inode *`` to provide a reference to the inode in RCU-walk 1098*4882a593Smuzhiyun- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk 1099*4882a593Smuzhiyun- the ``cookie`` that tells ``->put_path()`` what to put. 1100*4882a593Smuzhiyun 1101*4882a593SmuzhiyunThis means that each entry in the symlink stack needs to hold five 1102*4882a593Smuzhiyunpointers and an integer instead of just one pointer (the path 1103*4882a593Smuzhiyunremnant). On a 64-bit system, this is about 40 bytes per entry; 1104*4882a593Smuzhiyunwith 40 entries it adds up to 1600 bytes total, which is less than 1105*4882a593Smuzhiyunhalf a page. So it might seem like a lot, but is by no means 1106*4882a593Smuzhiyunexcessive. 1107*4882a593Smuzhiyun 1108*4882a593SmuzhiyunNote that, in a given stack frame, the path remnant (``name``) is not 1109*4882a593Smuzhiyunpart of the symlink that the other fields refer to. It is the remnant 1110*4882a593Smuzhiyunto be followed once that symlink has been fully parsed. 1111*4882a593Smuzhiyun 1112*4882a593SmuzhiyunFollowing the symlink 1113*4882a593Smuzhiyun--------------------- 1114*4882a593Smuzhiyun 1115*4882a593SmuzhiyunThe main loop in ``link_path_walk()`` iterates seamlessly over all 1116*4882a593Smuzhiyuncomponents in the path and all of the non-final symlinks. As symlinks 1117*4882a593Smuzhiyunare processed, the ``name`` pointer is adjusted to point to a new 1118*4882a593Smuzhiyunsymlink, or is restored from the stack, so that much of the loop 1119*4882a593Smuzhiyundoesn't need to notice. Getting this ``name`` variable on and off the 1120*4882a593Smuzhiyunstack is very straightforward; pushing and popping the references is 1121*4882a593Smuzhiyuna little more complex. 1122*4882a593Smuzhiyun 1123*4882a593SmuzhiyunWhen a symlink is found, ``walk_component()`` returns the value ``1`` 1124*4882a593Smuzhiyun(``0`` is returned for any other sort of success, and a negative number 1125*4882a593Smuzhiyunis, as usual, an error indicator). This causes ``get_link()`` to be 1126*4882a593Smuzhiyuncalled; it then gets the link from the filesystem. Providing that 1127*4882a593Smuzhiyunoperation is successful, the old path ``name`` is placed on the stack, 1128*4882a593Smuzhiyunand the new value is used as the ``name`` for a while. When the end of 1129*4882a593Smuzhiyunthe path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored 1130*4882a593Smuzhiyunoff the stack and path walking continues. 1131*4882a593Smuzhiyun 1132*4882a593SmuzhiyunPushing and popping the reference pointers (inode, cookie, etc.) is more 1133*4882a593Smuzhiyuncomplex in part because of the desire to handle tail recursion. When 1134*4882a593Smuzhiyunthe last component of a symlink itself points to a symlink, we 1135*4882a593Smuzhiyunwant to pop the symlink-just-completed off the stack before pushing 1136*4882a593Smuzhiyunthe symlink-just-found to avoid leaving empty path remnants that would 1137*4882a593Smuzhiyunjust get in the way. 1138*4882a593Smuzhiyun 1139*4882a593SmuzhiyunIt is most convenient to push the new symlink references onto the 1140*4882a593Smuzhiyunstack in ``walk_component()`` immediately when the symlink is found; 1141*4882a593Smuzhiyun``walk_component()`` is also the last piece of code that needs to look at the 1142*4882a593Smuzhiyunold symlink as it walks that last component. So it is quite 1143*4882a593Smuzhiyunconvenient for ``walk_component()`` to release the old symlink and pop 1144*4882a593Smuzhiyunthe references just before pushing the reference information for the 1145*4882a593Smuzhiyunnew symlink. It is guided in this by two flags; ``WALK_GET``, which 1146*4882a593Smuzhiyungives it permission to follow a symlink if it finds one, and 1147*4882a593Smuzhiyun``WALK_PUT``, which tells it to release the current symlink after it has been 1148*4882a593Smuzhiyunfollowed. ``WALK_PUT`` is tested first, leading to a call to 1149*4882a593Smuzhiyun``put_link()``. ``WALK_GET`` is tested subsequently (by 1150*4882a593Smuzhiyun``should_follow_link()``) leading to a call to ``pick_link()`` which sets 1151*4882a593Smuzhiyunup the stack frame. 1152*4882a593Smuzhiyun 1153*4882a593SmuzhiyunSymlinks with no final component 1154*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1155*4882a593Smuzhiyun 1156*4882a593SmuzhiyunA pair of special-case symlinks deserve a little further explanation. 1157*4882a593SmuzhiyunBoth result in a new ``struct path`` (with mount and dentry) being set 1158*4882a593Smuzhiyunup in the ``nameidata``, and result in ``get_link()`` returning ``NULL``. 1159*4882a593Smuzhiyun 1160*4882a593SmuzhiyunThe more obvious case is a symlink to "``/``". All symlinks starting 1161*4882a593Smuzhiyunwith "``/``" are detected in ``get_link()`` which resets the ``nameidata`` 1162*4882a593Smuzhiyunto point to the effective filesystem root. If the symlink only 1163*4882a593Smuzhiyuncontains "``/``" then there is nothing more to do, no components at all, 1164*4882a593Smuzhiyunso ``NULL`` is returned to indicate that the symlink can be released and 1165*4882a593Smuzhiyunthe stack frame discarded. 1166*4882a593Smuzhiyun 1167*4882a593SmuzhiyunThe other case involves things in ``/proc`` that look like symlinks but 1168*4882a593Smuzhiyunaren't really (and are therefore commonly referred to as "magic-links"):: 1169*4882a593Smuzhiyun 1170*4882a593Smuzhiyun $ ls -l /proc/self/fd/1 1171*4882a593Smuzhiyun lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4 1172*4882a593Smuzhiyun 1173*4882a593SmuzhiyunEvery open file descriptor in any process is represented in ``/proc`` by 1174*4882a593Smuzhiyunsomething that looks like a symlink. It is really a reference to the 1175*4882a593Smuzhiyuntarget file, not just the name of it. When you ``readlink`` these 1176*4882a593Smuzhiyunobjects you get a name that might refer to the same file - unless it 1177*4882a593Smuzhiyunhas been unlinked or mounted over. When ``walk_component()`` follows 1178*4882a593Smuzhiyunone of these, the ``->follow_link()`` method in "procfs" doesn't return 1179*4882a593Smuzhiyuna string name, but instead calls ``nd_jump_link()`` which updates the 1180*4882a593Smuzhiyun``nameidata`` in place to point to that target. ``->follow_link()`` then 1181*4882a593Smuzhiyunreturns ``NULL``. Again there is no final component and ``get_link()`` 1182*4882a593Smuzhiyunreports this by leaving the ``last_type`` field of ``nameidata`` as 1183*4882a593Smuzhiyun``LAST_BIND``. 1184*4882a593Smuzhiyun 1185*4882a593SmuzhiyunFollowing the symlink in the final component 1186*4882a593Smuzhiyun-------------------------------------------- 1187*4882a593Smuzhiyun 1188*4882a593SmuzhiyunAll this leads to ``link_path_walk()`` walking down every component, and 1189*4882a593Smuzhiyunfollowing all symbolic links it finds, until it reaches the final 1190*4882a593Smuzhiyuncomponent. This is just returned in the ``last`` field of ``nameidata``. 1191*4882a593SmuzhiyunFor some callers, this is all they need; they want to create that 1192*4882a593Smuzhiyun``last`` name if it doesn't exist or give an error if it does. Other 1193*4882a593Smuzhiyuncallers will want to follow a symlink if one is found, and possibly 1194*4882a593Smuzhiyunapply special handling to the last component of that symlink, rather 1195*4882a593Smuzhiyunthan just the last component of the original file name. These callers 1196*4882a593Smuzhiyunpotentially need to call ``link_path_walk()`` again and again on 1197*4882a593Smuzhiyunsuccessive symlinks until one is found that doesn't point to another 1198*4882a593Smuzhiyunsymlink. 1199*4882a593Smuzhiyun 1200*4882a593SmuzhiyunThis case is handled by the relevant caller of ``link_path_walk()``, such as 1201*4882a593Smuzhiyun``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then 1202*4882a593Smuzhiyunhandles the final component. If the final component is a symlink 1203*4882a593Smuzhiyunthat needs to be followed, then ``trailing_symlink()`` is called to set 1204*4882a593Smuzhiyunthings up properly and the loop repeats, calling ``link_path_walk()`` 1205*4882a593Smuzhiyunagain. This could loop as many as 40 times if the last component of 1206*4882a593Smuzhiyuneach symlink is another symlink. 1207*4882a593Smuzhiyun 1208*4882a593SmuzhiyunThe various functions that examine the final component and possibly 1209*4882a593Smuzhiyunreport that it is a symlink are ``lookup_last()``, ``mountpoint_last()`` 1210*4882a593Smuzhiyunand ``do_last()``, each of which use the same convention as 1211*4882a593Smuzhiyun``walk_component()`` of returning ``1`` if a symlink was found that needs 1212*4882a593Smuzhiyunto be followed. 1213*4882a593Smuzhiyun 1214*4882a593SmuzhiyunOf these, ``do_last()`` is the most interesting as it is used for 1215*4882a593Smuzhiyunopening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this 1216*4882a593Smuzhiyunpart is in a separate function: ``lookup_open()``. 1217*4882a593Smuzhiyun 1218*4882a593SmuzhiyunExplaining ``do_last()`` completely is beyond the scope of this article, 1219*4882a593Smuzhiyunbut a few highlights should help those interested in exploring the 1220*4882a593Smuzhiyuncode. 1221*4882a593Smuzhiyun 1222*4882a593Smuzhiyun1. Rather than just finding the target file, ``do_last()`` needs to open 1223*4882a593Smuzhiyun it. If the file was found in the dcache, then ``vfs_open()`` is used for 1224*4882a593Smuzhiyun this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if 1225*4882a593Smuzhiyun the filesystem provides it) to combine the final lookup with the open, or 1226*4882a593Smuzhiyun will perform the separate ``lookup_real()`` and ``vfs_create()`` steps 1227*4882a593Smuzhiyun directly. In the later case the actual "open" of this newly found or 1228*4882a593Smuzhiyun created file will be performed by ``vfs_open()``, just as if the name 1229*4882a593Smuzhiyun were found in the dcache. 1230*4882a593Smuzhiyun 1231*4882a593Smuzhiyun2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information 1232*4882a593Smuzhiyun wasn't quite current enough. Rather than restarting the lookup from 1233*4882a593Smuzhiyun the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead, 1234*4882a593Smuzhiyun giving the filesystem a chance to resolve small inconsistencies. 1235*4882a593Smuzhiyun If that doesn't work, only then is the lookup restarted from the top. 1236*4882a593Smuzhiyun 1237*4882a593Smuzhiyun3. An open with O_CREAT **does** follow a symlink in the final component, 1238*4882a593Smuzhiyun unlike other creation system calls (like ``mkdir``). So the sequence:: 1239*4882a593Smuzhiyun 1240*4882a593Smuzhiyun ln -s bar /tmp/foo 1241*4882a593Smuzhiyun echo hello > /tmp/foo 1242*4882a593Smuzhiyun 1243*4882a593Smuzhiyun will create a file called ``/tmp/bar``. This is not permitted if 1244*4882a593Smuzhiyun ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much 1245*4882a593Smuzhiyun like for a non-creating open: ``should_follow_link()`` returns ``1``, and 1246*4882a593Smuzhiyun so does ``do_last()`` so that ``trailing_symlink()`` gets called and the 1247*4882a593Smuzhiyun open process continues on the symlink that was found. 1248*4882a593Smuzhiyun 1249*4882a593SmuzhiyunUpdating the access time 1250*4882a593Smuzhiyun------------------------ 1251*4882a593Smuzhiyun 1252*4882a593SmuzhiyunWe previously said of RCU-walk that it would "take no locks, increment 1253*4882a593Smuzhiyunno counts, leave no footprints." We have since seen that some 1254*4882a593Smuzhiyun"footprints" can be needed when handling symlinks as a counted 1255*4882a593Smuzhiyunreference (or even a memory allocation) may be needed. But these 1256*4882a593Smuzhiyunfootprints are best kept to a minimum. 1257*4882a593Smuzhiyun 1258*4882a593SmuzhiyunOne other place where walking down a symlink can involve leaving 1259*4882a593Smuzhiyunfootprints in a way that doesn't affect directories is in updating access times. 1260*4882a593SmuzhiyunIn Unix (and Linux) every filesystem object has a "last accessed 1261*4882a593Smuzhiyuntime", or "``atime``". Passing through a directory to access a file 1262*4882a593Smuzhiyunwithin is not considered to be an access for the purposes of 1263*4882a593Smuzhiyun``atime``; only listing the contents of a directory can update its ``atime``. 1264*4882a593SmuzhiyunSymlinks are different it seems. Both reading a symlink (with ``readlink()``) 1265*4882a593Smuzhiyunand looking up a symlink on the way to some other destination can 1266*4882a593Smuzhiyunupdate the atime on that symlink. 1267*4882a593Smuzhiyun 1268*4882a593Smuzhiyun.. _clearest statement: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08 1269*4882a593Smuzhiyun 1270*4882a593SmuzhiyunIt is not clear why this is the case; POSIX has little to say on the 1271*4882a593Smuzhiyunsubject. The `clearest statement`_ is that, if a particular implementation 1272*4882a593Smuzhiyunupdates a timestamp in a place not specified by POSIX, this must be 1273*4882a593Smuzhiyundocumented "except that any changes caused by pathname resolution need 1274*4882a593Smuzhiyunnot be documented". This seems to imply that POSIX doesn't really 1275*4882a593Smuzhiyuncare about access-time updates during pathname lookup. 1276*4882a593Smuzhiyun 1277*4882a593Smuzhiyun.. _Linux 1.3.87: https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/fs/ext2/symlink.c?id=f806c6db77b8eaa6e00dcfb6b567706feae8dbb8 1278*4882a593Smuzhiyun 1279*4882a593SmuzhiyunAn examination of history shows that prior to `Linux 1.3.87`_, the ext2 1280*4882a593Smuzhiyunfilesystem, at least, didn't update atime when following a link. 1281*4882a593SmuzhiyunUnfortunately we have no record of why that behavior was changed. 1282*4882a593Smuzhiyun 1283*4882a593SmuzhiyunIn any case, access time must now be updated and that operation can be 1284*4882a593Smuzhiyunquite complex. Trying to stay in RCU-walk while doing it is best 1285*4882a593Smuzhiyunavoided. Fortunately it is often permitted to skip the ``atime`` 1286*4882a593Smuzhiyunupdate. Because ``atime`` updates cause performance problems in various 1287*4882a593Smuzhiyunareas, Linux supports the ``relatime`` mount option, which generally 1288*4882a593Smuzhiyunlimits the updates of ``atime`` to once per day on files that aren't 1289*4882a593Smuzhiyunbeing changed (and symlinks never change once created). Even without 1290*4882a593Smuzhiyun``relatime``, many filesystems record ``atime`` with a one-second 1291*4882a593Smuzhiyungranularity, so only one update per second is required. 1292*4882a593Smuzhiyun 1293*4882a593SmuzhiyunIt is easy to test if an ``atime`` update is needed while in RCU-walk 1294*4882a593Smuzhiyunmode and, if it isn't, the update can be skipped and RCU-walk mode 1295*4882a593Smuzhiyuncontinues. Only when an ``atime`` update is actually required does the 1296*4882a593Smuzhiyunpath walk drop down to REF-walk. All of this is handled in the 1297*4882a593Smuzhiyun``get_link()`` function. 1298*4882a593Smuzhiyun 1299*4882a593SmuzhiyunA few flags 1300*4882a593Smuzhiyun----------- 1301*4882a593Smuzhiyun 1302*4882a593SmuzhiyunA suitable way to wrap up this tour of pathname walking is to list 1303*4882a593Smuzhiyunthe various flags that can be stored in the ``nameidata`` to guide the 1304*4882a593Smuzhiyunlookup process. Many of these are only meaningful on the final 1305*4882a593Smuzhiyuncomponent, others reflect the current state of the pathname lookup, and some 1306*4882a593Smuzhiyunapply restrictions to all path components encountered in the path lookup. 1307*4882a593Smuzhiyun 1308*4882a593SmuzhiyunAnd then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with 1309*4882a593Smuzhiyunthe others. If this is not set, an empty pathname causes an error 1310*4882a593Smuzhiyunvery early on. If it is set, empty pathnames are not considered to be 1311*4882a593Smuzhiyunan error. 1312*4882a593Smuzhiyun 1313*4882a593SmuzhiyunGlobal state flags 1314*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~ 1315*4882a593Smuzhiyun 1316*4882a593SmuzhiyunWe have already met two global state flags: ``LOOKUP_RCU`` and 1317*4882a593Smuzhiyun``LOOKUP_REVAL``. These select between one of three overall approaches 1318*4882a593Smuzhiyunto lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation. 1319*4882a593Smuzhiyun 1320*4882a593Smuzhiyun``LOOKUP_PARENT`` indicates that the final component hasn't been reached 1321*4882a593Smuzhiyunyet. This is primarily used to tell the audit subsystem the full 1322*4882a593Smuzhiyuncontext of a particular access being audited. 1323*4882a593Smuzhiyun 1324*4882a593Smuzhiyun``LOOKUP_ROOT`` indicates that the ``root`` field in the ``nameidata`` was 1325*4882a593Smuzhiyunprovided by the caller, so it shouldn't be released when it is no 1326*4882a593Smuzhiyunlonger needed. 1327*4882a593Smuzhiyun 1328*4882a593Smuzhiyun``LOOKUP_JUMPED`` means that the current dentry was chosen not because 1329*4882a593Smuzhiyunit had the right name but for some other reason. This happens when 1330*4882a593Smuzhiyunfollowing "``..``", following a symlink to ``/``, crossing a mount point 1331*4882a593Smuzhiyunor accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic 1332*4882a593Smuzhiyunlink"). In this case the filesystem has not been asked to revalidate the 1333*4882a593Smuzhiyunname (with ``d_revalidate()``). In such cases the inode may still need 1334*4882a593Smuzhiyunto be revalidated, so ``d_op->d_weak_revalidate()`` is called if 1335*4882a593Smuzhiyun``LOOKUP_JUMPED`` is set when the look completes - which may be at the 1336*4882a593Smuzhiyunfinal component or, when creating, unlinking, or renaming, at the penultimate component. 1337*4882a593Smuzhiyun 1338*4882a593SmuzhiyunResolution-restriction flags 1339*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1340*4882a593Smuzhiyun 1341*4882a593SmuzhiyunIn order to allow userspace to protect itself against certain race conditions 1342*4882a593Smuzhiyunand attack scenarios involving changing path components, a series of flags are 1343*4882a593Smuzhiyunavailable which apply restrictions to all path components encountered during 1344*4882a593Smuzhiyunpath lookup. These flags are exposed through ``openat2()``'s ``resolve`` field. 1345*4882a593Smuzhiyun 1346*4882a593Smuzhiyun``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links). 1347*4882a593SmuzhiyunThis is distinctly different from ``LOOKUP_FOLLOW``, because the latter only 1348*4882a593Smuzhiyunrelates to restricting the following of trailing symlinks. 1349*4882a593Smuzhiyun 1350*4882a593Smuzhiyun``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must 1351*4882a593Smuzhiyunensure that they return errors from ``nd_jump_link()``, because that is how 1352*4882a593Smuzhiyun``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented. 1353*4882a593Smuzhiyun 1354*4882a593Smuzhiyun``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both 1355*4882a593Smuzhiyunbind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the 1356*4882a593Smuzhiyunlookup is determined by the first mountpoint the path lookup reaches -- 1357*4882a593Smuzhiyunabsolute paths start with the ``vfsmount`` of ``/``, and relative paths start 1358*4882a593Smuzhiyunwith the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the 1359*4882a593Smuzhiyun``vfsmount`` of the path is unchanged. 1360*4882a593Smuzhiyun 1361*4882a593Smuzhiyun``LOOKUP_BENEATH`` blocks any path components which resolve outside the 1362*4882a593Smuzhiyunstarting point of the resolution. This is done by blocking ``nd_jump_root()`` 1363*4882a593Smuzhiyunas well as blocking ".." if it would jump outside the starting point. 1364*4882a593Smuzhiyun``rename_lock`` and ``mount_lock`` are used to detect attacks against the 1365*4882a593Smuzhiyunresolution of "..". Magic-links are also blocked. 1366*4882a593Smuzhiyun 1367*4882a593Smuzhiyun``LOOKUP_IN_ROOT`` resolves all path components as though the starting point 1368*4882a593Smuzhiyunwere the filesystem root. ``nd_jump_root()`` brings the resolution back to 1369*4882a593Smuzhiyunthe starting point, and ".." at the starting point will act as a no-op. As with 1370*4882a593Smuzhiyun``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect 1371*4882a593Smuzhiyunattacks against ".." resolution. Magic-links are also blocked. 1372*4882a593Smuzhiyun 1373*4882a593SmuzhiyunFinal-component flags 1374*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~ 1375*4882a593Smuzhiyun 1376*4882a593SmuzhiyunSome of these flags are only set when the final component is being 1377*4882a593Smuzhiyunconsidered. Others are only checked for when considering that final 1378*4882a593Smuzhiyuncomponent. 1379*4882a593Smuzhiyun 1380*4882a593Smuzhiyun``LOOKUP_AUTOMOUNT`` ensures that, if the final component is an automount 1381*4882a593Smuzhiyunpoint, then the mount is triggered. Some operations would trigger it 1382*4882a593Smuzhiyunanyway, but operations like ``stat()`` deliberately don't. ``statfs()`` 1383*4882a593Smuzhiyunneeds to trigger the mount but otherwise behaves a lot like ``stat()``, so 1384*4882a593Smuzhiyunit sets ``LOOKUP_AUTOMOUNT``, as does "``quotactl()``" and the handling of 1385*4882a593Smuzhiyun"``mount --bind``". 1386*4882a593Smuzhiyun 1387*4882a593Smuzhiyun``LOOKUP_FOLLOW`` has a similar function to ``LOOKUP_AUTOMOUNT`` but for 1388*4882a593Smuzhiyunsymlinks. Some system calls set or clear it implicitly, while 1389*4882a593Smuzhiyunothers have API flags such as ``AT_SYMLINK_FOLLOW`` and 1390*4882a593Smuzhiyun``UMOUNT_NOFOLLOW`` to control it. Its effect is similar to 1391*4882a593Smuzhiyun``WALK_GET`` that we already met, but it is used in a different way. 1392*4882a593Smuzhiyun 1393*4882a593Smuzhiyun``LOOKUP_DIRECTORY`` insists that the final component is a directory. 1394*4882a593SmuzhiyunVarious callers set this and it is also set when the final component 1395*4882a593Smuzhiyunis found to be followed by a slash. 1396*4882a593Smuzhiyun 1397*4882a593SmuzhiyunFinally ``LOOKUP_OPEN``, ``LOOKUP_CREATE``, ``LOOKUP_EXCL``, and 1398*4882a593Smuzhiyun``LOOKUP_RENAME_TARGET`` are not used directly by the VFS but are made 1399*4882a593Smuzhiyunavailable to the filesystem and particularly the ``->d_revalidate()`` 1400*4882a593Smuzhiyunmethod. A filesystem can choose not to bother revalidating too hard 1401*4882a593Smuzhiyunif it knows that it will be asked to open or create the file soon. 1402*4882a593SmuzhiyunThese flags were previously useful for ``->lookup()`` too but with the 1403*4882a593Smuzhiyunintroduction of ``->atomic_open()`` they are less relevant there. 1404*4882a593Smuzhiyun 1405*4882a593SmuzhiyunEnd of the road 1406*4882a593Smuzhiyun--------------- 1407*4882a593Smuzhiyun 1408*4882a593SmuzhiyunDespite its complexity, all this pathname lookup code appears to be 1409*4882a593Smuzhiyunin good shape - various parts are certainly easier to understand now 1410*4882a593Smuzhiyunthan even a couple of releases ago. But that doesn't mean it is 1411*4882a593Smuzhiyun"finished". As already mentioned, RCU-walk currently only follows 1412*4882a593Smuzhiyunsymlinks that are stored in the inode so, while it handles many ext4 1413*4882a593Smuzhiyunsymlinks, it doesn't help with NFS, XFS, or Btrfs. That support 1414*4882a593Smuzhiyunis not likely to be long delayed. 1415