xref: /OK3568_Linux_fs/kernel/Documentation/filesystems/path-lookup.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun===============
2*4882a593SmuzhiyunPathname lookup
3*4882a593Smuzhiyun===============
4*4882a593Smuzhiyun
5*4882a593SmuzhiyunThis write-up is based on three articles published at lwn.net:
6*4882a593Smuzhiyun
7*4882a593Smuzhiyun- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
8*4882a593Smuzhiyun- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
9*4882a593Smuzhiyun- <https://lwn.net/Articles/650786/> A walk among the symlinks
10*4882a593Smuzhiyun
11*4882a593SmuzhiyunWritten by Neil Brown with help from Al Viro and Jon Corbet.
12*4882a593SmuzhiyunIt has subsequently been updated to reflect changes in the kernel
13*4882a593Smuzhiyunincluding:
14*4882a593Smuzhiyun
15*4882a593Smuzhiyun- per-directory parallel name lookup.
16*4882a593Smuzhiyun- ``openat2()`` resolution restriction flags.
17*4882a593Smuzhiyun
18*4882a593SmuzhiyunIntroduction to pathname lookup
19*4882a593Smuzhiyun===============================
20*4882a593Smuzhiyun
21*4882a593SmuzhiyunThe most obvious aspect of pathname lookup, which very little
22*4882a593Smuzhiyunexploration is needed to discover, is that it is complex.  There are
23*4882a593Smuzhiyunmany rules, special cases, and implementation alternatives that all
24*4882a593Smuzhiyuncombine to confuse the unwary reader.  Computer science has long been
25*4882a593Smuzhiyunacquainted with such complexity and has tools to help manage it.  One
26*4882a593Smuzhiyuntool that we will make extensive use of is "divide and conquer".  For
27*4882a593Smuzhiyunthe early parts of the analysis we will divide off symlinks - leaving
28*4882a593Smuzhiyunthem until the final part.  Well before we get to symlinks we have
29*4882a593Smuzhiyunanother major division based on the VFS's approach to locking which
30*4882a593Smuzhiyunwill allow us to review "REF-walk" and "RCU-walk" separately.  But we
31*4882a593Smuzhiyunare getting ahead of ourselves.  There are some important low level
32*4882a593Smuzhiyundistinctions we need to clarify first.
33*4882a593Smuzhiyun
34*4882a593SmuzhiyunThere are two sorts of ...
35*4882a593Smuzhiyun--------------------------
36*4882a593Smuzhiyun
37*4882a593Smuzhiyun.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
38*4882a593Smuzhiyun
39*4882a593SmuzhiyunPathnames (sometimes "file names"), used to identify objects in the
40*4882a593Smuzhiyunfilesystem, will be familiar to most readers.  They contain two sorts
41*4882a593Smuzhiyunof elements: "slashes" that are sequences of one or more "``/``"
42*4882a593Smuzhiyuncharacters, and "components" that are sequences of one or more
43*4882a593Smuzhiyunnon-"``/``" characters.  These form two kinds of paths.  Those that
44*4882a593Smuzhiyunstart with slashes are "absolute" and start from the filesystem root.
45*4882a593SmuzhiyunThe others are "relative" and start from the current directory, or
46*4882a593Smuzhiyunfrom some other location specified by a file descriptor given to
47*4882a593Smuzhiyun"``*at()``" system calls such as `openat() <openat_>`_.
48*4882a593Smuzhiyun
49*4882a593Smuzhiyun.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
50*4882a593Smuzhiyun
51*4882a593SmuzhiyunIt is tempting to describe the second kind as starting with a
52*4882a593Smuzhiyuncomponent, but that isn't always accurate: a pathname can lack both
53*4882a593Smuzhiyunslashes and components, it can be empty, in other words.  This is
54*4882a593Smuzhiyungenerally forbidden in POSIX, but some of those "``*at()``" system calls
55*4882a593Smuzhiyunin Linux permit it when the ``AT_EMPTY_PATH`` flag is given.  For
56*4882a593Smuzhiyunexample, if you have an open file descriptor on an executable file you
57*4882a593Smuzhiyuncan execute it by calling `execveat() <execveat_>`_ passing
58*4882a593Smuzhiyunthe file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag.
59*4882a593Smuzhiyun
60*4882a593SmuzhiyunThese paths can be divided into two sections: the final component and
61*4882a593Smuzhiyuneverything else.  The "everything else" is the easy bit.  In all cases
62*4882a593Smuzhiyunit must identify a directory that already exists, otherwise an error
63*4882a593Smuzhiyunsuch as ``ENOENT`` or ``ENOTDIR`` will be reported.
64*4882a593Smuzhiyun
65*4882a593SmuzhiyunThe final component is not so simple.  Not only do different system
66*4882a593Smuzhiyuncalls interpret it quite differently (e.g. some create it, some do
67*4882a593Smuzhiyunnot), but it might not even exist: neither the empty pathname nor the
68*4882a593Smuzhiyunpathname that is just slashes have a final component.  If it does
69*4882a593Smuzhiyunexist, it could be "``.``" or "``..``" which are handled quite differently
70*4882a593Smuzhiyunfrom other components.
71*4882a593Smuzhiyun
72*4882a593Smuzhiyun.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
73*4882a593Smuzhiyun
74*4882a593SmuzhiyunIf a pathname ends with a slash, such as "``/tmp/foo/``" it might be
75*4882a593Smuzhiyuntempting to consider that to have an empty final component.  In many
76*4882a593Smuzhiyunways that would lead to correct results, but not always.  In
77*4882a593Smuzhiyunparticular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
78*4882a593Smuzhiyunby the final component, and they are required to work with pathnames
79*4882a593Smuzhiyunending in "``/``".  According to POSIX_:
80*4882a593Smuzhiyun
81*4882a593Smuzhiyun  A pathname that contains at least one non-<slash> character and
82*4882a593Smuzhiyun  that ends with one or more trailing <slash> characters shall not
83*4882a593Smuzhiyun  be resolved successfully unless the last pathname component before
84*4882a593Smuzhiyun  the trailing <slash> characters names an existing directory or a
85*4882a593Smuzhiyun  directory entry that is to be created for a directory immediately
86*4882a593Smuzhiyun  after the pathname is resolved.
87*4882a593Smuzhiyun
88*4882a593SmuzhiyunThe Linux pathname walking code (mostly in ``fs/namei.c``) deals with
89*4882a593Smuzhiyunall of these issues: breaking the path into components, handling the
90*4882a593Smuzhiyun"everything else" quite separately from the final component, and
91*4882a593Smuzhiyunchecking that the trailing slash is not used where it isn't
92*4882a593Smuzhiyunpermitted.  It also addresses the important issue of concurrent
93*4882a593Smuzhiyunaccess.
94*4882a593Smuzhiyun
95*4882a593SmuzhiyunWhile one process is looking up a pathname, another might be making
96*4882a593Smuzhiyunchanges that affect that lookup.  One fairly extreme case is that if
97*4882a593Smuzhiyun"a/b" were renamed to "a/c/b" while another process were looking up
98*4882a593Smuzhiyun"a/b/..", that process might successfully resolve on "a/c".
99*4882a593SmuzhiyunMost races are much more subtle, and a big part of the task of
100*4882a593Smuzhiyunpathname lookup is to prevent them from having damaging effects.  Many
101*4882a593Smuzhiyunof the possible races are seen most clearly in the context of the
102*4882a593Smuzhiyun"dcache" and an understanding of that is central to understanding
103*4882a593Smuzhiyunpathname lookup.
104*4882a593Smuzhiyun
105*4882a593SmuzhiyunMore than just a cache
106*4882a593Smuzhiyun----------------------
107*4882a593Smuzhiyun
108*4882a593SmuzhiyunThe "dcache" caches information about names in each filesystem to
109*4882a593Smuzhiyunmake them quickly available for lookup.  Each entry (known as a
110*4882a593Smuzhiyun"dentry") contains three significant fields: a component name, a
111*4882a593Smuzhiyunpointer to a parent dentry, and a pointer to the "inode" which
112*4882a593Smuzhiyuncontains further information about the object in that parent with
113*4882a593Smuzhiyunthe given name.  The inode pointer can be ``NULL`` indicating that the
114*4882a593Smuzhiyunname doesn't exist in the parent.  While there can be linkage in the
115*4882a593Smuzhiyundentry of a directory to the dentries of the children, that linkage is
116*4882a593Smuzhiyunnot used for pathname lookup, and so will not be considered here.
117*4882a593Smuzhiyun
118*4882a593SmuzhiyunThe dcache has a number of uses apart from accelerating lookup.  One
119*4882a593Smuzhiyunthat will be particularly relevant is that it is closely integrated
120*4882a593Smuzhiyunwith the mount table that records which filesystem is mounted where.
121*4882a593SmuzhiyunWhat the mount table actually stores is which dentry is mounted on top
122*4882a593Smuzhiyunof which other dentry.
123*4882a593Smuzhiyun
124*4882a593SmuzhiyunWhen considering the dcache, we have another of our "two types"
125*4882a593Smuzhiyundistinctions: there are two types of filesystems.
126*4882a593Smuzhiyun
127*4882a593SmuzhiyunSome filesystems ensure that the information in the dcache is always
128*4882a593Smuzhiyuncompletely accurate (though not necessarily complete).  This can allow
129*4882a593Smuzhiyunthe VFS to determine if a particular file does or doesn't exist
130*4882a593Smuzhiyunwithout checking with the filesystem, and means that the VFS can
131*4882a593Smuzhiyunprotect the filesystem against certain races and other problems.
132*4882a593SmuzhiyunThese are typically "local" filesystems such as ext3, XFS, and Btrfs.
133*4882a593Smuzhiyun
134*4882a593SmuzhiyunOther filesystems don't provide that guarantee because they cannot.
135*4882a593SmuzhiyunThese are typically filesystems that are shared across a network,
136*4882a593Smuzhiyunwhether remote filesystems like NFS and 9P, or cluster filesystems
137*4882a593Smuzhiyunlike ocfs2 or cephfs.  These filesystems allow the VFS to revalidate
138*4882a593Smuzhiyuncached information, and must provide their own protection against
139*4882a593Smuzhiyunawkward races.  The VFS can detect these filesystems by the
140*4882a593Smuzhiyun``DCACHE_OP_REVALIDATE`` flag being set in the dentry.
141*4882a593Smuzhiyun
142*4882a593SmuzhiyunREF-walk: simple concurrency management with refcounts and spinlocks
143*4882a593Smuzhiyun--------------------------------------------------------------------
144*4882a593Smuzhiyun
145*4882a593SmuzhiyunWith all of those divisions carefully classified, we can now start
146*4882a593Smuzhiyunlooking at the actual process of walking along a path.  In particular
147*4882a593Smuzhiyunwe will start with the handling of the "everything else" part of a
148*4882a593Smuzhiyunpathname, and focus on the "REF-walk" approach to concurrency
149*4882a593Smuzhiyunmanagement.  This code is found in the ``link_path_walk()`` function, if
150*4882a593Smuzhiyunyou ignore all the places that only run when "``LOOKUP_RCU``"
151*4882a593Smuzhiyun(indicating the use of RCU-walk) is set.
152*4882a593Smuzhiyun
153*4882a593Smuzhiyun.. _Meet the Lockers: https://lwn.net/Articles/453685/
154*4882a593Smuzhiyun
155*4882a593SmuzhiyunREF-walk is fairly heavy-handed with locks and reference counts.  Not
156*4882a593Smuzhiyunas heavy-handed as in the old "big kernel lock" days, but certainly not
157*4882a593Smuzhiyunafraid of taking a lock when one is needed.  It uses a variety of
158*4882a593Smuzhiyundifferent concurrency controls.  A background understanding of the
159*4882a593Smuzhiyunvarious primitives is assumed, or can be gleaned from elsewhere such
160*4882a593Smuzhiyunas in `Meet the Lockers`_.
161*4882a593Smuzhiyun
162*4882a593SmuzhiyunThe locking mechanisms used by REF-walk include:
163*4882a593Smuzhiyun
164*4882a593Smuzhiyundentry->d_lockref
165*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~
166*4882a593Smuzhiyun
167*4882a593SmuzhiyunThis uses the lockref primitive to provide both a spinlock and a
168*4882a593Smuzhiyunreference count.  The special-sauce of this primitive is that the
169*4882a593Smuzhiyunconceptual sequence "lock; inc_ref; unlock;" can often be performed
170*4882a593Smuzhiyunwith a single atomic memory operation.
171*4882a593Smuzhiyun
172*4882a593SmuzhiyunHolding a reference on a dentry ensures that the dentry won't suddenly
173*4882a593Smuzhiyunbe freed and used for something else, so the values in various fields
174*4882a593Smuzhiyunwill behave as expected.  It also protects the ``->d_inode`` reference
175*4882a593Smuzhiyunto the inode to some extent.
176*4882a593Smuzhiyun
177*4882a593SmuzhiyunThe association between a dentry and its inode is fairly permanent.
178*4882a593SmuzhiyunFor example, when a file is renamed, the dentry and inode move
179*4882a593Smuzhiyuntogether to the new location.  When a file is created the dentry will
180*4882a593Smuzhiyuninitially be negative (i.e. ``d_inode`` is ``NULL``), and will be assigned
181*4882a593Smuzhiyunto the new inode as part of the act of creation.
182*4882a593Smuzhiyun
183*4882a593SmuzhiyunWhen a file is deleted, this can be reflected in the cache either by
184*4882a593Smuzhiyunsetting ``d_inode`` to ``NULL``, or by removing it from the hash table
185*4882a593Smuzhiyun(described shortly) used to look up the name in the parent directory.
186*4882a593SmuzhiyunIf the dentry is still in use the second option is used as it is
187*4882a593Smuzhiyunperfectly legal to keep using an open file after it has been deleted
188*4882a593Smuzhiyunand having the dentry around helps.  If the dentry is not otherwise in
189*4882a593Smuzhiyunuse (i.e. if the refcount in ``d_lockref`` is one), only then will
190*4882a593Smuzhiyun``d_inode`` be set to ``NULL``.  Doing it this way is more efficient for a
191*4882a593Smuzhiyunvery common case.
192*4882a593Smuzhiyun
193*4882a593SmuzhiyunSo as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
194*4882a593Smuzhiyunvalue will never be changed.
195*4882a593Smuzhiyun
196*4882a593Smuzhiyundentry->d_lock
197*4882a593Smuzhiyun~~~~~~~~~~~~~~
198*4882a593Smuzhiyun
199*4882a593Smuzhiyun``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above.
200*4882a593SmuzhiyunFor our purposes, holding this lock protects against the dentry being
201*4882a593Smuzhiyunrenamed or unlinked.  In particular, its parent (``d_parent``), and its
202*4882a593Smuzhiyunname (``d_name``) cannot be changed, and it cannot be removed from the
203*4882a593Smuzhiyundentry hash table.
204*4882a593Smuzhiyun
205*4882a593SmuzhiyunWhen looking for a name in a directory, REF-walk takes ``d_lock`` on
206*4882a593Smuzhiyuneach candidate dentry that it finds in the hash table and then checks
207*4882a593Smuzhiyunthat the parent and name are correct.  So it doesn't lock the parent
208*4882a593Smuzhiyunwhile searching in the cache; it only locks children.
209*4882a593Smuzhiyun
210*4882a593SmuzhiyunWhen looking for the parent for a given name (to handle "``..``"),
211*4882a593SmuzhiyunREF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
212*4882a593Smuzhiyunbut it first tries a more lightweight approach.  As seen in
213*4882a593Smuzhiyun``dget_parent()``, if a reference can be claimed on the parent, and if
214*4882a593Smuzhiyunsubsequently ``d_parent`` can be seen to have not changed, then there is
215*4882a593Smuzhiyunno need to actually take the lock on the child.
216*4882a593Smuzhiyun
217*4882a593Smuzhiyunrename_lock
218*4882a593Smuzhiyun~~~~~~~~~~~
219*4882a593Smuzhiyun
220*4882a593SmuzhiyunLooking up a given name in a given directory involves computing a hash
221*4882a593Smuzhiyunfrom the two values (the name and the dentry of the directory),
222*4882a593Smuzhiyunaccessing that slot in a hash table, and searching the linked list
223*4882a593Smuzhiyunthat is found there.
224*4882a593Smuzhiyun
225*4882a593SmuzhiyunWhen a dentry is renamed, the name and the parent dentry can both
226*4882a593Smuzhiyunchange so the hash will almost certainly change too.  This would move the
227*4882a593Smuzhiyundentry to a different chain in the hash table.  If a filename search
228*4882a593Smuzhiyunhappened to be looking at a dentry that was moved in this way,
229*4882a593Smuzhiyunit might end up continuing the search down the wrong chain,
230*4882a593Smuzhiyunand so miss out on part of the correct chain.
231*4882a593Smuzhiyun
232*4882a593SmuzhiyunThe name-lookup process (``d_lookup()``) does *not* try to prevent this
233*4882a593Smuzhiyunfrom happening, but only to detect when it happens.
234*4882a593Smuzhiyun``rename_lock`` is a seqlock that is updated whenever any dentry is
235*4882a593Smuzhiyunrenamed.  If ``d_lookup`` finds that a rename happened while it
236*4882a593Smuzhiyununsuccessfully scanned a chain in the hash table, it simply tries
237*4882a593Smuzhiyunagain.
238*4882a593Smuzhiyun
239*4882a593Smuzhiyun``rename_lock`` is also used to detect and defend against potential attacks
240*4882a593Smuzhiyunagainst ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
241*4882a593Smuzhiyunthe parent directory is moved outside the root, bypassing the ``path_equal()``
242*4882a593Smuzhiyuncheck). If ``rename_lock`` is updated during the lookup and the path encounters
243*4882a593Smuzhiyuna "..", a potential attack occurred and ``handle_dots()`` will bail out with
244*4882a593Smuzhiyun``-EAGAIN``.
245*4882a593Smuzhiyun
246*4882a593Smuzhiyuninode->i_rwsem
247*4882a593Smuzhiyun~~~~~~~~~~~~~~
248*4882a593Smuzhiyun
249*4882a593Smuzhiyun``i_rwsem`` is a read/write semaphore that serializes all changes to a particular
250*4882a593Smuzhiyundirectory.  This ensures that, for example, an ``unlink()`` and a ``rename()``
251*4882a593Smuzhiyuncannot both happen at the same time.  It also keeps the directory
252*4882a593Smuzhiyunstable while the filesystem is asked to look up a name that is not
253*4882a593Smuzhiyuncurrently in the dcache or, optionally, when the list of entries in a
254*4882a593Smuzhiyundirectory is being retrieved with ``readdir()``.
255*4882a593Smuzhiyun
256*4882a593SmuzhiyunThis has a complementary role to that of ``d_lock``: ``i_rwsem`` on a
257*4882a593Smuzhiyundirectory protects all of the names in that directory, while ``d_lock``
258*4882a593Smuzhiyunon a name protects just one name in a directory.  Most changes to the
259*4882a593Smuzhiyundcache hold ``i_rwsem`` on the relevant directory inode and briefly take
260*4882a593Smuzhiyun``d_lock`` on one or more the dentries while the change happens.  One
261*4882a593Smuzhiyunexception is when idle dentries are removed from the dcache due to
262*4882a593Smuzhiyunmemory pressure.  This uses ``d_lock``, but ``i_rwsem`` plays no role.
263*4882a593Smuzhiyun
264*4882a593SmuzhiyunThe semaphore affects pathname lookup in two distinct ways.  Firstly it
265*4882a593Smuzhiyunprevents changes during lookup of a name in a directory.  ``walk_component()`` uses
266*4882a593Smuzhiyun``lookup_fast()`` first which, in turn, checks to see if the name is in the cache,
267*4882a593Smuzhiyunusing only ``d_lock`` locking.  If the name isn't found, then ``walk_component()``
268*4882a593Smuzhiyunfalls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that
269*4882a593Smuzhiyunthe name isn't in the cache, and then calls in to the filesystem to get a
270*4882a593Smuzhiyundefinitive answer.  A new dentry will be added to the cache regardless of
271*4882a593Smuzhiyunthe result.
272*4882a593Smuzhiyun
273*4882a593SmuzhiyunSecondly, when pathname lookup reaches the final component, it will
274*4882a593Smuzhiyunsometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so
275*4882a593Smuzhiyunthat the required exclusion can be achieved.  How path lookup chooses
276*4882a593Smuzhiyunto take, or not take, ``i_rwsem`` is one of the
277*4882a593Smuzhiyunissues addressed in a subsequent section.
278*4882a593Smuzhiyun
279*4882a593SmuzhiyunIf two threads attempt to look up the same name at the same time - a
280*4882a593Smuzhiyunname that is not yet in the dcache - the shared lock on ``i_rwsem`` will
281*4882a593Smuzhiyunnot prevent them both adding new dentries with the same name.  As this
282*4882a593Smuzhiyunwould result in confusion an extra level of interlocking is used,
283*4882a593Smuzhiyunbased around a secondary hash table (``in_lookup_hashtable``) and a
284*4882a593Smuzhiyunper-dentry flag bit (``DCACHE_PAR_LOOKUP``).
285*4882a593Smuzhiyun
286*4882a593SmuzhiyunTo add a new dentry to the cache while only holding a shared lock on
287*4882a593Smuzhiyun``i_rwsem``, a thread must call ``d_alloc_parallel()``.  This allocates a
288*4882a593Smuzhiyundentry, stores the required name and parent in it, checks if there
289*4882a593Smuzhiyunis already a matching dentry in the primary or secondary hash
290*4882a593Smuzhiyuntables, and if not, stores the newly allocated dentry in the secondary
291*4882a593Smuzhiyunhash table, with ``DCACHE_PAR_LOOKUP`` set.
292*4882a593Smuzhiyun
293*4882a593SmuzhiyunIf a matching dentry was found in the primary hash table then that is
294*4882a593Smuzhiyunreturned and the caller can know that it lost a race with some other
295*4882a593Smuzhiyunthread adding the entry.  If no matching dentry is found in either
296*4882a593Smuzhiyuncache, the newly allocated dentry is returned and the caller can
297*4882a593Smuzhiyundetect this from the presence of ``DCACHE_PAR_LOOKUP``.  In this case it
298*4882a593Smuzhiyunknows that it has won any race and now is responsible for asking the
299*4882a593Smuzhiyunfilesystem to perform the lookup and find the matching inode.  When
300*4882a593Smuzhiyunthe lookup is complete, it must call ``d_lookup_done()`` which clears
301*4882a593Smuzhiyunthe flag and does some other house keeping, including removing the
302*4882a593Smuzhiyundentry from the secondary hash table - it will normally have been
303*4882a593Smuzhiyunadded to the primary hash table already.  Note that a ``struct
304*4882a593Smuzhiyunwaitqueue_head`` is passed to ``d_alloc_parallel()``, and
305*4882a593Smuzhiyun``d_lookup_done()`` must be called while this ``waitqueue_head`` is still
306*4882a593Smuzhiyunin scope.
307*4882a593Smuzhiyun
308*4882a593SmuzhiyunIf a matching dentry is found in the secondary hash table,
309*4882a593Smuzhiyun``d_alloc_parallel()`` has a little more work to do. It first waits for
310*4882a593Smuzhiyun``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed
311*4882a593Smuzhiyunto the instance of ``d_alloc_parallel()`` that won the race and that
312*4882a593Smuzhiyunwill be woken by the call to ``d_lookup_done()``.  It then checks to see
313*4882a593Smuzhiyunif the dentry has now been added to the primary hash table.  If it
314*4882a593Smuzhiyunhas, the dentry is returned and the caller just sees that it lost any
315*4882a593Smuzhiyunrace.  If it hasn't been added to the primary hash table, the most
316*4882a593Smuzhiyunlikely explanation is that some other dentry was added instead using
317*4882a593Smuzhiyun``d_splice_alias()``.  In any case, ``d_alloc_parallel()`` repeats all the
318*4882a593Smuzhiyunlook ups from the start and will normally return something from the
319*4882a593Smuzhiyunprimary hash table.
320*4882a593Smuzhiyun
321*4882a593Smuzhiyunmnt->mnt_count
322*4882a593Smuzhiyun~~~~~~~~~~~~~~
323*4882a593Smuzhiyun
324*4882a593Smuzhiyun``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
325*4882a593SmuzhiyunPer-CPU here means that incrementing the count is cheap as it only
326*4882a593Smuzhiyunuses CPU-local memory, but checking if the count is zero is expensive as
327*4882a593Smuzhiyunit needs to check with every CPU.  Taking a ``mnt_count`` reference
328*4882a593Smuzhiyunprevents the mount structure from disappearing as the result of regular
329*4882a593Smuzhiyununmount operations, but does not prevent a "lazy" unmount.  So holding
330*4882a593Smuzhiyun``mnt_count`` doesn't ensure that the mount remains in the namespace and,
331*4882a593Smuzhiyunin particular, doesn't stabilize the link to the mounted-on dentry.  It
332*4882a593Smuzhiyundoes, however, ensure that the ``mount`` data structure remains coherent,
333*4882a593Smuzhiyunand it provides a reference to the root dentry of the mounted
334*4882a593Smuzhiyunfilesystem.  So a reference through ``->mnt_count`` provides a stable
335*4882a593Smuzhiyunreference to the mounted dentry, but not the mounted-on dentry.
336*4882a593Smuzhiyun
337*4882a593Smuzhiyunmount_lock
338*4882a593Smuzhiyun~~~~~~~~~~
339*4882a593Smuzhiyun
340*4882a593Smuzhiyun``mount_lock`` is a global seqlock, a bit like ``rename_lock``.  It can be used to
341*4882a593Smuzhiyuncheck if any change has been made to any mount points.
342*4882a593Smuzhiyun
343*4882a593SmuzhiyunWhile walking down the tree (away from the root) this lock is used when
344*4882a593Smuzhiyuncrossing a mount point to check that the crossing was safe.  That is,
345*4882a593Smuzhiyunthe value in the seqlock is read, then the code finds the mount that
346*4882a593Smuzhiyunis mounted on the current directory, if there is one, and increments
347*4882a593Smuzhiyunthe ``mnt_count``.  Finally the value in ``mount_lock`` is checked against
348*4882a593Smuzhiyunthe old value.  If there is no change, then the crossing was safe.  If there
349*4882a593Smuzhiyunwas a change, the ``mnt_count`` is decremented and the whole process is
350*4882a593Smuzhiyunretried.
351*4882a593Smuzhiyun
352*4882a593SmuzhiyunWhen walking up the tree (towards the root) by following a ".." link,
353*4882a593Smuzhiyuna little more care is needed.  In this case the seqlock (which
354*4882a593Smuzhiyuncontains both a counter and a spinlock) is fully locked to prevent
355*4882a593Smuzhiyunany changes to any mount points while stepping up.  This locking is
356*4882a593Smuzhiyunneeded to stabilize the link to the mounted-on dentry, which the
357*4882a593Smuzhiyunrefcount on the mount itself doesn't ensure.
358*4882a593Smuzhiyun
359*4882a593Smuzhiyun``mount_lock`` is also used to detect and defend against potential attacks
360*4882a593Smuzhiyunagainst ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
361*4882a593Smuzhiyunthe parent directory is moved outside the root, bypassing the ``path_equal()``
362*4882a593Smuzhiyuncheck). If ``mount_lock`` is updated during the lookup and the path encounters
363*4882a593Smuzhiyuna "..", a potential attack occurred and ``handle_dots()`` will bail out with
364*4882a593Smuzhiyun``-EAGAIN``.
365*4882a593Smuzhiyun
366*4882a593SmuzhiyunRCU
367*4882a593Smuzhiyun~~~
368*4882a593Smuzhiyun
369*4882a593SmuzhiyunFinally the global (but extremely lightweight) RCU read lock is held
370*4882a593Smuzhiyunfrom time to time to ensure certain data structures don't get freed
371*4882a593Smuzhiyununexpectedly.
372*4882a593Smuzhiyun
373*4882a593SmuzhiyunIn particular it is held while scanning chains in the dcache hash
374*4882a593Smuzhiyuntable, and the mount point hash table.
375*4882a593Smuzhiyun
376*4882a593SmuzhiyunBringing it together with ``struct nameidata``
377*4882a593Smuzhiyun----------------------------------------------
378*4882a593Smuzhiyun
379*4882a593Smuzhiyun.. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
380*4882a593Smuzhiyun
381*4882a593SmuzhiyunThroughout the process of walking a path, the current status is stored
382*4882a593Smuzhiyunin a ``struct nameidata``, "namei" being the traditional name - dating
383*4882a593Smuzhiyunall the way back to `First Edition Unix`_ - of the function that
384*4882a593Smuzhiyunconverts a "name" to an "inode".  ``struct nameidata`` contains (among
385*4882a593Smuzhiyunother fields):
386*4882a593Smuzhiyun
387*4882a593Smuzhiyun``struct path path``
388*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~
389*4882a593Smuzhiyun
390*4882a593SmuzhiyunA ``path`` contains a ``struct vfsmount`` (which is
391*4882a593Smuzhiyunembedded in a ``struct mount``) and a ``struct dentry``.  Together these
392*4882a593Smuzhiyunrecord the current status of the walk.  They start out referring to the
393*4882a593Smuzhiyunstarting point (the current working directory, the root directory, or some other
394*4882a593Smuzhiyundirectory identified by a file descriptor), and are updated on each
395*4882a593Smuzhiyunstep.  A reference through ``d_lockref`` and ``mnt_count`` is always
396*4882a593Smuzhiyunheld.
397*4882a593Smuzhiyun
398*4882a593Smuzhiyun``struct qstr last``
399*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~
400*4882a593Smuzhiyun
401*4882a593SmuzhiyunThis is a string together with a length (i.e. *not* ``nul`` terminated)
402*4882a593Smuzhiyunthat is the "next" component in the pathname.
403*4882a593Smuzhiyun
404*4882a593Smuzhiyun``int last_type``
405*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~
406*4882a593Smuzhiyun
407*4882a593SmuzhiyunThis is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT`` or ``LAST_DOTDOT``.
408*4882a593SmuzhiyunThe ``last`` field is only valid if the type is ``LAST_NORM``.
409*4882a593Smuzhiyun
410*4882a593Smuzhiyun``struct path root``
411*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~
412*4882a593Smuzhiyun
413*4882a593SmuzhiyunThis is used to hold a reference to the effective root of the
414*4882a593Smuzhiyunfilesystem.  Often that reference won't be needed, so this field is
415*4882a593Smuzhiyunonly assigned the first time it is used, or when a non-standard root
416*4882a593Smuzhiyunis requested.  Keeping a reference in the ``nameidata`` ensures that
417*4882a593Smuzhiyunonly one root is in effect for the entire path walk, even if it races
418*4882a593Smuzhiyunwith a ``chroot()`` system call.
419*4882a593Smuzhiyun
420*4882a593SmuzhiyunIt should be noted that in the case of ``LOOKUP_IN_ROOT`` or
421*4882a593Smuzhiyun``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
422*4882a593Smuzhiyunpassed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
423*4882a593Smuzhiyun
424*4882a593SmuzhiyunThe root is needed when either of two conditions holds: (1) either the
425*4882a593Smuzhiyunpathname or a symbolic link starts with a "'/'", or (2) a "``..``"
426*4882a593Smuzhiyuncomponent is being handled, since "``..``" from the root must always stay
427*4882a593Smuzhiyunat the root.  The value used is usually the current root directory of
428*4882a593Smuzhiyunthe calling process.  An alternate root can be provided as when
429*4882a593Smuzhiyun``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call
430*4882a593Smuzhiyun``mount_subtree()``.  In each case a pathname is being looked up in a very
431*4882a593Smuzhiyunspecific part of the filesystem, and the lookup must not be allowed to
432*4882a593Smuzhiyunescape that subtree.  It works a bit like a local ``chroot()``.
433*4882a593Smuzhiyun
434*4882a593SmuzhiyunIgnoring the handling of symbolic links, we can now describe the
435*4882a593Smuzhiyun"``link_path_walk()``" function, which handles the lookup of everything
436*4882a593Smuzhiyunexcept the final component as:
437*4882a593Smuzhiyun
438*4882a593Smuzhiyun   Given a path (``name``) and a nameidata structure (``nd``), check that the
439*4882a593Smuzhiyun   current directory has execute permission and then advance ``name``
440*4882a593Smuzhiyun   over one component while updating ``last_type`` and ``last``.  If that
441*4882a593Smuzhiyun   was the final component, then return, otherwise call
442*4882a593Smuzhiyun   ``walk_component()`` and repeat from the top.
443*4882a593Smuzhiyun
444*4882a593Smuzhiyun``walk_component()`` is even easier.  If the component is ``LAST_DOTS``,
445*4882a593Smuzhiyunit calls ``handle_dots()`` which does the necessary locking as already
446*4882a593Smuzhiyundescribed.  If it finds a ``LAST_NORM`` component it first calls
447*4882a593Smuzhiyun"``lookup_fast()``" which only looks in the dcache, but will ask the
448*4882a593Smuzhiyunfilesystem to revalidate the result if it is that sort of filesystem.
449*4882a593SmuzhiyunIf that doesn't get a good result, it calls "``lookup_slow()``" which
450*4882a593Smuzhiyuntakes ``i_rwsem``, rechecks the cache, and then asks the filesystem
451*4882a593Smuzhiyunto find a definitive answer.  Each of these will call
452*4882a593Smuzhiyun``follow_managed()`` (as described below) to handle any mount points.
453*4882a593Smuzhiyun
454*4882a593SmuzhiyunIn the absence of symbolic links, ``walk_component()`` creates a new
455*4882a593Smuzhiyun``struct path`` containing a counted reference to the new dentry and a
456*4882a593Smuzhiyunreference to the new ``vfsmount`` which is only counted if it is
457*4882a593Smuzhiyundifferent from the previous ``vfsmount``.  It then calls
458*4882a593Smuzhiyun``path_to_nameidata()`` to install the new ``struct path`` in the
459*4882a593Smuzhiyun``struct nameidata`` and drop the unneeded references.
460*4882a593Smuzhiyun
461*4882a593SmuzhiyunThis "hand-over-hand" sequencing of getting a reference to the new
462*4882a593Smuzhiyundentry before dropping the reference to the previous dentry may
463*4882a593Smuzhiyunseem obvious, but is worth pointing out so that we will recognize its
464*4882a593Smuzhiyunanalogue in the "RCU-walk" version.
465*4882a593Smuzhiyun
466*4882a593SmuzhiyunHandling the final component
467*4882a593Smuzhiyun----------------------------
468*4882a593Smuzhiyun
469*4882a593Smuzhiyun``link_path_walk()`` only walks as far as setting ``nd->last`` and
470*4882a593Smuzhiyun``nd->last_type`` to refer to the final component of the path.  It does
471*4882a593Smuzhiyunnot call ``walk_component()`` that last time.  Handling that final
472*4882a593Smuzhiyuncomponent remains for the caller to sort out. Those callers are
473*4882a593Smuzhiyun``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
474*4882a593Smuzhiyun``path_openat()`` each of which handles the differing requirements of
475*4882a593Smuzhiyundifferent system calls.
476*4882a593Smuzhiyun
477*4882a593Smuzhiyun``path_parentat()`` is clearly the simplest - it just wraps a little bit
478*4882a593Smuzhiyunof housekeeping around ``link_path_walk()`` and returns the parent
479*4882a593Smuzhiyundirectory and final component to the caller.  The caller will be either
480*4882a593Smuzhiyunaiming to create a name (via ``filename_create()``) or remove or rename
481*4882a593Smuzhiyuna name (in which case ``user_path_parent()`` is used).  They will use
482*4882a593Smuzhiyun``i_rwsem`` to exclude other changes while they validate and then
483*4882a593Smuzhiyunperform their operation.
484*4882a593Smuzhiyun
485*4882a593Smuzhiyun``path_lookupat()`` is nearly as simple - it is used when an existing
486*4882a593Smuzhiyunobject is wanted such as by ``stat()`` or ``chmod()``.  It essentially just
487*4882a593Smuzhiyuncalls ``walk_component()`` on the final component through a call to
488*4882a593Smuzhiyun``lookup_last()``.  ``path_lookupat()`` returns just the final dentry.
489*4882a593Smuzhiyun
490*4882a593Smuzhiyun``path_mountpoint()`` handles the special case of unmounting which must
491*4882a593Smuzhiyunnot try to revalidate the mounted filesystem.  It effectively
492*4882a593Smuzhiyuncontains, through a call to ``mountpoint_last()``, an alternate
493*4882a593Smuzhiyunimplementation of ``lookup_slow()`` which skips that step.  This is
494*4882a593Smuzhiyunimportant when unmounting a filesystem that is inaccessible, such as
495*4882a593Smuzhiyunone provided by a dead NFS server.
496*4882a593Smuzhiyun
497*4882a593SmuzhiyunFinally ``path_openat()`` is used for the ``open()`` system call; it
498*4882a593Smuzhiyuncontains, in support functions starting with "``do_last()``", all the
499*4882a593Smuzhiyuncomplexity needed to handle the different subtleties of O_CREAT (with
500*4882a593Smuzhiyunor without O_EXCL), final "``/``" characters, and trailing symbolic
501*4882a593Smuzhiyunlinks.  We will revisit this in the final part of this series, which
502*4882a593Smuzhiyunfocuses on those symbolic links.  "``do_last()``" will sometimes, but
503*4882a593Smuzhiyunnot always, take ``i_rwsem``, depending on what it finds.
504*4882a593Smuzhiyun
505*4882a593SmuzhiyunEach of these, or the functions which call them, need to be alert to
506*4882a593Smuzhiyunthe possibility that the final component is not ``LAST_NORM``.  If the
507*4882a593Smuzhiyungoal of the lookup is to create something, then any value for
508*4882a593Smuzhiyun``last_type`` other than ``LAST_NORM`` will result in an error.  For
509*4882a593Smuzhiyunexample if ``path_parentat()`` reports ``LAST_DOTDOT``, then the caller
510*4882a593Smuzhiyunwon't try to create that name.  They also check for trailing slashes
511*4882a593Smuzhiyunby testing ``last.name[last.len]``.  If there is any character beyond
512*4882a593Smuzhiyunthe final component, it must be a trailing slash.
513*4882a593Smuzhiyun
514*4882a593SmuzhiyunRevalidation and automounts
515*4882a593Smuzhiyun---------------------------
516*4882a593Smuzhiyun
517*4882a593SmuzhiyunApart from symbolic links, there are only two parts of the "REF-walk"
518*4882a593Smuzhiyunprocess not yet covered.  One is the handling of stale cache entries
519*4882a593Smuzhiyunand the other is automounts.
520*4882a593Smuzhiyun
521*4882a593SmuzhiyunOn filesystems that require it, the lookup routines will call the
522*4882a593Smuzhiyun``->d_revalidate()`` dentry method to ensure that the cached information
523*4882a593Smuzhiyunis current.  This will often confirm validity or update a few details
524*4882a593Smuzhiyunfrom a server.  In some cases it may find that there has been change
525*4882a593Smuzhiyunfurther up the path and that something that was thought to be valid
526*4882a593Smuzhiyunpreviously isn't really.  When this happens the lookup of the whole
527*4882a593Smuzhiyunpath is aborted and retried with the "``LOOKUP_REVAL``" flag set.  This
528*4882a593Smuzhiyunforces revalidation to be more thorough.  We will see more details of
529*4882a593Smuzhiyunthis retry process in the next article.
530*4882a593Smuzhiyun
531*4882a593SmuzhiyunAutomount points are locations in the filesystem where an attempt to
532*4882a593Smuzhiyunlookup a name can trigger changes to how that lookup should be
533*4882a593Smuzhiyunhandled, in particular by mounting a filesystem there.  These are
534*4882a593Smuzhiyuncovered in greater detail in autofs.txt in the Linux documentation
535*4882a593Smuzhiyuntree, but a few notes specifically related to path lookup are in order
536*4882a593Smuzhiyunhere.
537*4882a593Smuzhiyun
538*4882a593SmuzhiyunThe Linux VFS has a concept of "managed" dentries which is reflected
539*4882a593Smuzhiyunin function names such as "``follow_managed()``".  There are three
540*4882a593Smuzhiyunpotentially interesting things about these dentries corresponding
541*4882a593Smuzhiyunto three different flags that might be set in ``dentry->d_flags``:
542*4882a593Smuzhiyun
543*4882a593Smuzhiyun``DCACHE_MANAGE_TRANSIT``
544*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~
545*4882a593Smuzhiyun
546*4882a593SmuzhiyunIf this flag has been set, then the filesystem has requested that the
547*4882a593Smuzhiyun``d_manage()`` dentry operation be called before handling any possible
548*4882a593Smuzhiyunmount point.  This can perform two particular services:
549*4882a593Smuzhiyun
550*4882a593SmuzhiyunIt can block to avoid races.  If an automount point is being
551*4882a593Smuzhiyununmounted, the ``d_manage()`` function will usually wait for that
552*4882a593Smuzhiyunprocess to complete before letting the new lookup proceed and possibly
553*4882a593Smuzhiyuntrigger a new automount.
554*4882a593Smuzhiyun
555*4882a593SmuzhiyunIt can selectively allow only some processes to transit through a
556*4882a593Smuzhiyunmount point.  When a server process is managing automounts, it may
557*4882a593Smuzhiyunneed to access a directory without triggering normal automount
558*4882a593Smuzhiyunprocessing.  That server process can identify itself to the ``autofs``
559*4882a593Smuzhiyunfilesystem, which will then give it a special pass through
560*4882a593Smuzhiyun``d_manage()`` by returning ``-EISDIR``.
561*4882a593Smuzhiyun
562*4882a593Smuzhiyun``DCACHE_MOUNTED``
563*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~
564*4882a593Smuzhiyun
565*4882a593SmuzhiyunThis flag is set on every dentry that is mounted on.  As Linux
566*4882a593Smuzhiyunsupports multiple filesystem namespaces, it is possible that the
567*4882a593Smuzhiyundentry may not be mounted on in *this* namespace, just in some
568*4882a593Smuzhiyunother.  So this flag is seen as a hint, not a promise.
569*4882a593Smuzhiyun
570*4882a593SmuzhiyunIf this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
571*4882a593Smuzhiyun``lookup_mnt()`` is called to examine the mount hash table (honoring the
572*4882a593Smuzhiyun``mount_lock`` described earlier) and possibly return a new ``vfsmount``
573*4882a593Smuzhiyunand a new ``dentry`` (both with counted references).
574*4882a593Smuzhiyun
575*4882a593Smuzhiyun``DCACHE_NEED_AUTOMOUNT``
576*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~
577*4882a593Smuzhiyun
578*4882a593SmuzhiyunIf ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't
579*4882a593Smuzhiyunfind a mount point, then this flag causes the ``d_automount()`` dentry
580*4882a593Smuzhiyunoperation to be called.
581*4882a593Smuzhiyun
582*4882a593SmuzhiyunThe ``d_automount()`` operation can be arbitrarily complex and may
583*4882a593Smuzhiyuncommunicate with server processes etc. but it should ultimately either
584*4882a593Smuzhiyunreport that there was an error, that there was nothing to mount, or
585*4882a593Smuzhiyunshould provide an updated ``struct path`` with new ``dentry`` and ``vfsmount``.
586*4882a593Smuzhiyun
587*4882a593SmuzhiyunIn the latter case, ``finish_automount()`` will be called to safely
588*4882a593Smuzhiyuninstall the new mount point into the mount table.
589*4882a593Smuzhiyun
590*4882a593SmuzhiyunThere is no new locking of import here and it is important that no
591*4882a593Smuzhiyunlocks (only counted references) are held over this processing due to
592*4882a593Smuzhiyunthe very real possibility of extended delays.
593*4882a593SmuzhiyunThis will become more important next time when we examine RCU-walk
594*4882a593Smuzhiyunwhich is particularly sensitive to delays.
595*4882a593Smuzhiyun
596*4882a593SmuzhiyunRCU-walk - faster pathname lookup in Linux
597*4882a593Smuzhiyun==========================================
598*4882a593Smuzhiyun
599*4882a593SmuzhiyunRCU-walk is another algorithm for performing pathname lookup in Linux.
600*4882a593SmuzhiyunIt is in many ways similar to REF-walk and the two share quite a bit
601*4882a593Smuzhiyunof code.  The significant difference in RCU-walk is how it allows for
602*4882a593Smuzhiyunthe possibility of concurrent access.
603*4882a593Smuzhiyun
604*4882a593SmuzhiyunWe noted that REF-walk is complex because there are numerous details
605*4882a593Smuzhiyunand special cases.  RCU-walk reduces this complexity by simply
606*4882a593Smuzhiyunrefusing to handle a number of cases -- it instead falls back to
607*4882a593SmuzhiyunREF-walk.  The difficulty with RCU-walk comes from a different
608*4882a593Smuzhiyundirection: unfamiliarity.  The locking rules when depending on RCU are
609*4882a593Smuzhiyunquite different from traditional locking, so we will spend a little extra
610*4882a593Smuzhiyuntime when we come to those.
611*4882a593Smuzhiyun
612*4882a593SmuzhiyunClear demarcation of roles
613*4882a593Smuzhiyun--------------------------
614*4882a593Smuzhiyun
615*4882a593SmuzhiyunThe easiest way to manage concurrency is to forcibly stop any other
616*4882a593Smuzhiyunthread from changing the data structures that a given thread is
617*4882a593Smuzhiyunlooking at.  In cases where no other thread would even think of
618*4882a593Smuzhiyunchanging the data and lots of different threads want to read at the
619*4882a593Smuzhiyunsame time, this can be very costly.  Even when using locks that permit
620*4882a593Smuzhiyunmultiple concurrent readers, the simple act of updating the count of
621*4882a593Smuzhiyunthe number of current readers can impose an unwanted cost.  So the
622*4882a593Smuzhiyungoal when reading a shared data structure that no other process is
623*4882a593Smuzhiyunchanging is to avoid writing anything to memory at all.  Take no
624*4882a593Smuzhiyunlocks, increment no counts, leave no footprints.
625*4882a593Smuzhiyun
626*4882a593SmuzhiyunThe REF-walk mechanism already described certainly doesn't follow this
627*4882a593Smuzhiyunprinciple, but then it is really designed to work when there may well
628*4882a593Smuzhiyunbe other threads modifying the data.  RCU-walk, in contrast, is
629*4882a593Smuzhiyundesigned for the common situation where there are lots of frequent
630*4882a593Smuzhiyunreaders and only occasional writers.  This may not be common in all
631*4882a593Smuzhiyunparts of the filesystem tree, but in many parts it will be.  For the
632*4882a593Smuzhiyunother parts it is important that RCU-walk can quickly fall back to
633*4882a593Smuzhiyunusing REF-walk.
634*4882a593Smuzhiyun
635*4882a593SmuzhiyunPathname lookup always starts in RCU-walk mode but only remains there
636*4882a593Smuzhiyunas long as what it is looking for is in the cache and is stable.  It
637*4882a593Smuzhiyundances lightly down the cached filesystem image, leaving no footprints
638*4882a593Smuzhiyunand carefully watching where it is, to be sure it doesn't trip.  If it
639*4882a593Smuzhiyunnotices that something has changed or is changing, or if something
640*4882a593Smuzhiyunisn't in the cache, then it tries to stop gracefully and switch to
641*4882a593SmuzhiyunREF-walk.
642*4882a593Smuzhiyun
643*4882a593SmuzhiyunThis stopping requires getting a counted reference on the current
644*4882a593Smuzhiyun``vfsmount`` and ``dentry``, and ensuring that these are still valid -
645*4882a593Smuzhiyunthat a path walk with REF-walk would have found the same entries.
646*4882a593SmuzhiyunThis is an invariant that RCU-walk must guarantee.  It can only make
647*4882a593Smuzhiyundecisions, such as selecting the next step, that are decisions which
648*4882a593SmuzhiyunREF-walk could also have made if it were walking down the tree at the
649*4882a593Smuzhiyunsame time.  If the graceful stop succeeds, the rest of the path is
650*4882a593Smuzhiyunprocessed with the reliable, if slightly sluggish, REF-walk.  If
651*4882a593SmuzhiyunRCU-walk finds it cannot stop gracefully, it simply gives up and
652*4882a593Smuzhiyunrestarts from the top with REF-walk.
653*4882a593Smuzhiyun
654*4882a593SmuzhiyunThis pattern of "try RCU-walk, if that fails try REF-walk" can be
655*4882a593Smuzhiyunclearly seen in functions like ``filename_lookup()``,
656*4882a593Smuzhiyun``filename_parentat()``, ``filename_mountpoint()``,
657*4882a593Smuzhiyun``do_filp_open()``, and ``do_file_open_root()``.  These five
658*4882a593Smuzhiyuncorrespond roughly to the four ``path_*()`` functions we met earlier,
659*4882a593Smuzhiyuneach of which calls ``link_path_walk()``.  The ``path_*()`` functions are
660*4882a593Smuzhiyuncalled using different mode flags until a mode is found which works.
661*4882a593SmuzhiyunThey are first called with ``LOOKUP_RCU`` set to request "RCU-walk".  If
662*4882a593Smuzhiyunthat fails with the error ``ECHILD`` they are called again with no
663*4882a593Smuzhiyunspecial flag to request "REF-walk".  If either of those report the
664*4882a593Smuzhiyunerror ``ESTALE`` a final attempt is made with ``LOOKUP_REVAL`` set (and no
665*4882a593Smuzhiyun``LOOKUP_RCU``) to ensure that entries found in the cache are forcibly
666*4882a593Smuzhiyunrevalidated - normally entries are only revalidated if the filesystem
667*4882a593Smuzhiyundetermines that they are too old to trust.
668*4882a593Smuzhiyun
669*4882a593SmuzhiyunThe ``LOOKUP_RCU`` attempt may drop that flag internally and switch to
670*4882a593SmuzhiyunREF-walk, but will never then try to switch back to RCU-walk.  Places
671*4882a593Smuzhiyunthat trip up RCU-walk are much more likely to be near the leaves and
672*4882a593Smuzhiyunso it is very unlikely that there will be much, if any, benefit from
673*4882a593Smuzhiyunswitching back.
674*4882a593Smuzhiyun
675*4882a593SmuzhiyunRCU and seqlocks: fast and light
676*4882a593Smuzhiyun--------------------------------
677*4882a593Smuzhiyun
678*4882a593SmuzhiyunRCU is, unsurprisingly, critical to RCU-walk mode.  The
679*4882a593Smuzhiyun``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
680*4882a593Smuzhiyundown a path.  The particular guarantee it provides is that the key
681*4882a593Smuzhiyundata structures - dentries, inodes, super_blocks, and mounts - will
682*4882a593Smuzhiyunnot be freed while the lock is held.  They might be unlinked or
683*4882a593Smuzhiyuninvalidated in one way or another, but the memory will not be
684*4882a593Smuzhiyunrepurposed so values in various fields will still be meaningful.  This
685*4882a593Smuzhiyunis the only guarantee that RCU provides; everything else is done using
686*4882a593Smuzhiyunseqlocks.
687*4882a593Smuzhiyun
688*4882a593SmuzhiyunAs we saw above, REF-walk holds a counted reference to the current
689*4882a593Smuzhiyundentry and the current vfsmount, and does not release those references
690*4882a593Smuzhiyunbefore taking references to the "next" dentry or vfsmount.  It also
691*4882a593Smuzhiyunsometimes takes the ``d_lock`` spinlock.  These references and locks are
692*4882a593Smuzhiyuntaken to prevent certain changes from happening.  RCU-walk must not
693*4882a593Smuzhiyuntake those references or locks and so cannot prevent such changes.
694*4882a593SmuzhiyunInstead, it checks to see if a change has been made, and aborts or
695*4882a593Smuzhiyunretries if it has.
696*4882a593Smuzhiyun
697*4882a593SmuzhiyunTo preserve the invariant mentioned above (that RCU-walk may only make
698*4882a593Smuzhiyundecisions that REF-walk could have made), it must make the checks at
699*4882a593Smuzhiyunor near the same places that REF-walk holds the references.  So, when
700*4882a593SmuzhiyunREF-walk increments a reference count or takes a spinlock, RCU-walk
701*4882a593Smuzhiyunsamples the status of a seqlock using ``read_seqcount_begin()`` or a
702*4882a593Smuzhiyunsimilar function.  When REF-walk decrements the count or drops the
703*4882a593Smuzhiyunlock, RCU-walk checks if the sampled status is still valid using
704*4882a593Smuzhiyun``read_seqcount_retry()`` or similar.
705*4882a593Smuzhiyun
706*4882a593SmuzhiyunHowever, there is a little bit more to seqlocks than that.  If
707*4882a593SmuzhiyunRCU-walk accesses two different fields in a seqlock-protected
708*4882a593Smuzhiyunstructure, or accesses the same field twice, there is no a priori
709*4882a593Smuzhiyunguarantee of any consistency between those accesses.  When consistency
710*4882a593Smuzhiyunis needed - which it usually is - RCU-walk must take a copy and then
711*4882a593Smuzhiyunuse ``read_seqcount_retry()`` to validate that copy.
712*4882a593Smuzhiyun
713*4882a593Smuzhiyun``read_seqcount_retry()`` not only checks the sequence number, but also
714*4882a593Smuzhiyunimposes a memory barrier so that no memory-read instruction from
715*4882a593Smuzhiyun*before* the call can be delayed until *after* the call, either by the
716*4882a593SmuzhiyunCPU or by the compiler.  A simple example of this can be seen in
717*4882a593Smuzhiyun``slow_dentry_cmp()`` which, for filesystems which do not use simple
718*4882a593Smuzhiyunbyte-wise name equality, calls into the filesystem to compare a name
719*4882a593Smuzhiyunagainst a dentry.  The length and name pointer are copied into local
720*4882a593Smuzhiyunvariables, then ``read_seqcount_retry()`` is called to confirm the two
721*4882a593Smuzhiyunare consistent, and only then is ``->d_compare()`` called.  When
722*4882a593Smuzhiyunstandard filename comparison is used, ``dentry_cmp()`` is called
723*4882a593Smuzhiyuninstead.  Notably it does *not* use ``read_seqcount_retry()``, but
724*4882a593Smuzhiyuninstead has a large comment explaining why the consistency guarantee
725*4882a593Smuzhiyunisn't necessary.  A subsequent ``read_seqcount_retry()`` will be
726*4882a593Smuzhiyunsufficient to catch any problem that could occur at this point.
727*4882a593Smuzhiyun
728*4882a593SmuzhiyunWith that little refresher on seqlocks out of the way we can look at
729*4882a593Smuzhiyunthe bigger picture of how RCU-walk uses seqlocks.
730*4882a593Smuzhiyun
731*4882a593Smuzhiyun``mount_lock`` and ``nd->m_seq``
732*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
733*4882a593Smuzhiyun
734*4882a593SmuzhiyunWe already met the ``mount_lock`` seqlock when REF-walk used it to
735*4882a593Smuzhiyunensure that crossing a mount point is performed safely.  RCU-walk uses
736*4882a593Smuzhiyunit for that too, but for quite a bit more.
737*4882a593Smuzhiyun
738*4882a593SmuzhiyunInstead of taking a counted reference to each ``vfsmount`` as it
739*4882a593Smuzhiyundescends the tree, RCU-walk samples the state of ``mount_lock`` at the
740*4882a593Smuzhiyunstart of the walk and stores this initial sequence number in the
741*4882a593Smuzhiyun``struct nameidata`` in the ``m_seq`` field.  This one lock and one
742*4882a593Smuzhiyunsequence number are used to validate all accesses to all ``vfsmounts``,
743*4882a593Smuzhiyunand all mount point crossings.  As changes to the mount table are
744*4882a593Smuzhiyunrelatively rare, it is reasonable to fall back on REF-walk any time
745*4882a593Smuzhiyunthat any "mount" or "unmount" happens.
746*4882a593Smuzhiyun
747*4882a593Smuzhiyun``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
748*4882a593Smuzhiyunsequence, whether switching to REF-walk for the rest of the path or
749*4882a593Smuzhiyunwhen the end of the path is reached.  It is also checked when stepping
750*4882a593Smuzhiyundown over a mount point (in ``__follow_mount_rcu()``) or up (in
751*4882a593Smuzhiyun``follow_dotdot_rcu()``).  If it is ever found to have changed, the
752*4882a593Smuzhiyunwhole RCU-walk sequence is aborted and the path is processed again by
753*4882a593SmuzhiyunREF-walk.
754*4882a593Smuzhiyun
755*4882a593SmuzhiyunIf RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
756*4882a593Smuzhiyunthat, had REF-walk taken counted references on each vfsmount, the
757*4882a593Smuzhiyunresults would have been the same.  This ensures the invariant holds,
758*4882a593Smuzhiyunat least for vfsmount structures.
759*4882a593Smuzhiyun
760*4882a593Smuzhiyun``dentry->d_seq`` and ``nd->seq``
761*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
762*4882a593Smuzhiyun
763*4882a593SmuzhiyunIn place of taking a count or lock on ``d_reflock``, RCU-walk samples
764*4882a593Smuzhiyunthe per-dentry ``d_seq`` seqlock, and stores the sequence number in the
765*4882a593Smuzhiyun``seq`` field of the nameidata structure, so ``nd->seq`` should always be
766*4882a593Smuzhiyunthe current sequence number of ``nd->dentry``.  This number needs to be
767*4882a593Smuzhiyunrevalidated after copying, and before using, the name, parent, or
768*4882a593Smuzhiyuninode of the dentry.
769*4882a593Smuzhiyun
770*4882a593SmuzhiyunThe handling of the name we have already looked at, and the parent is
771*4882a593Smuzhiyunonly accessed in ``follow_dotdot_rcu()`` which fairly trivially follows
772*4882a593Smuzhiyunthe required pattern, though it does so for three different cases.
773*4882a593Smuzhiyun
774*4882a593SmuzhiyunWhen not at a mount point, ``d_parent`` is followed and its ``d_seq`` is
775*4882a593Smuzhiyuncollected.  When we are at a mount point, we instead follow the
776*4882a593Smuzhiyun``mnt->mnt_mountpoint`` link to get a new dentry and collect its
777*4882a593Smuzhiyun``d_seq``.  Then, after finally finding a ``d_parent`` to follow, we must
778*4882a593Smuzhiyuncheck if we have landed on a mount point and, if so, must find that
779*4882a593Smuzhiyunmount point and follow the ``mnt->mnt_root`` link.  This would imply a
780*4882a593Smuzhiyunsomewhat unusual, but certainly possible, circumstance where the
781*4882a593Smuzhiyunstarting point of the path lookup was in part of the filesystem that
782*4882a593Smuzhiyunwas mounted on, and so not visible from the root.
783*4882a593Smuzhiyun
784*4882a593SmuzhiyunThe inode pointer, stored in ``->d_inode``, is a little more
785*4882a593Smuzhiyuninteresting.  The inode will always need to be accessed at least
786*4882a593Smuzhiyuntwice, once to determine if it is NULL and once to verify access
787*4882a593Smuzhiyunpermissions.  Symlink handling requires a validated inode pointer too.
788*4882a593SmuzhiyunRather than revalidating on each access, a copy is made on the first
789*4882a593Smuzhiyunaccess and it is stored in the ``inode`` field of ``nameidata`` from where
790*4882a593Smuzhiyunit can be safely accessed without further validation.
791*4882a593Smuzhiyun
792*4882a593Smuzhiyun``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
793*4882a593Smuzhiyun``lookup_slow()`` being too slow and requiring locks.  It is in
794*4882a593Smuzhiyun``lookup_fast()`` that we find the important "hand over hand" tracking
795*4882a593Smuzhiyunof the current dentry.
796*4882a593Smuzhiyun
797*4882a593SmuzhiyunThe current ``dentry`` and current ``seq`` number are passed to
798*4882a593Smuzhiyun``__d_lookup_rcu()`` which, on success, returns a new ``dentry`` and a
799*4882a593Smuzhiyunnew ``seq`` number.  ``lookup_fast()`` then copies the inode pointer and
800*4882a593Smuzhiyunrevalidates the new ``seq`` number.  It then validates the old ``dentry``
801*4882a593Smuzhiyunwith the old ``seq`` number one last time and only then continues.  This
802*4882a593Smuzhiyunprocess of getting the ``seq`` number of the new dentry and then
803*4882a593Smuzhiyunchecking the ``seq`` number of the old exactly mirrors the process of
804*4882a593Smuzhiyungetting a counted reference to the new dentry before dropping that for
805*4882a593Smuzhiyunthe old dentry which we saw in REF-walk.
806*4882a593Smuzhiyun
807*4882a593SmuzhiyunNo ``inode->i_rwsem`` or even ``rename_lock``
808*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
809*4882a593Smuzhiyun
810*4882a593SmuzhiyunA semaphore is a fairly heavyweight lock that can only be taken when it is
811*4882a593Smuzhiyunpermissible to sleep.  As ``rcu_read_lock()`` forbids sleeping,
812*4882a593Smuzhiyun``inode->i_rwsem`` plays no role in RCU-walk.  If some other thread does
813*4882a593Smuzhiyuntake ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
814*4882a593Smuzhiyunto notice, the result will be either that RCU-walk fails to find the
815*4882a593Smuzhiyundentry that it is looking for, or it will find a dentry which
816*4882a593Smuzhiyun``read_seqretry()`` won't validate.  In either case it will drop down to
817*4882a593SmuzhiyunREF-walk mode which can take whatever locks are needed.
818*4882a593Smuzhiyun
819*4882a593SmuzhiyunThough ``rename_lock`` could be used by RCU-walk as it doesn't require
820*4882a593Smuzhiyunany sleeping, RCU-walk doesn't bother.  REF-walk uses ``rename_lock`` to
821*4882a593Smuzhiyunprotect against the possibility of hash chains in the dcache changing
822*4882a593Smuzhiyunwhile they are being searched.  This can result in failing to find
823*4882a593Smuzhiyunsomething that actually is there.  When RCU-walk fails to find
824*4882a593Smuzhiyunsomething in the dentry cache, whether it is really there or not, it
825*4882a593Smuzhiyunalready drops down to REF-walk and tries again with appropriate
826*4882a593Smuzhiyunlocking.  This neatly handles all cases, so adding extra checks on
827*4882a593Smuzhiyunrename_lock would bring no significant value.
828*4882a593Smuzhiyun
829*4882a593Smuzhiyun``unlazy walk()`` and ``complete_walk()``
830*4882a593Smuzhiyun-----------------------------------------
831*4882a593Smuzhiyun
832*4882a593SmuzhiyunThat "dropping down to REF-walk" typically involves a call to
833*4882a593Smuzhiyun``unlazy_walk()``, so named because "RCU-walk" is also sometimes
834*4882a593Smuzhiyunreferred to as "lazy walk".  ``unlazy_walk()`` is called when
835*4882a593Smuzhiyunfollowing the path down to the current vfsmount/dentry pair seems to
836*4882a593Smuzhiyunhave proceeded successfully, but the next step is problematic.  This
837*4882a593Smuzhiyuncan happen if the next name cannot be found in the dcache, if
838*4882a593Smuzhiyunpermission checking or name revalidation couldn't be achieved while
839*4882a593Smuzhiyunthe ``rcu_read_lock()`` is held (which forbids sleeping), if an
840*4882a593Smuzhiyunautomount point is found, or in a couple of cases involving symlinks.
841*4882a593SmuzhiyunIt is also called from ``complete_walk()`` when the lookup has reached
842*4882a593Smuzhiyunthe final component, or the very end of the path, depending on which
843*4882a593Smuzhiyunparticular flavor of lookup is used.
844*4882a593Smuzhiyun
845*4882a593SmuzhiyunOther reasons for dropping out of RCU-walk that do not trigger a call
846*4882a593Smuzhiyunto ``unlazy_walk()`` are when some inconsistency is found that cannot be
847*4882a593Smuzhiyunhandled immediately, such as ``mount_lock`` or one of the ``d_seq``
848*4882a593Smuzhiyunseqlocks reporting a change.  In these cases the relevant function
849*4882a593Smuzhiyunwill return ``-ECHILD`` which will percolate up until it triggers a new
850*4882a593Smuzhiyunattempt from the top using REF-walk.
851*4882a593Smuzhiyun
852*4882a593SmuzhiyunFor those cases where ``unlazy_walk()`` is an option, it essentially
853*4882a593Smuzhiyuntakes a reference on each of the pointers that it holds (vfsmount,
854*4882a593Smuzhiyundentry, and possibly some symbolic links) and then verifies that the
855*4882a593Smuzhiyunrelevant seqlocks have not been changed.  If there have been changes,
856*4882a593Smuzhiyunit, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
857*4882a593Smuzhiyunhas been a success and the lookup process continues.
858*4882a593Smuzhiyun
859*4882a593SmuzhiyunTaking a reference on those pointers is not quite as simple as just
860*4882a593Smuzhiyunincrementing a counter.  That works to take a second reference if you
861*4882a593Smuzhiyunalready have one (often indirectly through another object), but it
862*4882a593Smuzhiyunisn't sufficient if you don't actually have a counted reference at
863*4882a593Smuzhiyunall.  For ``dentry->d_lockref``, it is safe to increment the reference
864*4882a593Smuzhiyuncounter to get a reference unless it has been explicitly marked as
865*4882a593Smuzhiyun"dead" which involves setting the counter to ``-128``.
866*4882a593Smuzhiyun``lockref_get_not_dead()`` achieves this.
867*4882a593Smuzhiyun
868*4882a593SmuzhiyunFor ``mnt->mnt_count`` it is safe to take a reference as long as
869*4882a593Smuzhiyun``mount_lock`` is then used to validate the reference.  If that
870*4882a593Smuzhiyunvalidation fails, it may *not* be safe to just drop that reference in
871*4882a593Smuzhiyunthe standard way of calling ``mnt_put()`` - an unmount may have
872*4882a593Smuzhiyunprogressed too far.  So the code in ``legitimize_mnt()``, when it
873*4882a593Smuzhiyunfinds that the reference it got might not be safe, checks the
874*4882a593Smuzhiyun``MNT_SYNC_UMOUNT`` flag to determine if a simple ``mnt_put()`` is
875*4882a593Smuzhiyuncorrect, or if it should just decrement the count and pretend none of
876*4882a593Smuzhiyunthis ever happened.
877*4882a593Smuzhiyun
878*4882a593SmuzhiyunTaking care in filesystems
879*4882a593Smuzhiyun--------------------------
880*4882a593Smuzhiyun
881*4882a593SmuzhiyunRCU-walk depends almost entirely on cached information and often will
882*4882a593Smuzhiyunnot call into the filesystem at all.  However there are two places,
883*4882a593Smuzhiyunbesides the already-mentioned component-name comparison, where the
884*4882a593Smuzhiyunfile system might be included in RCU-walk, and it must know to be
885*4882a593Smuzhiyuncareful.
886*4882a593Smuzhiyun
887*4882a593SmuzhiyunIf the filesystem has non-standard permission-checking requirements -
888*4882a593Smuzhiyunsuch as a networked filesystem which may need to check with the server
889*4882a593Smuzhiyun- the ``i_op->permission`` interface might be called during RCU-walk.
890*4882a593SmuzhiyunIn this case an extra "``MAY_NOT_BLOCK``" flag is passed so that it
891*4882a593Smuzhiyunknows not to sleep, but to return ``-ECHILD`` if it cannot complete
892*4882a593Smuzhiyunpromptly.  ``i_op->permission`` is given the inode pointer, not the
893*4882a593Smuzhiyundentry, so it doesn't need to worry about further consistency checks.
894*4882a593SmuzhiyunHowever if it accesses any other filesystem data structures, it must
895*4882a593Smuzhiyunensure they are safe to be accessed with only the ``rcu_read_lock()``
896*4882a593Smuzhiyunheld.  This typically means they must be freed using ``kfree_rcu()`` or
897*4882a593Smuzhiyunsimilar.
898*4882a593Smuzhiyun
899*4882a593Smuzhiyun.. _READ_ONCE: https://lwn.net/Articles/624126/
900*4882a593Smuzhiyun
901*4882a593SmuzhiyunIf the filesystem may need to revalidate dcache entries, then
902*4882a593Smuzhiyun``d_op->d_revalidate`` may be called in RCU-walk too.  This interface
903*4882a593Smuzhiyun*is* passed the dentry but does not have access to the ``inode`` or the
904*4882a593Smuzhiyun``seq`` number from the ``nameidata``, so it needs to be extra careful
905*4882a593Smuzhiyunwhen accessing fields in the dentry.  This "extra care" typically
906*4882a593Smuzhiyuninvolves using  `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the
907*4882a593Smuzhiyunresult is not NULL before using it.  This pattern can be seen in
908*4882a593Smuzhiyun``nfs_lookup_revalidate()``.
909*4882a593Smuzhiyun
910*4882a593SmuzhiyunA pair of patterns
911*4882a593Smuzhiyun------------------
912*4882a593Smuzhiyun
913*4882a593SmuzhiyunIn various places in the details of REF-walk and RCU-walk, and also in
914*4882a593Smuzhiyunthe big picture, there are a couple of related patterns that are worth
915*4882a593Smuzhiyunbeing aware of.
916*4882a593Smuzhiyun
917*4882a593SmuzhiyunThe first is "try quickly and check, if that fails try slowly".  We
918*4882a593Smuzhiyuncan see that in the high-level approach of first trying RCU-walk and
919*4882a593Smuzhiyunthen trying REF-walk, and in places where ``unlazy_walk()`` is used to
920*4882a593Smuzhiyunswitch to REF-walk for the rest of the path.  We also saw it earlier
921*4882a593Smuzhiyunin ``dget_parent()`` when following a "``..``" link.  It tries a quick way
922*4882a593Smuzhiyunto get a reference, then falls back to taking locks if needed.
923*4882a593Smuzhiyun
924*4882a593SmuzhiyunThe second pattern is "try quickly and check, if that fails try
925*4882a593Smuzhiyunagain - repeatedly".  This is seen with the use of ``rename_lock`` and
926*4882a593Smuzhiyun``mount_lock`` in REF-walk.  RCU-walk doesn't make use of this pattern -
927*4882a593Smuzhiyunif anything goes wrong it is much safer to just abort and try a more
928*4882a593Smuzhiyunsedate approach.
929*4882a593Smuzhiyun
930*4882a593SmuzhiyunThe emphasis here is "try quickly and check".  It should probably be
931*4882a593Smuzhiyun"try quickly *and carefully*, then check".  The fact that checking is
932*4882a593Smuzhiyunneeded is a reminder that the system is dynamic and only a limited
933*4882a593Smuzhiyunnumber of things are safe at all.  The most likely cause of errors in
934*4882a593Smuzhiyunthis whole process is assuming something is safe when in reality it
935*4882a593Smuzhiyunisn't.  Careful consideration of what exactly guarantees the safety of
936*4882a593Smuzhiyuneach access is sometimes necessary.
937*4882a593Smuzhiyun
938*4882a593SmuzhiyunA walk among the symlinks
939*4882a593Smuzhiyun=========================
940*4882a593Smuzhiyun
941*4882a593SmuzhiyunThere are several basic issues that we will examine to understand the
942*4882a593Smuzhiyunhandling of symbolic links:  the symlink stack, together with cache
943*4882a593Smuzhiyunlifetimes, will help us understand the overall recursive handling of
944*4882a593Smuzhiyunsymlinks and lead to the special care needed for the final component.
945*4882a593SmuzhiyunThen a consideration of access-time updates and summary of the various
946*4882a593Smuzhiyunflags controlling lookup will finish the story.
947*4882a593Smuzhiyun
948*4882a593SmuzhiyunThe symlink stack
949*4882a593Smuzhiyun-----------------
950*4882a593Smuzhiyun
951*4882a593SmuzhiyunThere are only two sorts of filesystem objects that can usefully
952*4882a593Smuzhiyunappear in a path prior to the final component: directories and symlinks.
953*4882a593SmuzhiyunHandling directories is quite straightforward: the new directory
954*4882a593Smuzhiyunsimply becomes the starting point at which to interpret the next
955*4882a593Smuzhiyuncomponent on the path.  Handling symbolic links requires a bit more
956*4882a593Smuzhiyunwork.
957*4882a593Smuzhiyun
958*4882a593SmuzhiyunConceptually, symbolic links could be handled by editing the path.  If
959*4882a593Smuzhiyuna component name refers to a symbolic link, then that component is
960*4882a593Smuzhiyunreplaced by the body of the link and, if that body starts with a '/',
961*4882a593Smuzhiyunthen all preceding parts of the path are discarded.  This is what the
962*4882a593Smuzhiyun"``readlink -f``" command does, though it also edits out "``.``" and
963*4882a593Smuzhiyun"``..``" components.
964*4882a593Smuzhiyun
965*4882a593SmuzhiyunDirectly editing the path string is not really necessary when looking
966*4882a593Smuzhiyunup a path, and discarding early components is pointless as they aren't
967*4882a593Smuzhiyunlooked at anyway.  Keeping track of all remaining components is
968*4882a593Smuzhiyunimportant, but they can of course be kept separately; there is no need
969*4882a593Smuzhiyunto concatenate them.  As one symlink may easily refer to another,
970*4882a593Smuzhiyunwhich in turn can refer to a third, we may need to keep the remaining
971*4882a593Smuzhiyuncomponents of several paths, each to be processed when the preceding
972*4882a593Smuzhiyunones are completed.  These path remnants are kept on a stack of
973*4882a593Smuzhiyunlimited size.
974*4882a593Smuzhiyun
975*4882a593SmuzhiyunThere are two reasons for placing limits on how many symlinks can
976*4882a593Smuzhiyunoccur in a single path lookup.  The most obvious is to avoid loops.
977*4882a593SmuzhiyunIf a symlink referred to itself either directly or through
978*4882a593Smuzhiyunintermediaries, then following the symlink can never complete
979*4882a593Smuzhiyunsuccessfully - the error ``ELOOP`` must be returned.  Loops can be
980*4882a593Smuzhiyundetected without imposing limits, but limits are the simplest solution
981*4882a593Smuzhiyunand, given the second reason for restriction, quite sufficient.
982*4882a593Smuzhiyun
983*4882a593Smuzhiyun.. _outlined recently: http://thread.gmane.org/gmane.linux.kernel/1934390/focus=1934550
984*4882a593Smuzhiyun
985*4882a593SmuzhiyunThe second reason was `outlined recently`_ by Linus:
986*4882a593Smuzhiyun
987*4882a593Smuzhiyun   Because it's a latency and DoS issue too. We need to react well to
988*4882a593Smuzhiyun   true loops, but also to "very deep" non-loops. It's not about memory
989*4882a593Smuzhiyun   use, it's about users triggering unreasonable CPU resources.
990*4882a593Smuzhiyun
991*4882a593SmuzhiyunLinux imposes a limit on the length of any pathname: ``PATH_MAX``, which
992*4882a593Smuzhiyunis 4096.  There are a number of reasons for this limit; not letting the
993*4882a593Smuzhiyunkernel spend too much time on just one path is one of them.  With
994*4882a593Smuzhiyunsymbolic links you can effectively generate much longer paths so some
995*4882a593Smuzhiyunsort of limit is needed for the same reason.  Linux imposes a limit of
996*4882a593Smuzhiyunat most 40 symlinks in any one path lookup.  It previously imposed a
997*4882a593Smuzhiyunfurther limit of eight on the maximum depth of recursion, but that was
998*4882a593Smuzhiyunraised to 40 when a separate stack was implemented, so there is now
999*4882a593Smuzhiyunjust the one limit.
1000*4882a593Smuzhiyun
1001*4882a593SmuzhiyunThe ``nameidata`` structure that we met in an earlier article contains a
1002*4882a593Smuzhiyunsmall stack that can be used to store the remaining part of up to two
1003*4882a593Smuzhiyunsymlinks.  In many cases this will be sufficient.  If it isn't, a
1004*4882a593Smuzhiyunseparate stack is allocated with room for 40 symlinks.  Pathname
1005*4882a593Smuzhiyunlookup will never exceed that stack as, once the 40th symlink is
1006*4882a593Smuzhiyundetected, an error is returned.
1007*4882a593Smuzhiyun
1008*4882a593SmuzhiyunIt might seem that the name remnants are all that needs to be stored on
1009*4882a593Smuzhiyunthis stack, but we need a bit more.  To see that, we need to move on to
1010*4882a593Smuzhiyuncache lifetimes.
1011*4882a593Smuzhiyun
1012*4882a593SmuzhiyunStorage and lifetime of cached symlinks
1013*4882a593Smuzhiyun---------------------------------------
1014*4882a593Smuzhiyun
1015*4882a593SmuzhiyunLike other filesystem resources, such as inodes and directory
1016*4882a593Smuzhiyunentries, symlinks are cached by Linux to avoid repeated costly access
1017*4882a593Smuzhiyunto external storage.  It is particularly important for RCU-walk to be
1018*4882a593Smuzhiyunable to find and temporarily hold onto these cached entries, so that
1019*4882a593Smuzhiyunit doesn't need to drop down into REF-walk.
1020*4882a593Smuzhiyun
1021*4882a593Smuzhiyun.. _object-oriented design pattern: https://lwn.net/Articles/446317/
1022*4882a593Smuzhiyun
1023*4882a593SmuzhiyunWhile each filesystem is free to make its own choice, symlinks are
1024*4882a593Smuzhiyuntypically stored in one of two places.  Short symlinks are often
1025*4882a593Smuzhiyunstored directly in the inode.  When a filesystem allocates a ``struct
1026*4882a593Smuzhiyuninode`` it typically allocates extra space to store private data (a
1027*4882a593Smuzhiyuncommon `object-oriented design pattern`_ in the kernel).  This will
1028*4882a593Smuzhiyunsometimes include space for a symlink.  The other common location is
1029*4882a593Smuzhiyunin the page cache, which normally stores the content of files.  The
1030*4882a593Smuzhiyunpathname in a symlink can be seen as the content of that symlink and
1031*4882a593Smuzhiyuncan easily be stored in the page cache just like file content.
1032*4882a593Smuzhiyun
1033*4882a593SmuzhiyunWhen neither of these is suitable, the next most likely scenario is
1034*4882a593Smuzhiyunthat the filesystem will allocate some temporary memory and copy or
1035*4882a593Smuzhiyunconstruct the symlink content into that memory whenever it is needed.
1036*4882a593Smuzhiyun
1037*4882a593SmuzhiyunWhen the symlink is stored in the inode, it has the same lifetime as
1038*4882a593Smuzhiyunthe inode which, itself, is protected by RCU or by a counted reference
1039*4882a593Smuzhiyunon the dentry.  This means that the mechanisms that pathname lookup
1040*4882a593Smuzhiyunuses to access the dcache and icache (inode cache) safely are quite
1041*4882a593Smuzhiyunsufficient for accessing some cached symlinks safely.  In these cases,
1042*4882a593Smuzhiyunthe ``i_link`` pointer in the inode is set to point to wherever the
1043*4882a593Smuzhiyunsymlink is stored and it can be accessed directly whenever needed.
1044*4882a593Smuzhiyun
1045*4882a593SmuzhiyunWhen the symlink is stored in the page cache or elsewhere, the
1046*4882a593Smuzhiyunsituation is not so straightforward.  A reference on a dentry or even
1047*4882a593Smuzhiyunon an inode does not imply any reference on cached pages of that
1048*4882a593Smuzhiyuninode, and even an ``rcu_read_lock()`` is not sufficient to ensure that
1049*4882a593Smuzhiyuna page will not disappear.  So for these symlinks the pathname lookup
1050*4882a593Smuzhiyuncode needs to ask the filesystem to provide a stable reference and,
1051*4882a593Smuzhiyunsignificantly, needs to release that reference when it is finished
1052*4882a593Smuzhiyunwith it.
1053*4882a593Smuzhiyun
1054*4882a593SmuzhiyunTaking a reference to a cache page is often possible even in RCU-walk
1055*4882a593Smuzhiyunmode.  It does require making changes to memory, which is best avoided,
1056*4882a593Smuzhiyunbut that isn't necessarily a big cost and it is better than dropping
1057*4882a593Smuzhiyunout of RCU-walk mode completely.  Even filesystems that allocate
1058*4882a593Smuzhiyunspace to copy the symlink into can use ``GFP_ATOMIC`` to often successfully
1059*4882a593Smuzhiyunallocate memory without the need to drop out of RCU-walk.  If a
1060*4882a593Smuzhiyunfilesystem cannot successfully get a reference in RCU-walk mode, it
1061*4882a593Smuzhiyunmust return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
1062*4882a593SmuzhiyunREF-walk mode in which the filesystem is allowed to sleep.
1063*4882a593Smuzhiyun
1064*4882a593SmuzhiyunThe place for all this to happen is the ``i_op->follow_link()`` inode
1065*4882a593Smuzhiyunmethod.  In the present mainline code this is never actually called in
1066*4882a593SmuzhiyunRCU-walk mode as the rewrite is not quite complete.  It is likely that
1067*4882a593Smuzhiyunin a future release this method will be passed an ``inode`` pointer when
1068*4882a593Smuzhiyuncalled in RCU-walk mode so it both (1) knows to be careful, and (2) has the
1069*4882a593Smuzhiyunvalidated pointer.  Much like the ``i_op->permission()`` method we
1070*4882a593Smuzhiyunlooked at previously, ``->follow_link()`` would need to be careful that
1071*4882a593Smuzhiyunall the data structures it references are safe to be accessed while
1072*4882a593Smuzhiyunholding no counted reference, only the RCU lock.  Though getting a
1073*4882a593Smuzhiyunreference with ``->follow_link()`` is not yet done in RCU-walk mode, the
1074*4882a593Smuzhiyuncode is ready to release the reference when that does happen.
1075*4882a593Smuzhiyun
1076*4882a593SmuzhiyunThis need to drop the reference to a symlink adds significant
1077*4882a593Smuzhiyuncomplexity.  It requires a reference to the inode so that the
1078*4882a593Smuzhiyun``i_op->put_link()`` inode operation can be called.  In REF-walk, that
1079*4882a593Smuzhiyunreference is kept implicitly through a reference to the dentry, so
1080*4882a593Smuzhiyunkeeping the ``struct path`` of the symlink is easiest.  For RCU-walk,
1081*4882a593Smuzhiyunthe pointer to the inode is kept separately.  To allow switching from
1082*4882a593SmuzhiyunRCU-walk back to REF-walk in the middle of processing nested symlinks
1083*4882a593Smuzhiyunwe also need the seq number for the dentry so we can confirm that
1084*4882a593Smuzhiyunswitching back was safe.
1085*4882a593Smuzhiyun
1086*4882a593SmuzhiyunFinally, when providing a reference to a symlink, the filesystem also
1087*4882a593Smuzhiyunprovides an opaque "cookie" that must be passed to ``->put_link()`` so that it
1088*4882a593Smuzhiyunknows what to free.  This might be the allocated memory area, or a
1089*4882a593Smuzhiyunpointer to the ``struct page`` in the page cache, or something else
1090*4882a593Smuzhiyuncompletely.  Only the filesystem knows what it is.
1091*4882a593Smuzhiyun
1092*4882a593SmuzhiyunIn order for the reference to each symlink to be dropped when the walk completes,
1093*4882a593Smuzhiyunwhether in RCU-walk or REF-walk, the symlink stack needs to contain,
1094*4882a593Smuzhiyunalong with the path remnants:
1095*4882a593Smuzhiyun
1096*4882a593Smuzhiyun- the ``struct path`` to provide a reference to the inode in REF-walk
1097*4882a593Smuzhiyun- the ``struct inode *`` to provide a reference to the inode in RCU-walk
1098*4882a593Smuzhiyun- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
1099*4882a593Smuzhiyun- the ``cookie`` that tells ``->put_path()`` what to put.
1100*4882a593Smuzhiyun
1101*4882a593SmuzhiyunThis means that each entry in the symlink stack needs to hold five
1102*4882a593Smuzhiyunpointers and an integer instead of just one pointer (the path
1103*4882a593Smuzhiyunremnant).  On a 64-bit system, this is about 40 bytes per entry;
1104*4882a593Smuzhiyunwith 40 entries it adds up to 1600 bytes total, which is less than
1105*4882a593Smuzhiyunhalf a page.  So it might seem like a lot, but is by no means
1106*4882a593Smuzhiyunexcessive.
1107*4882a593Smuzhiyun
1108*4882a593SmuzhiyunNote that, in a given stack frame, the path remnant (``name``) is not
1109*4882a593Smuzhiyunpart of the symlink that the other fields refer to.  It is the remnant
1110*4882a593Smuzhiyunto be followed once that symlink has been fully parsed.
1111*4882a593Smuzhiyun
1112*4882a593SmuzhiyunFollowing the symlink
1113*4882a593Smuzhiyun---------------------
1114*4882a593Smuzhiyun
1115*4882a593SmuzhiyunThe main loop in ``link_path_walk()`` iterates seamlessly over all
1116*4882a593Smuzhiyuncomponents in the path and all of the non-final symlinks.  As symlinks
1117*4882a593Smuzhiyunare processed, the ``name`` pointer is adjusted to point to a new
1118*4882a593Smuzhiyunsymlink, or is restored from the stack, so that much of the loop
1119*4882a593Smuzhiyundoesn't need to notice.  Getting this ``name`` variable on and off the
1120*4882a593Smuzhiyunstack is very straightforward; pushing and popping the references is
1121*4882a593Smuzhiyuna little more complex.
1122*4882a593Smuzhiyun
1123*4882a593SmuzhiyunWhen a symlink is found, ``walk_component()`` returns the value ``1``
1124*4882a593Smuzhiyun(``0`` is returned for any other sort of success, and a negative number
1125*4882a593Smuzhiyunis, as usual, an error indicator).  This causes ``get_link()`` to be
1126*4882a593Smuzhiyuncalled; it then gets the link from the filesystem.  Providing that
1127*4882a593Smuzhiyunoperation is successful, the old path ``name`` is placed on the stack,
1128*4882a593Smuzhiyunand the new value is used as the ``name`` for a while.  When the end of
1129*4882a593Smuzhiyunthe path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
1130*4882a593Smuzhiyunoff the stack and path walking continues.
1131*4882a593Smuzhiyun
1132*4882a593SmuzhiyunPushing and popping the reference pointers (inode, cookie, etc.) is more
1133*4882a593Smuzhiyuncomplex in part because of the desire to handle tail recursion.  When
1134*4882a593Smuzhiyunthe last component of a symlink itself points to a symlink, we
1135*4882a593Smuzhiyunwant to pop the symlink-just-completed off the stack before pushing
1136*4882a593Smuzhiyunthe symlink-just-found to avoid leaving empty path remnants that would
1137*4882a593Smuzhiyunjust get in the way.
1138*4882a593Smuzhiyun
1139*4882a593SmuzhiyunIt is most convenient to push the new symlink references onto the
1140*4882a593Smuzhiyunstack in ``walk_component()`` immediately when the symlink is found;
1141*4882a593Smuzhiyun``walk_component()`` is also the last piece of code that needs to look at the
1142*4882a593Smuzhiyunold symlink as it walks that last component.  So it is quite
1143*4882a593Smuzhiyunconvenient for ``walk_component()`` to release the old symlink and pop
1144*4882a593Smuzhiyunthe references just before pushing the reference information for the
1145*4882a593Smuzhiyunnew symlink.  It is guided in this by two flags; ``WALK_GET``, which
1146*4882a593Smuzhiyungives it permission to follow a symlink if it finds one, and
1147*4882a593Smuzhiyun``WALK_PUT``, which tells it to release the current symlink after it has been
1148*4882a593Smuzhiyunfollowed.  ``WALK_PUT`` is tested first, leading to a call to
1149*4882a593Smuzhiyun``put_link()``.  ``WALK_GET`` is tested subsequently (by
1150*4882a593Smuzhiyun``should_follow_link()``) leading to a call to ``pick_link()`` which sets
1151*4882a593Smuzhiyunup the stack frame.
1152*4882a593Smuzhiyun
1153*4882a593SmuzhiyunSymlinks with no final component
1154*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1155*4882a593Smuzhiyun
1156*4882a593SmuzhiyunA pair of special-case symlinks deserve a little further explanation.
1157*4882a593SmuzhiyunBoth result in a new ``struct path`` (with mount and dentry) being set
1158*4882a593Smuzhiyunup in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.
1159*4882a593Smuzhiyun
1160*4882a593SmuzhiyunThe more obvious case is a symlink to "``/``".  All symlinks starting
1161*4882a593Smuzhiyunwith "``/``" are detected in ``get_link()`` which resets the ``nameidata``
1162*4882a593Smuzhiyunto point to the effective filesystem root.  If the symlink only
1163*4882a593Smuzhiyuncontains "``/``" then there is nothing more to do, no components at all,
1164*4882a593Smuzhiyunso ``NULL`` is returned to indicate that the symlink can be released and
1165*4882a593Smuzhiyunthe stack frame discarded.
1166*4882a593Smuzhiyun
1167*4882a593SmuzhiyunThe other case involves things in ``/proc`` that look like symlinks but
1168*4882a593Smuzhiyunaren't really (and are therefore commonly referred to as "magic-links")::
1169*4882a593Smuzhiyun
1170*4882a593Smuzhiyun     $ ls -l /proc/self/fd/1
1171*4882a593Smuzhiyun     lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
1172*4882a593Smuzhiyun
1173*4882a593SmuzhiyunEvery open file descriptor in any process is represented in ``/proc`` by
1174*4882a593Smuzhiyunsomething that looks like a symlink.  It is really a reference to the
1175*4882a593Smuzhiyuntarget file, not just the name of it.  When you ``readlink`` these
1176*4882a593Smuzhiyunobjects you get a name that might refer to the same file - unless it
1177*4882a593Smuzhiyunhas been unlinked or mounted over.  When ``walk_component()`` follows
1178*4882a593Smuzhiyunone of these, the ``->follow_link()`` method in "procfs" doesn't return
1179*4882a593Smuzhiyuna string name, but instead calls ``nd_jump_link()`` which updates the
1180*4882a593Smuzhiyun``nameidata`` in place to point to that target.  ``->follow_link()`` then
1181*4882a593Smuzhiyunreturns ``NULL``.  Again there is no final component and ``get_link()``
1182*4882a593Smuzhiyunreports this by leaving the ``last_type`` field of ``nameidata`` as
1183*4882a593Smuzhiyun``LAST_BIND``.
1184*4882a593Smuzhiyun
1185*4882a593SmuzhiyunFollowing the symlink in the final component
1186*4882a593Smuzhiyun--------------------------------------------
1187*4882a593Smuzhiyun
1188*4882a593SmuzhiyunAll this leads to ``link_path_walk()`` walking down every component, and
1189*4882a593Smuzhiyunfollowing all symbolic links it finds, until it reaches the final
1190*4882a593Smuzhiyuncomponent.  This is just returned in the ``last`` field of ``nameidata``.
1191*4882a593SmuzhiyunFor some callers, this is all they need; they want to create that
1192*4882a593Smuzhiyun``last`` name if it doesn't exist or give an error if it does.  Other
1193*4882a593Smuzhiyuncallers will want to follow a symlink if one is found, and possibly
1194*4882a593Smuzhiyunapply special handling to the last component of that symlink, rather
1195*4882a593Smuzhiyunthan just the last component of the original file name.  These callers
1196*4882a593Smuzhiyunpotentially need to call ``link_path_walk()`` again and again on
1197*4882a593Smuzhiyunsuccessive symlinks until one is found that doesn't point to another
1198*4882a593Smuzhiyunsymlink.
1199*4882a593Smuzhiyun
1200*4882a593SmuzhiyunThis case is handled by the relevant caller of ``link_path_walk()``, such as
1201*4882a593Smuzhiyun``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
1202*4882a593Smuzhiyunhandles the final component.  If the final component is a symlink
1203*4882a593Smuzhiyunthat needs to be followed, then ``trailing_symlink()`` is called to set
1204*4882a593Smuzhiyunthings up properly and the loop repeats, calling ``link_path_walk()``
1205*4882a593Smuzhiyunagain.  This could loop as many as 40 times if the last component of
1206*4882a593Smuzhiyuneach symlink is another symlink.
1207*4882a593Smuzhiyun
1208*4882a593SmuzhiyunThe various functions that examine the final component and possibly
1209*4882a593Smuzhiyunreport that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
1210*4882a593Smuzhiyunand ``do_last()``, each of which use the same convention as
1211*4882a593Smuzhiyun``walk_component()`` of returning ``1`` if a symlink was found that needs
1212*4882a593Smuzhiyunto be followed.
1213*4882a593Smuzhiyun
1214*4882a593SmuzhiyunOf these, ``do_last()`` is the most interesting as it is used for
1215*4882a593Smuzhiyunopening a file.  Part of ``do_last()`` runs with ``i_rwsem`` held and this
1216*4882a593Smuzhiyunpart is in a separate function: ``lookup_open()``.
1217*4882a593Smuzhiyun
1218*4882a593SmuzhiyunExplaining ``do_last()`` completely is beyond the scope of this article,
1219*4882a593Smuzhiyunbut a few highlights should help those interested in exploring the
1220*4882a593Smuzhiyuncode.
1221*4882a593Smuzhiyun
1222*4882a593Smuzhiyun1. Rather than just finding the target file, ``do_last()`` needs to open
1223*4882a593Smuzhiyun   it.  If the file was found in the dcache, then ``vfs_open()`` is used for
1224*4882a593Smuzhiyun   this.  If not, then ``lookup_open()`` will either call ``atomic_open()`` (if
1225*4882a593Smuzhiyun   the filesystem provides it) to combine the final lookup with the open, or
1226*4882a593Smuzhiyun   will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
1227*4882a593Smuzhiyun   directly.  In the later case the actual "open" of this newly found or
1228*4882a593Smuzhiyun   created file will be performed by ``vfs_open()``, just as if the name
1229*4882a593Smuzhiyun   were found in the dcache.
1230*4882a593Smuzhiyun
1231*4882a593Smuzhiyun2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
1232*4882a593Smuzhiyun   wasn't quite current enough.  Rather than restarting the lookup from
1233*4882a593Smuzhiyun   the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
1234*4882a593Smuzhiyun   giving the filesystem a chance to resolve small inconsistencies.
1235*4882a593Smuzhiyun   If that doesn't work, only then is the lookup restarted from the top.
1236*4882a593Smuzhiyun
1237*4882a593Smuzhiyun3. An open with O_CREAT **does** follow a symlink in the final component,
1238*4882a593Smuzhiyun   unlike other creation system calls (like ``mkdir``).  So the sequence::
1239*4882a593Smuzhiyun
1240*4882a593Smuzhiyun          ln -s bar /tmp/foo
1241*4882a593Smuzhiyun          echo hello > /tmp/foo
1242*4882a593Smuzhiyun
1243*4882a593Smuzhiyun   will create a file called ``/tmp/bar``.  This is not permitted if
1244*4882a593Smuzhiyun   ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much
1245*4882a593Smuzhiyun   like for a non-creating open: ``should_follow_link()`` returns ``1``, and
1246*4882a593Smuzhiyun   so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
1247*4882a593Smuzhiyun   open process continues on the symlink that was found.
1248*4882a593Smuzhiyun
1249*4882a593SmuzhiyunUpdating the access time
1250*4882a593Smuzhiyun------------------------
1251*4882a593Smuzhiyun
1252*4882a593SmuzhiyunWe previously said of RCU-walk that it would "take no locks, increment
1253*4882a593Smuzhiyunno counts, leave no footprints."  We have since seen that some
1254*4882a593Smuzhiyun"footprints" can be needed when handling symlinks as a counted
1255*4882a593Smuzhiyunreference (or even a memory allocation) may be needed.  But these
1256*4882a593Smuzhiyunfootprints are best kept to a minimum.
1257*4882a593Smuzhiyun
1258*4882a593SmuzhiyunOne other place where walking down a symlink can involve leaving
1259*4882a593Smuzhiyunfootprints in a way that doesn't affect directories is in updating access times.
1260*4882a593SmuzhiyunIn Unix (and Linux) every filesystem object has a "last accessed
1261*4882a593Smuzhiyuntime", or "``atime``".  Passing through a directory to access a file
1262*4882a593Smuzhiyunwithin is not considered to be an access for the purposes of
1263*4882a593Smuzhiyun``atime``; only listing the contents of a directory can update its ``atime``.
1264*4882a593SmuzhiyunSymlinks are different it seems.  Both reading a symlink (with ``readlink()``)
1265*4882a593Smuzhiyunand looking up a symlink on the way to some other destination can
1266*4882a593Smuzhiyunupdate the atime on that symlink.
1267*4882a593Smuzhiyun
1268*4882a593Smuzhiyun.. _clearest statement: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
1269*4882a593Smuzhiyun
1270*4882a593SmuzhiyunIt is not clear why this is the case; POSIX has little to say on the
1271*4882a593Smuzhiyunsubject.  The `clearest statement`_ is that, if a particular implementation
1272*4882a593Smuzhiyunupdates a timestamp in a place not specified by POSIX, this must be
1273*4882a593Smuzhiyundocumented "except that any changes caused by pathname resolution need
1274*4882a593Smuzhiyunnot be documented".  This seems to imply that POSIX doesn't really
1275*4882a593Smuzhiyuncare about access-time updates during pathname lookup.
1276*4882a593Smuzhiyun
1277*4882a593Smuzhiyun.. _Linux 1.3.87: https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/fs/ext2/symlink.c?id=f806c6db77b8eaa6e00dcfb6b567706feae8dbb8
1278*4882a593Smuzhiyun
1279*4882a593SmuzhiyunAn examination of history shows that prior to `Linux 1.3.87`_, the ext2
1280*4882a593Smuzhiyunfilesystem, at least, didn't update atime when following a link.
1281*4882a593SmuzhiyunUnfortunately we have no record of why that behavior was changed.
1282*4882a593Smuzhiyun
1283*4882a593SmuzhiyunIn any case, access time must now be updated and that operation can be
1284*4882a593Smuzhiyunquite complex.  Trying to stay in RCU-walk while doing it is best
1285*4882a593Smuzhiyunavoided.  Fortunately it is often permitted to skip the ``atime``
1286*4882a593Smuzhiyunupdate.  Because ``atime`` updates cause performance problems in various
1287*4882a593Smuzhiyunareas, Linux supports the ``relatime`` mount option, which generally
1288*4882a593Smuzhiyunlimits the updates of ``atime`` to once per day on files that aren't
1289*4882a593Smuzhiyunbeing changed (and symlinks never change once created).  Even without
1290*4882a593Smuzhiyun``relatime``, many filesystems record ``atime`` with a one-second
1291*4882a593Smuzhiyungranularity, so only one update per second is required.
1292*4882a593Smuzhiyun
1293*4882a593SmuzhiyunIt is easy to test if an ``atime`` update is needed while in RCU-walk
1294*4882a593Smuzhiyunmode and, if it isn't, the update can be skipped and RCU-walk mode
1295*4882a593Smuzhiyuncontinues.  Only when an ``atime`` update is actually required does the
1296*4882a593Smuzhiyunpath walk drop down to REF-walk.  All of this is handled in the
1297*4882a593Smuzhiyun``get_link()`` function.
1298*4882a593Smuzhiyun
1299*4882a593SmuzhiyunA few flags
1300*4882a593Smuzhiyun-----------
1301*4882a593Smuzhiyun
1302*4882a593SmuzhiyunA suitable way to wrap up this tour of pathname walking is to list
1303*4882a593Smuzhiyunthe various flags that can be stored in the ``nameidata`` to guide the
1304*4882a593Smuzhiyunlookup process.  Many of these are only meaningful on the final
1305*4882a593Smuzhiyuncomponent, others reflect the current state of the pathname lookup, and some
1306*4882a593Smuzhiyunapply restrictions to all path components encountered in the path lookup.
1307*4882a593Smuzhiyun
1308*4882a593SmuzhiyunAnd then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
1309*4882a593Smuzhiyunthe others.  If this is not set, an empty pathname causes an error
1310*4882a593Smuzhiyunvery early on.  If it is set, empty pathnames are not considered to be
1311*4882a593Smuzhiyunan error.
1312*4882a593Smuzhiyun
1313*4882a593SmuzhiyunGlobal state flags
1314*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~
1315*4882a593Smuzhiyun
1316*4882a593SmuzhiyunWe have already met two global state flags: ``LOOKUP_RCU`` and
1317*4882a593Smuzhiyun``LOOKUP_REVAL``.  These select between one of three overall approaches
1318*4882a593Smuzhiyunto lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
1319*4882a593Smuzhiyun
1320*4882a593Smuzhiyun``LOOKUP_PARENT`` indicates that the final component hasn't been reached
1321*4882a593Smuzhiyunyet.  This is primarily used to tell the audit subsystem the full
1322*4882a593Smuzhiyuncontext of a particular access being audited.
1323*4882a593Smuzhiyun
1324*4882a593Smuzhiyun``LOOKUP_ROOT`` indicates that the ``root`` field in the ``nameidata`` was
1325*4882a593Smuzhiyunprovided by the caller, so it shouldn't be released when it is no
1326*4882a593Smuzhiyunlonger needed.
1327*4882a593Smuzhiyun
1328*4882a593Smuzhiyun``LOOKUP_JUMPED`` means that the current dentry was chosen not because
1329*4882a593Smuzhiyunit had the right name but for some other reason.  This happens when
1330*4882a593Smuzhiyunfollowing "``..``", following a symlink to ``/``, crossing a mount point
1331*4882a593Smuzhiyunor accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
1332*4882a593Smuzhiyunlink"). In this case the filesystem has not been asked to revalidate the
1333*4882a593Smuzhiyunname (with ``d_revalidate()``).  In such cases the inode may still need
1334*4882a593Smuzhiyunto be revalidated, so ``d_op->d_weak_revalidate()`` is called if
1335*4882a593Smuzhiyun``LOOKUP_JUMPED`` is set when the look completes - which may be at the
1336*4882a593Smuzhiyunfinal component or, when creating, unlinking, or renaming, at the penultimate component.
1337*4882a593Smuzhiyun
1338*4882a593SmuzhiyunResolution-restriction flags
1339*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1340*4882a593Smuzhiyun
1341*4882a593SmuzhiyunIn order to allow userspace to protect itself against certain race conditions
1342*4882a593Smuzhiyunand attack scenarios involving changing path components, a series of flags are
1343*4882a593Smuzhiyunavailable which apply restrictions to all path components encountered during
1344*4882a593Smuzhiyunpath lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
1345*4882a593Smuzhiyun
1346*4882a593Smuzhiyun``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
1347*4882a593SmuzhiyunThis is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
1348*4882a593Smuzhiyunrelates to restricting the following of trailing symlinks.
1349*4882a593Smuzhiyun
1350*4882a593Smuzhiyun``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
1351*4882a593Smuzhiyunensure that they return errors from ``nd_jump_link()``, because that is how
1352*4882a593Smuzhiyun``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
1353*4882a593Smuzhiyun
1354*4882a593Smuzhiyun``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
1355*4882a593Smuzhiyunbind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
1356*4882a593Smuzhiyunlookup is determined by the first mountpoint the path lookup reaches --
1357*4882a593Smuzhiyunabsolute paths start with the ``vfsmount`` of ``/``, and relative paths start
1358*4882a593Smuzhiyunwith the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
1359*4882a593Smuzhiyun``vfsmount`` of the path is unchanged.
1360*4882a593Smuzhiyun
1361*4882a593Smuzhiyun``LOOKUP_BENEATH`` blocks any path components which resolve outside the
1362*4882a593Smuzhiyunstarting point of the resolution. This is done by blocking ``nd_jump_root()``
1363*4882a593Smuzhiyunas well as blocking ".." if it would jump outside the starting point.
1364*4882a593Smuzhiyun``rename_lock`` and ``mount_lock`` are used to detect attacks against the
1365*4882a593Smuzhiyunresolution of "..". Magic-links are also blocked.
1366*4882a593Smuzhiyun
1367*4882a593Smuzhiyun``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
1368*4882a593Smuzhiyunwere the filesystem root. ``nd_jump_root()`` brings the resolution back to
1369*4882a593Smuzhiyunthe starting point, and ".." at the starting point will act as a no-op. As with
1370*4882a593Smuzhiyun``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
1371*4882a593Smuzhiyunattacks against ".." resolution. Magic-links are also blocked.
1372*4882a593Smuzhiyun
1373*4882a593SmuzhiyunFinal-component flags
1374*4882a593Smuzhiyun~~~~~~~~~~~~~~~~~~~~~
1375*4882a593Smuzhiyun
1376*4882a593SmuzhiyunSome of these flags are only set when the final component is being
1377*4882a593Smuzhiyunconsidered.  Others are only checked for when considering that final
1378*4882a593Smuzhiyuncomponent.
1379*4882a593Smuzhiyun
1380*4882a593Smuzhiyun``LOOKUP_AUTOMOUNT`` ensures that, if the final component is an automount
1381*4882a593Smuzhiyunpoint, then the mount is triggered.  Some operations would trigger it
1382*4882a593Smuzhiyunanyway, but operations like ``stat()`` deliberately don't.  ``statfs()``
1383*4882a593Smuzhiyunneeds to trigger the mount but otherwise behaves a lot like ``stat()``, so
1384*4882a593Smuzhiyunit sets ``LOOKUP_AUTOMOUNT``, as does "``quotactl()``" and the handling of
1385*4882a593Smuzhiyun"``mount --bind``".
1386*4882a593Smuzhiyun
1387*4882a593Smuzhiyun``LOOKUP_FOLLOW`` has a similar function to ``LOOKUP_AUTOMOUNT`` but for
1388*4882a593Smuzhiyunsymlinks.  Some system calls set or clear it implicitly, while
1389*4882a593Smuzhiyunothers have API flags such as ``AT_SYMLINK_FOLLOW`` and
1390*4882a593Smuzhiyun``UMOUNT_NOFOLLOW`` to control it.  Its effect is similar to
1391*4882a593Smuzhiyun``WALK_GET`` that we already met, but it is used in a different way.
1392*4882a593Smuzhiyun
1393*4882a593Smuzhiyun``LOOKUP_DIRECTORY`` insists that the final component is a directory.
1394*4882a593SmuzhiyunVarious callers set this and it is also set when the final component
1395*4882a593Smuzhiyunis found to be followed by a slash.
1396*4882a593Smuzhiyun
1397*4882a593SmuzhiyunFinally ``LOOKUP_OPEN``, ``LOOKUP_CREATE``, ``LOOKUP_EXCL``, and
1398*4882a593Smuzhiyun``LOOKUP_RENAME_TARGET`` are not used directly by the VFS but are made
1399*4882a593Smuzhiyunavailable to the filesystem and particularly the ``->d_revalidate()``
1400*4882a593Smuzhiyunmethod.  A filesystem can choose not to bother revalidating too hard
1401*4882a593Smuzhiyunif it knows that it will be asked to open or create the file soon.
1402*4882a593SmuzhiyunThese flags were previously useful for ``->lookup()`` too but with the
1403*4882a593Smuzhiyunintroduction of ``->atomic_open()`` they are less relevant there.
1404*4882a593Smuzhiyun
1405*4882a593SmuzhiyunEnd of the road
1406*4882a593Smuzhiyun---------------
1407*4882a593Smuzhiyun
1408*4882a593SmuzhiyunDespite its complexity, all this pathname lookup code appears to be
1409*4882a593Smuzhiyunin good shape - various parts are certainly easier to understand now
1410*4882a593Smuzhiyunthan even a couple of releases ago.  But that doesn't mean it is
1411*4882a593Smuzhiyun"finished".   As already mentioned, RCU-walk currently only follows
1412*4882a593Smuzhiyunsymlinks that are stored in the inode so, while it handles many ext4
1413*4882a593Smuzhiyunsymlinks, it doesn't help with NFS, XFS, or Btrfs.  That support
1414*4882a593Smuzhiyunis not likely to be long delayed.
1415