.. SPDX-License-Identifier: GPL-2.0

====
FUSE
====

Definitions
===========

Userspace filesystem:
  A filesystem in which data and metadata are provided by an ordinary
  userspace process.  The filesystem can be accessed normally through
  the kernel interface.

Filesystem daemon:
  The process(es) providing the data and metadata of the filesystem.

Non-privileged mount (or user mount):
  A userspace filesystem mounted by a non-privileged (non-root) user.
  The filesystem daemon is running with the privileges of the mounting
  user.  NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.

Filesystem connection:
  A connection between the filesystem daemon and the kernel.  The
  connection exists until either the daemon dies, or the filesystem is
  umounted.  Note that detaching (or lazy umounting) the filesystem
  does *not* break the connection; in this case it will exist until
  the last reference to the filesystem is released.

Mount owner:
  The user who does the mounting.

User:
  The user who is performing filesystem operations.

What is FUSE?
=============

FUSE is a userspace filesystem framework.  It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).

One of the most important features of FUSE is allowing secure,
non-privileged mounts.  This opens up new possibilities for the use of
filesystems.  A good example is sshfs: a secure network filesystem
using the sftp protocol.

The userspace library and utilities are available from the
`FUSE homepage: <https://github.com/libfuse/>`_

Filesystem type
===============

The filesystem type given to mount(2) can be one of the following:

    fuse
      This is the usual way to mount a FUSE filesystem.  The first
      argument of the mount system call may contain an arbitrary string,
      which is not interpreted by the kernel.

    fuseblk
      The filesystem is block device based.  The first argument of the
      mount system call is interpreted as the name of the device.

Mount options
=============

fd=N
  The file descriptor to use for communication between the userspace
  filesystem and the kernel.  The file descriptor must have been
  obtained by opening the FUSE device ('/dev/fuse').

rootmode=M
  The file mode of the filesystem's root in octal representation.

user_id=N
  The numeric user id of the mount owner.

group_id=N
  The numeric group id of the mount owner.

default_permissions
  By default FUSE doesn't check file access permissions; the
  filesystem is free to implement its own access policy or leave it to
  the underlying file access mechanism (e.g. in case of network
  filesystems).  This option enables permission checking, restricting
  access based on file mode.  It is usually useful together with the
  'allow_other' mount option.

allow_other
  This option overrides the security measure restricting file access
  to the user mounting the filesystem.  This option is by default only
  allowed to root, but this restriction can be removed with a
  (userspace) configuration option.

max_read=N
  With this option the maximum size of read operations can be set.
  The default is infinite.  Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).

blksize=N
  Set the block size for the filesystem.  The default is 512.  This
  option is only valid for 'fuseblk' type mounts.
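
As a rough illustration of how the mandatory options above fit
together, the following C sketch formats an option string of the kind
that could be passed as the data argument of mount(2).  The function
name ``fuse_build_mount_opts`` is hypothetical and not part of any
library; the descriptor is assumed to come from opening '/dev/fuse'.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format the option string for a "fuse" type mount (hypothetical helper).
 * fd is assumed to come from open("/dev/fuse", O_RDWR); rootmode is
 * printed in octal, matching the rootmode=M description.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int fuse_build_mount_opts(char *buf, size_t size, int fd,
                          mode_t rootmode, uid_t uid, gid_t gid)
{
    int n = snprintf(buf, size,
                     "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
                     fd, (unsigned int)rootmode,
                     (unsigned int)uid, (unsigned int)gid);

    /* snprintf reports the length it wanted; detect truncation. */
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A privileged caller might then pass the resulting string as the last
argument of mount(2) with the filesystem type "fuse".  Optional flags
such as 'default_permissions' or 'allow_other' would be appended to the
same comma-separated string.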

Control filesystem
==================

There's a control filesystem for FUSE, which can be mounted by::

  mount -t fusectl none /sys/fs/fuse/connections

Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.

Under the fuse control filesystem each connection has a directory
named by a unique number.

For each connection the following files exist within this directory:

	waiting
	  The number of requests which are waiting to be transferred to
	  userspace or being processed by the filesystem daemon.  If there is
	  no filesystem activity and 'waiting' is non-zero, then the
	  filesystem is hung or deadlocked.

	abort
	  Writing anything into this file will abort the filesystem
	  connection.  This means that all waiting requests will be aborted
	  and an error returned for all aborted and new requests.

Only the owner of the mount may read or write these files.
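
The two files can be used from a small monitoring tool.  The C sketch
below is only an illustration: the helper names ``fusectl_read_waiting``
and ``fusectl_abort`` are made up for this example, and the base
directory is passed in as a parameter so that the usual mount point
('/sys/fs/fuse/connections') is not hard-coded.

```c
#include <stdio.h>

/*
 * Read the 'waiting' counter of one connection (hypothetical helper).
 * base is the fusectl mount point, conn the connection's directory
 * name (its unique number).  Returns the counter, or -1 on error.
 */
long fusectl_read_waiting(const char *base, unsigned int conn)
{
    char path[256];
    long value = -1;
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%u/waiting", base, conn);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/*
 * Abort a connection by writing to its 'abort' file.  The content is
 * irrelevant; any write triggers the abort.  Returns 0 on success.
 */
int fusectl_abort(const char *base, unsigned int conn)
{
    char path[256];
    FILE *f;
    int ok;
    int n = snprintf(path, sizeof(path), "%s/%u/abort", base, conn);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = (fputs("1\n", f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A non-zero 'waiting' value on its own is not proof of a hang; as noted
above, it only indicates a problem when there is no filesystem
activity.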

Interrupting filesystem operations
==================================

If a process issuing a FUSE filesystem request is interrupted, the
following will happen:

  -  If the request is not yet sent to userspace AND the signal is
     fatal (SIGKILL or unhandled fatal signal), then the request is
     dequeued and returns immediately.

  -  If the request is not yet sent to userspace AND the signal is not
     fatal, then an interrupted flag is set for the request.  When
     the request has been successfully transferred to userspace and
     this flag is set, an INTERRUPT request is queued.

  -  If the request is already sent to userspace, then an INTERRUPT
     request is queued.

INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the *original* request, with
the error set to EINTR.

It is also possible that there's a race between processing the
original request and its INTERRUPT request.  There are two possibilities:

  1. The INTERRUPT request is processed before the original request is
     processed

  2. The INTERRUPT request is processed after the original request has
     been answered

If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error.  In case
1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
reply will be ignored.
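
The decision a daemon has to make on receiving an INTERRUPT can be
modelled in a few lines of C.  This is a simplified sketch, not libfuse
code: the ``inflight`` table, the ``interrupt_action`` result and the
function ``handle_interrupt`` are all hypothetical, the wait for a
timeout or for new requests is left out, and errors are assumed to be
sent as negative errno values.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 16

/* Hypothetical daemon-side bookkeeping: IDs of requests in progress. */
struct inflight {
    uint64_t unique[MAX_INFLIGHT];
    size_t count;
};

/* What the daemon should send in response to an INTERRUPT. */
struct interrupt_action {
    uint64_t reply_unique;  /* ID of the request the reply answers */
    int error;              /* negative errno to put in the reply */
};

struct interrupt_action handle_interrupt(struct inflight *tbl,
                                         uint64_t interrupt_unique,
                                         uint64_t target_unique)
{
    struct interrupt_action act;
    size_t i;

    for (i = 0; i < tbl->count; i++) {
        if (tbl->unique[i] == target_unique) {
            /* Honor the interrupt: answer the *original* request
             * with EINTR and drop it from the table. */
            tbl->unique[i] = tbl->unique[--tbl->count];
            act.reply_unique = target_unique;
            act.error = -EINTR;
            return act;
        }
    }

    /* Original request not found (not yet seen, or already answered):
     * reply to the INTERRUPT itself with EAGAIN.  A real daemon would
     * first wait for a timeout and/or for new requests to arrive. */
    act.reply_unique = interrupt_unique;
    act.error = -EAGAIN;
    return act;
}
```

Ignoring INTERRUPT requests entirely remains a valid choice; the
sketch only covers a daemon that decides to honor them.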

Aborting a filesystem connection
================================

It is possible to get into certain situations where the filesystem is
not responding.  Reasons for this may be:

  a) Broken userspace filesystem implementation

  b) Network connection down

  c) Accidental deadlock

  d) Malicious deadlock

(For more on c) and d) see later sections)

In any of these cases it may be useful to abort the connection to
the filesystem.  There are several ways to do this:

  - Kill the filesystem daemon.  Works in case of a) and b)

  - Kill the filesystem daemon and all users of the filesystem.  Works
    in all cases except some malicious deadlocks

  - Use forced umount (umount -f).  Works in all cases but only if
    the filesystem is still attached (it hasn't been lazy unmounted)

  - Abort filesystem through the FUSE control filesystem.  Most
    powerful method, always works.

How do non-privileged mounts work?
==================================

Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.

The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system.  Obvious requirements arising from this are:

 A) mount owner should not be able to get elevated privileges with the
    help of the mounted filesystem

 B) mount owner should not get illegitimate access to information from
    other users' and the super user's processes

 C) mount owner should not be able to induce undesired behavior in
    other users' or the super user's processes

How are requirements fulfilled?
===============================

 A) The mount owner could gain elevated privileges by either:

    1. creating a filesystem containing a device file, then opening this device

    2. creating a filesystem containing a suid or sgid application, then executing this application

    The solution is to disallow opening device files and to ignore
    setuid and setgid bits when executing programs.  To ensure this,
    fusermount always adds "nosuid" and "nodev" to the mount options
    for non-privileged mounts.

 B) If another user is accessing files or directories in the
    filesystem, the filesystem daemon serving requests can record the
    exact sequence and timing of operations performed.  This
    information is otherwise inaccessible to the mount owner, so this
    counts as an information leak.

    The solution to this problem will be presented in point 2) of C).

 C) There are several ways in which the mount owner can induce
    undesired behavior in other users' processes, such as:

     1) mounting a filesystem over a file or directory which the mount
        owner would otherwise not be able to modify (or could only
        make limited modifications).

        This is solved in fusermount, by checking the access
        permissions on the mountpoint and only allowing the mount if
        the mount owner can do unlimited modification (has write
        access to the mountpoint, and mountpoint is not a "sticky"
        directory)

     2) Even if 1) is solved the mount owner can change the behavior
        of other users' processes.

         i) It can slow down or indefinitely delay the execution of a
            filesystem operation, creating a DoS against the user or the
            whole system.  For example a suid application locking a
            system file, and then accessing a file on the mount owner's
            filesystem could be stopped, thus causing the system
            file to be locked forever.

         ii) It can present files or directories of unlimited length, or
             directory structures of unlimited depth, possibly causing a
             system process to eat up diskspace, memory or other
             resources, again causing *DoS*.

	The solution to this as well as B) is to deny access to the
	filesystem for processes which the mount owner could not
	otherwise monitor or manipulate.  Since if the
	mount owner can ptrace a process, it can do all of the above
	without using a FUSE mount, the same criteria as used in
	ptrace can be used to check if a process is allowed to access
	the filesystem or not.

	Note that the *ptrace* check is not strictly necessary to
	prevent C/2/i, it is enough to check if mount owner has enough
	privilege to send a signal to the process accessing the
	filesystem, since *SIGSTOP* can be used to get a similar effect.

I think these limitations are unacceptable?
===========================================

If a sysadmin trusts the users enough, or can ensure through other
measures that system processes will never enter non-privileged
mounts, they can relax the last limitation with a 'user_allow_other'
config option.  If this config option is set, the mounting user can
add the 'allow_other' mount option which disables the check for other
users' processes.
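
The effect of 'allow_other' on the access check can be pictured with a
small model in C.  This is a deliberately simplified sketch and not the
kernel's code: the ``proc_creds`` structure and the function
``fuse_access_allowed`` are hypothetical, and the ptrace-like criterion
is approximated by requiring that all user and group ids of the
accessing process match those of the mount owner.

```c
#include <sys/types.h>

/* Hypothetical snapshot of the credentials of an accessing process. */
struct proc_creds {
    uid_t uid, euid, suid;
    gid_t gid, egid, sgid;
};

/*
 * Return 1 if the process may access the mount, 0 otherwise.
 * owner_uid/owner_gid correspond to the user_id= and group_id=
 * mount options.
 */
int fuse_access_allowed(int allow_other, uid_t owner_uid, gid_t owner_gid,
                        const struct proc_creds *c)
{
    /* 'allow_other' disables the check for other users' processes. */
    if (allow_other)
        return 1;

    /* Otherwise only processes the mount owner could already monitor
     * or manipulate are let in (approximated by matching ids). */
    return c->uid == owner_uid && c->euid == owner_uid &&
           c->suid == owner_uid &&
           c->gid == owner_gid && c->egid == owner_gid &&
           c->sgid == owner_gid;
}
```

Under this model a suid root program started by the mount owner is
rejected unless 'allow_other' is set, because its effective user id no
longer matches the owner's.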

Kernel - userspace interface
============================

The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE. ::


 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |

.. note:: Everything in the description above is greatly simplified

There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.

**Scenario 1 -  Simple deadlock**::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

The solution for this is to allow the filesystem to be aborted.

**Scenario 2 - Tricky deadlock**


This one needs a carefully crafted filesystem.  It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault. ::

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

The solution is basically the same as above.

An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted.  This is
because the destination address of the copy may not be valid after the
request has returned.

This is solved by doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages().  The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.
407*4882a593Smuzhiyuntaking place, and abort is delayed until this flag is unset.
408