.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, and Bioinformatics.

OrangeFS, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

OrangeFS features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf.  Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab.  It is a
single line.  Uncomment it and change the hostname if necessary.  This
controls clients which use libpvfs2.  This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client.  The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

OrangeFS versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local.  As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint

    make

    make install

Create an OrangeFS config file by running pvfs2-genconfig and
specifying a target config file. pvfs2-genconfig will prompt you
through. Generally it works fine to take the defaults, but you
should use your server's hostname, rather than "localhost", when
it comes to that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file (localhost is fine)::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If everything seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests.  This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf.  Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable POSIX locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. POSIX
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.


Debugging
=========

If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) go to syslog::

  echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

  echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

  echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

  cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

OrangeFS is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of OrangeFS as "userspace"
from here on out. OrangeFS descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
kernel coding style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl. The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.

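The arithmetic behind the bufmap fields can be checked with a small
userspace sketch. It is not kernel code: every ``sketch_`` name is
hypothetical, the constants are simply the default values quoted above,
and a 4096 byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default values quoted in the text above. */
#define SKETCH_DESC_SIZE  4194304u /* partition size in bytes */
#define SKETCH_DESC_COUNT 10u      /* partitions in the IO buffer */
#define SKETCH_PAGE_SIZE  4096u    /* page size assumed by the text */

/* Hypothetical stand-in for one entry of desc_array. */
struct sketch_desc {
    size_t first_page;   /* index into page_array where the partition starts */
    size_t array_count;  /* number of pages in the partition */
    size_t uaddr_offset; /* byte offset of the partition from user_desc->ptr */
};

/* log2 of a power of two, the way desc_shift relates to desc_size. */
unsigned sketch_log2(uint32_t v)
{
    unsigned shift = 0;

    assert(v != 0 && (v & (v - 1)) == 0); /* must be a power of two */
    while (v > 1) {
        v >>= 1;
        shift++;
    }
    return shift;
}

/* Fill out[0..SKETCH_DESC_COUNT-1] the way the listing above walks the buffer. */
void sketch_fill_descs(struct sketch_desc *out)
{
    size_t pages_per_desc = SKETCH_DESC_SIZE / SKETCH_PAGE_SIZE;
    size_t offset = 0;

    for (size_t i = 0; i < SKETCH_DESC_COUNT; i++) {
        out[i].first_page = offset;
        out[i].array_count = pages_per_desc;
        out[i].uaddr_offset = i * pages_per_desc * SKETCH_PAGE_SIZE;
        offset += pages_per_desc;
    }
}
```

With these defaults each partition covers 1024 pages, the last partition
starts at page 9216, and ten partitions of 4194304 bytes add up to the
41943040 byte IO buffer.
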
Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

 * unknown
     - op was just initialized
 * waiting
     - op is on request_list (upward bound)
 * inprogr
     - op is in progress (waiting for downcall)
 * serviced
     - op has matching downcall; ok
 * purged
     - op has to start a timer since client-core
       exited uncleanly before servicing op
 * given up
     - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on OrangeFS (readdir, I/O, create, whatever),
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

service_operation changes the op's state to "waiting", puts
it on the request list, and signals the OrangeFS file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the OrangeFS file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the OrangeFS
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.

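The path of an op from "waiting" through the in_progress hash table to
"serviced" can be sketched in plain C. This is an illustration only, not
the kernel module's code: the ``sketch_`` names, the modulo hash and the
bucket count of 509 are all assumptions, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TABLE_SIZE 509 /* assumed bucket count, for illustration */

enum sketch_state {
    SK_UNKNOWN, SK_WAITING, SK_INPROGR, SK_SERVICED, SK_PURGED, SK_GIVEN_UP
};

struct sketch_op {
    uint64_t tag;            /* distinguishing ID number */
    enum sketch_state state;
    struct sketch_op *next;  /* next op in the same hash bucket */
};

struct sketch_table {
    struct sketch_op *bucket[SKETCH_TABLE_SIZE];
};

/* Assumed hash: the tag modulo the table size. */
size_t sketch_hash(uint64_t tag)
{
    return (size_t)(tag % SKETCH_TABLE_SIZE);
}

/* After a successful device read: mark the op in progress and append it
 * to the end of the list in its bucket. */
void sketch_mark_in_progress(struct sketch_table *t, struct sketch_op *op)
{
    struct sketch_op **pp = &t->bucket[sketch_hash(op->tag)];

    assert(op->state == SK_WAITING);
    op->state = SK_INPROGR;
    op->next = NULL;
    while (*pp != NULL)
        pp = &(*pp)->next;
    *pp = op;
}

/* After a device write carrying tag: unlink the matching op and mark it
 * serviced unless its submitter already gave up. Returns NULL when no op
 * with that tag is in progress. */
struct sketch_op *sketch_complete(struct sketch_table *t, uint64_t tag)
{
    struct sketch_op **pp = &t->bucket[sketch_hash(tag)];
    struct sketch_op *op;

    while (*pp != NULL && (*pp)->tag != tag)
        pp = &(*pp)->next;
    op = *pp;
    if (op == NULL)
        return NULL;
    *pp = op->next;
    op->next = NULL;
    if (op->state != SK_GIVEN_UP)
        op->state = SK_SERVICED;
    return op;
}
```

Two ops whose tags land in the same bucket are kept in arrival order, so
a response can be matched by walking only one short list.
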
The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the OrangeFS file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use OrangeFS will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

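Obtaining "an index to a free partition" amounts to claiming a slot in a
small in-use array. The sketch below shows that idea under stated
assumptions: the ``sketch_`` names are hypothetical, the slot count is the
IO buffer's default of 10, and the locking and waiting that a real
implementation needs when no slot is free are omitted.

```c
#include <assert.h>

#define SKETCH_SLOTS 10 /* desc_count for the IO buffer, per the text */

/* in_use[i] is nonzero while partition i is handed out. */
struct sketch_slots {
    int in_use[SKETCH_SLOTS];
};

/* Claim the first free partition; returns its index, or -1 if none is free. */
int sketch_get_slot(struct sketch_slots *s)
{
    for (int i = 0; i < SKETCH_SLOTS; i++) {
        if (!s->in_use[i]) {
            s->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Release a partition once its payload has been consumed. */
void sketch_put_slot(struct sketch_slots *s, int idx)
{
    assert(idx >= 0 && idx < SKETCH_SLOTS);
    assert(s->in_use[idx]);
    s->in_use[idx] = 0;
}
```

An op would call ``sketch_get_slot`` before it is launched and
``sketch_put_slot`` after the data in the partition has been copied out.
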
Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    and, when the object is a symlink, a string with the link target.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

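The shape described above, a few common members followed by a union with
one member per response type, is a tagged union. The following sketch
shows only that shape. The four common members are taken from the list
above; the type values and the two union payloads are hypothetical
placeholders, not the real pvfs2_downcall_t layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response type values, for illustration only. */
enum sketch_resp_type { SK_RESP_LOOKUP = 1, SK_RESP_READDIR = 2 };

struct sketch_downcall {
    int32_t type;         /* type of operation */
    int32_t status;       /* return code for the operation */
    int64_t trailer_size; /* 0 unless readdir operation */
    char *trailer_buf;    /* NULL unless readdir operation */
    union {
        struct { uint64_t handle; } lookup;    /* hypothetical payload */
        struct { int32_t buf_index; } readdir; /* hypothetical payload */
    } resp;
};

/* Reset a downcall so only the member matching type needs filling in. */
void sketch_downcall_init(struct sketch_downcall *d, int32_t type)
{
    assert(d != NULL);
    memset(d, 0, sizeof(*d));
    d->type = type;
    d->trailer_buf = NULL;
}
```

The reader of such a struct looks at ``type`` first and only then touches
the matching member of ``resp``.
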
Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.

PINT_dev_write_list has a local iovec array: ``struct iovec io_array[10];``

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request

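The gather-write described above can be sketched with the standard
``writev()`` call. This is not PINT_dev_write_list itself:
``sketch_write_response`` and ``struct sketch_downcall`` are hypothetical
stand-ins, the struct is far smaller than pvfs2_downcall_t, and any
writable file descriptor takes the place of the pseudo device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical, much smaller stand-in for pvfs2_downcall_t. */
struct sketch_downcall {
    int32_t type;
    int32_t status;
    int64_t trailer_size;
};

/*
 * Gather the protocol version, magic number, tag and downcall, plus an
 * optional trailer, into a single writev() on fd.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t sketch_write_response(int fd, int32_t proto_ver, int32_t magic,
                              int64_t tag, struct sketch_downcall *dc,
                              char *trailer)
{
    struct iovec io_array[5];
    int count = 4;

    assert(dc != NULL);

    io_array[0].iov_base = &proto_ver;
    io_array[0].iov_len = sizeof(proto_ver);
    io_array[1].iov_base = &magic;
    io_array[1].iov_len = sizeof(magic);
    io_array[2].iov_base = &tag;
    io_array[2].iov_len = sizeof(tag);
    io_array[3].iov_base = dc;
    io_array[3].iov_len = sizeof(*dc);

    /* Only readdir-style responses carry a trailer. */
    if (trailer != NULL && dc->trailer_size > 0) {
        io_array[4].iov_base = trailer;
        io_array[4].iov_len = (size_t)dc->trailer_size;
        count = 5;
    }

    return writev(fd, io_array, count);
}
```

Because all pieces go out in one call, the reader on the other end sees
the header fields, the downcall and the trailer as one contiguous message.
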
OrangeFS exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr. orangefs_inode_getattr uses two arguments to
help it decide whether or not to update an inode: "new" and "bypass".
OrangeFS keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any iteration of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0) orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. getattr_time is updated each time the
inode is updated.

Creation of a new object (file, dir, symlink) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". OrangeFS
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified OrangeFS stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. OrangeFS is a network filesystem, and objects
can potentially change out-of-band with any particular OrangeFS kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang

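A wrap-safe comparison, and the getattr decision built on it, might look
like the sketch below. The ``sketch_`` names are hypothetical, and
treating getattr_time as an expiry deadline is an assumption made for the
illustration. The comparison relies on the usual two's complement
conversion from ``unsigned long`` to ``long``.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True if time a is after time b, even if the counter wrapped in between.
 * Only valid while the two values are less than half the counter range
 * apart, which is the "fairly close" condition in the quote above.
 */
bool sketch_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Decide whether cached inode attributes must be refreshed. */
bool sketch_needs_getattr(bool is_new, bool bypass,
                          unsigned long now, unsigned long getattr_time)
{
    /* New objects and explicit bypass requests always refresh. */
    if (is_new || bypass)
        return true;
    /* Otherwise the cache is trusted until now passes the deadline. */
    return sketch_time_after(now, getattr_time);
}
```

Subtracting first and interpreting the result as signed is what makes the
test survive a single wrap: a small counter value just after the wrap
still compares as later than a large value from just before it.
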