.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, Bioinformatics.

Orangefs, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

Orangefs features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab. It is a
single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client.
The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

Orangefs versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint

    make

    make install

Create an orangefs config file by running pvfs2-genconfig and
specifying a target config file. Pvfs2-genconfig will prompt you
through.
Generally it works fine to take the defaults, but you
should use your server's hostname, rather than "localhost" when
it comes to that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file (localhost is fine)::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. Pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If stuff seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests. This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable posix locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS.
    Posix locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.


Debugging
=========

If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) to go to syslog::

    echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

    echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

    echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

    echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

    cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

Orangefs is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of Orangefs as "userspace"
from here on out. Orangefs descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures.
Function and variable names in
the kernel module have been transitioned to "orangefs", and The Linux
Coding Style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers, one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl.
The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt
    - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.

Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

 * unknown
    - op was just initialized
 * waiting
    - op is on request_list (upward bound)
 * inprogr
    - op is in progress (waiting for downcall)
 * serviced
    - op has matching downcall; ok
 * purged
    - op has to start a timer since client-core
      exited uncleanly before servicing op
 * given up
    - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

Service_operation changes the op's state to "waiting", puts
it on the request list, and signals the Orangefs file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.
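
The op life cycle described in this section can be read as a small state
machine. The sketch below is only an illustration of that reading: the
``demo_`` names are hypothetical and are not the identifiers used by the
kernel module, and treating "serviced", "purged" and "given up" as final
states is a simplifying assumption.

```c
#include <stdbool.h>

/* Hypothetical names; they mirror the states listed in the text. */
enum demo_op_state {
	DEMO_OP_UNKNOWN,	/* op was just initialized */
	DEMO_OP_WAITING,	/* on the request list, upward bound */
	DEMO_OP_INPROGR,	/* read by userspace, waiting for downcall */
	DEMO_OP_SERVICED,	/* matching downcall arrived */
	DEMO_OP_PURGED,		/* client-core exited before servicing */
	DEMO_OP_GIVEN_UP,	/* submitter stopped waiting */
};

/* Return true if moving from 'from' to 'to' fits the described flow. */
static bool demo_op_transition_ok(enum demo_op_state from,
				  enum demo_op_state to)
{
	switch (from) {
	case DEMO_OP_UNKNOWN:
		/* service_operation queues the op */
		return to == DEMO_OP_WAITING;
	case DEMO_OP_WAITING:
		/* read succeeds, read fails (stays waiting), or nobody services it */
		return to == DEMO_OP_INPROGR || to == DEMO_OP_WAITING ||
		       to == DEMO_OP_GIVEN_UP;
	case DEMO_OP_INPROGR:
		/* downcall written, client-core died, or submitter gave up */
		return to == DEMO_OP_SERVICED || to == DEMO_OP_PURGED ||
		       to == DEMO_OP_GIVEN_UP;
	default:
		/* assumption: the remaining states are final in this sketch */
		return false;
	}
}
```

A caller could use ``demo_op_transition_ok()`` as a sanity check before
changing a state, for example rejecting a jump straight from "unknown" to
"serviced".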

When the Orangefs file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the Orangefs
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

Service operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.

The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon.
If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the Orangefs file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use Orangefs will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module.
    The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.
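
The way a response is framed for writev() can be sketched as follows. This
is a rough illustration only, not the code of PINT_dev_write_list:
``struct demo_downcall`` is a hypothetical stand-in that carries just the
four non-union members named in this section (the real pvfs2_downcall_t is
larger), and ``demo_build_response`` is an invented helper name. The element
order follows the io_array layout described in this section.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in for pvfs2_downcall_t (union omitted). */
struct demo_downcall {
	int32_t type;
	int32_t status;
	int64_t trailer_size;	/* 0 unless this is a readdir response */
	char *trailer_buf;	/* NULL unless this is a readdir response */
};

/*
 * Fill iov[] (room for at least five elements) and return how many
 * elements should be handed to writev().
 */
static int demo_build_response(struct iovec *iov, int32_t *proto_ver,
			       int32_t *magic, int64_t *tag,
			       struct demo_downcall *dc)
{
	int n = 0;

	iov[n].iov_base = proto_ver;	/* protocol version */
	iov[n++].iov_len = sizeof(*proto_ver);
	iov[n].iov_base = magic;	/* device magic number */
	iov[n++].iov_len = sizeof(*magic);
	iov[n].iov_base = tag;		/* tag matching the upcall */
	iov[n++].iov_len = sizeof(*tag);
	iov[n].iov_base = dc;		/* the downcall itself */
	iov[n++].iov_len = sizeof(*dc);

	/* Only readdir responses carry a trailer as a fifth element. */
	if (dc->trailer_size > 0 && dc->trailer_buf != NULL) {
		iov[n].iov_base = dc->trailer_buf;
		iov[n++].iov_len = (size_t)dc->trailer_size;
	}
	return n;
}
```

The returned count would then be passed as the ``iovcnt`` argument of
writev() together with the file descriptor of the request device.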

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

    io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
    io_array[0].iov_len = sizeof(int32_t)

    io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
    io_array[1].iov_len = sizeof(int32_t)

    io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
    io_array[2].iov_len = sizeof(int64_t)

    io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                           of global variable vfs_request (vfs_request_t)
    io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

    io_array[4].iov_base = contents of member trailer_buf (char *)
                           from out_downcall member of global variable
                           vfs_request
    io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                          from out_downcall member of global variable
                          vfs_request

Orangefs exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
help it decide whether or not to update an inode: "new" and "bypass".
Orangefs keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any iteration of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0) orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. Getattr_time is updated each time the
inode is updated.

Creation of a new object (file, dir, sym-link) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". Orangefs
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified Orangefs stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. Orangefs is a network filesystem, and objects
can potentially change out-of-band with any particular Orangefs kernel module
instance, so trusting a dentry is risky.
The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang
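
The wrap-safe comparison that the quotation describes can be sketched in C
as below. It uses the same signed-difference idea as the kernel's
time_after() macro from <linux/jiffies.h>, but it is a userspace
illustration, not the Orangefs code: the 32-bit counter type and the
``demo_`` names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit tick counter standing in for jiffies. */
typedef uint32_t demo_ticks_t;

/*
 * True if tick 'a' is later than tick 'b'. The unsigned subtraction
 * wraps modulo 2^32; reading the result as signed gives the right
 * answer as long as the two times are less than 2^31 ticks apart.
 * (The unsigned-to-signed conversion is implementation-defined in C,
 * but behaves as two's complement on common compilers.)
 */
static bool demo_ticks_after(demo_ticks_t a, demo_ticks_t b)
{
	return (int32_t)(uint32_t)(b - a) < 0;
}

/* True once 'now' has passed a stored expiry time such as d_time. */
static bool demo_timed_out(demo_ticks_t now, demo_ticks_t expiry)
{
	return demo_ticks_after(now, expiry);
}
```

With this comparison, a counter value of 5 read shortly after the counter
wrapped is still treated as later than a value recorded just before the
wrap, which a plain ``a > b`` test would get wrong.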