.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability.  No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre.  Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.  File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughput.  When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures.  The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads.  In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation.  The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full capacity, new nodes can be easily
added and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system.  Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

Ceph also provides some recursive accounting on directories for nested
files and bytes.  That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes.  This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.

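As a brief illustration (the paths below are placeholders, and the
'ceph.dir.*' attribute names are examples of the recursive statistics
exposed as virtual extended attributes; the exact set may vary by
release), a snapshot can be created and the nested totals queried from
an ordinary shell::

    # create and later remove a snapshot of /mnt/ceph/some/dir
    mkdir /mnt/ceph/some/dir/.snap/my_snapshot
    rmdir /mnt/ceph/some/dir/.snap/my_snapshot

    # query recursive accounting without a 'du'-style scan
    getfattr -n ceph.dir.rfiles /mnt/ceph/some/dir
    getfattr -n ceph.dir.rbytes /mnt/ceph/some/dir
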
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy.  Quotas can be set using
the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes',
e.g.::

    setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
    getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached.  A modified or adversarial client cannot be prevented
from writing as much data as it needs.

Mount Syntax
============

The basic mount syntax is::

    # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt

You only need to specify a single monitor, as the client will get the
full list when it connects.  (However, if the monitor you specify
happens to be down, the mount won't succeed.)  The port can be left
off if the monitor is using the default.  So if the monitor is at
1.2.3.4::

    # mount -t ceph 1.2.3.4:/ /mnt/ceph

is sufficient.  If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address.

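The same syntax covers multiple monitors and a subdirectory mount.  As a
sketch (the addresses and paths here are placeholders), the subtree
/some/dir can be mounted from a cluster with three monitors like this::

    # mount -t ceph 1.2.3.4,1.2.3.5,1.2.3.6:/some/dir /mnt/ceph
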
Mount Options
=============

  ip=A.B.C.D[:N]
    Specify the IP and/or port the client should bind to locally.
    There is normally not much reason to do this.  If the IP is not
    specified, the client's IP address is determined by looking at the
    address its connection to the monitor originates from.

  wsize=X
    Specify the maximum write size in bytes.  Default: 64 MB.

  rsize=X
    Specify the maximum read size in bytes.  Default: 64 MB.

  rasize=X
    Specify the maximum readahead size in bytes.  Default: 8 MB.

  mount_timeout=X
    Specify the timeout value for mount (in seconds), in the case
    of a non-responsive Ceph file system.  The default is 60
    seconds.

  caps_max=X
    Specify the maximum number of caps to hold.  Unused caps are
    released when the number of caps exceeds the limit.  The default
    is 0 (no limit).

  rbytes
    When stat() is called on a directory, set st_size to 'rbytes',
    the summation of file sizes over all files nested beneath that
    directory.  This is the default.

  norbytes
    When stat() is called on a directory, set st_size to the
    number of entries in that directory.

  nocrc
    Disable CRC32C calculation for data writes.  If set, the storage
    node must rely on TCP's error correction to detect data corruption
    in the data payload.

  dcache
    Use the dcache contents to perform negative lookups and
    readdir when the client has the entire directory contents in
    its cache.  (This does not change correctness; the client uses
    cached metadata only when a lease or capability ensures it is
    valid.)

  nodcache
    Do not use the dcache as above.  This avoids a significant amount of
    complex code, sacrificing performance without affecting correctness,
    and is useful for tracking down bugs.

  noasyncreaddir
    Do not use the dcache as above for readdir.

  noquotadf
    Report overall filesystem usage in statfs instead of using the root
    directory quota.

  nocopyfrom
    Don't use the RADOS 'copy-from' operation to perform remote object
    copies.  Currently, it's only used in copy_file_range, which will
    revert to the default VFS implementation if this option is used.

  recover_session=<no|clean>
    Set the auto reconnect mode for the case where the client is
    blocklisted.  The available modes are "no" and "clean".  The
    default is "no".

    * no: never attempt to reconnect when the client detects that it
      has been blocklisted.  Operations will generally fail after being
      blocklisted.

    * clean: the client reconnects to the ceph cluster automatically
      when it detects that it has been blocklisted.  During reconnect,
      the client drops dirty data/metadata, invalidates page caches and
      writable file handles.  After reconnect, file locks become stale
      because the MDS loses track of them.  If an inode contains any
      stale file locks, read/write on the inode is not allowed until
      applications release all stale file locks.

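As a sketch of how these fit together (the values chosen here are
arbitrary examples, not recommendations), several options can be
combined in a single mount invocation::

    # mount -t ceph 1.2.3.4:/ /mnt/ceph \
        -o rasize=16777216,caps_max=65536,noquotadf,recover_session=clean
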
More Information
================

For more information on Ceph, see the home page at
    https://ceph.com/

The Linux kernel client source tree is available at
    - https://github.com/ceph/ceph-client.git
    - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

and the source for the full system is at
    https://github.com/ceph/ceph.git