.. _cleancache:

==========
Cleancache
==========

Motivation
==========

Cleancache is a new optional feature provided by the VFS layer that
potentially dramatically increases page cache effectiveness for
many workloads in many environments at a negligible cost.

Cleancache can be thought of as a page-granularity victim cache for clean
pages that the kernel's pageframe replacement algorithm (PFRA) would like
to keep around, but can't since there isn't enough memory.  So when the
PFRA "evicts" a page, it first attempts to use cleancache code to
put the data contained in that page into "transcendent memory", memory
that is not directly accessible or addressable by the kernel and is
of unknown and possibly time-varying size.

Later, when a cleancache-enabled filesystem wishes to access a page
in a file on disk, it first checks cleancache to see if it already
contains it; if it does, the page of data is copied into the kernel
and a disk access is avoided.

Transcendent memory "drivers" for cleancache are currently implemented
in Xen (using hypervisor memory) and zcache (using in-kernel compressed
memory) and other implementations are in development.

:ref:`FAQs <faq>` are included below.

Implementation Overview
=======================

A cleancache "backend" that provides transcendent memory registers itself
to the kernel's cleancache "frontend" by calling cleancache_register_ops,
passing a pointer to a cleancache_ops structure with funcs set appropriately.
The functions provided must conform to certain semantics as follows:

Most important, cleancache is "ephemeral".  Pages which are copied into
cleancache have an indefinite lifetime which is completely unknowable
by the kernel and so may or may not still be in cleancache at any later time.
Thus, as its name implies, cleancache is not suitable for dirty pages.
Cleancache has complete discretion over what pages to preserve and what
pages to discard and when.

Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a
pool id which, if positive, must be saved in the filesystem's superblock;
a negative return value indicates failure.  A "put_page" will copy a
(presumably about-to-be-evicted) page into cleancache and associate it with
the pool id, a file key, and a page index into the file.  (The combination
of a pool id, a file key, and an index is sometimes called a "handle".)
A "get_page" will copy the page, if found, from cleancache into kernel memory.
An "invalidate_page" will ensure the page no longer is present in cleancache;
an "invalidate_inode" will invalidate all pages associated with the specified
file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
all pages in all files specified by the given pool id and also surrender
the pool id.
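
For concreteness, the backend-facing interface looks roughly like the
sketch below.  It approximates the declarations in
``include/linux/cleancache.h``; the exact prototypes (for example, the key
type taken by "init_shared_fs") have varied across kernel versions, so
treat it as illustrative rather than authoritative::

  #define CLEANCACHE_KEY_MAX 6

  /* The "file key" part of a handle: an inode number, or an
   * exportfs-style file handle for filesystems that provide one. */
  struct cleancache_filekey {
          union {
                  ino_t ino;
                  __u32 fh[CLEANCACHE_KEY_MAX];
                  u32 key[CLEANCACHE_KEY_MAX];
          } u;
  };

  /* Callbacks supplied by a backend; a "handle" is the triple
   * (pool id, file key, page index). */
  struct cleancache_ops {
          int (*init_fs)(size_t pagesize);
          int (*init_shared_fs)(uuid_t *uuid, size_t pagesize);
          int (*get_page)(int pool_id, struct cleancache_filekey key,
                          pgoff_t index, struct page *page);
          void (*put_page)(int pool_id, struct cleancache_filekey key,
                           pgoff_t index, struct page *page);
          void (*invalidate_page)(int pool_id, struct cleancache_filekey key,
                                  pgoff_t index);
          void (*invalidate_inode)(int pool_id, struct cleancache_filekey key);
          void (*invalidate_fs)(int pool_id);
  };

  /* A backend registers its callbacks once, typically at module init: */
  int cleancache_register_ops(const struct cleancache_ops *ops);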
An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
to treat the pool as shared using a 128-bit UUID as a key.  On systems
that may run multiple kernels (such as hard partitioned or virtualized
systems) that may share a clustered filesystem, and where cleancache
may be shared among those kernels, calls to init_shared_fs that specify the
same UUID will receive the same pool id, thus allowing the pages to
be shared.  Note that any security requirements must be imposed outside
of the kernel (e.g. by "tools" that control cleancache).  Or a
cleancache implementation can simply disable shared_init by always
returning a negative value.

If a get_page is successful on a non-shared pool, the page is invalidated
(thus making cleancache an "exclusive" cache).  On a shared pool, the page
is NOT invalidated on a successful get_page so that it remains accessible to
other sharers.  The kernel is responsible for ensuring coherency between
cleancache (shared or not), the page cache, and the filesystem, using
cleancache invalidate operations as required.

Note that cleancache must enforce put-put-get coherency and get-get
coherency.  For the former, if two puts are made to the same handle but
with different data, say AAA by the first put and BBB by the second, a
subsequent get can never return the stale data (AAA).  For get-get coherency,
if a get for a given handle fails, subsequent gets for that handle will
never succeed unless preceded by a successful put with that handle.

Last, cleancache provides no SMP serialization guarantees; if two
different Linux threads are simultaneously putting and invalidating a page
with the same handle, the results are indeterminate.  Callers must
lock the page to ensure serial behavior.
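
The kernel's own eviction-time hook illustrates both points.  The sketch
below approximates the call site in ``__delete_from_page_cache()`` in
``mm/filemap.c`` (exact conditions and placement vary by kernel version):
the caller holds the page lock, which supplies the serialization that
cleancache itself does not, and a page that cannot safely be reused is
invalidated rather than put, which preserves put-put-get coherency::

  /*
   * The page is being removed from the page cache, with the page lock
   * held by the caller.  A clean, disk-consistent page is offered to
   * cleancache; anything else is invalidated so that a stale copy can
   * never be returned by a later get.
   */
  if (PageUptodate(page) && PageMappedToDisk(page))
          cleancache_put_page(page);
  else
          cleancache_invalidate_page(mapping, page);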
Cleancache Performance Metrics
==============================

If properly configured, monitoring of cleancache is done via debugfs in
the `/sys/kernel/debug/cleancache` directory.  The effectiveness of
cleancache can be measured (across all filesystems) with:

``succ_gets``
        number of gets that were successful

``failed_gets``
        number of gets that failed

``puts``
        number of puts attempted (all "succeed")

``invalidates``
        number of invalidates attempted

A backend implementation may provide additional metrics.
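
Such backend metrics are ordinarily exposed through the same debugfs
mechanism.  The fragment below is a hypothetical sketch (the directory
name, counter name, and the counter itself are invented for illustration)
of how a backend might publish one extra counter next to the frontend's::

  #include <linux/debugfs.h>

  /* Hypothetical counter maintained by an example backend. */
  static u64 example_backend_compressed_pages;

  static int __init example_backend_debugfs_init(void)
  {
          struct dentry *root;

          root = debugfs_create_dir("example_backend", NULL);
          debugfs_create_u64("compressed_pages", 0444, root,
                             &example_backend_compressed_pages);
          return 0;
  }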
.. _faq:

FAQ
===

* Where's the value? (Andrew Morton)

Cleancache provides a significant performance benefit to many workloads
in many environments with negligible overhead by improving the
effectiveness of the pagecache.  Clean pagecache pages are
saved in transcendent memory (RAM that is otherwise not directly
addressable by the kernel); fetching those pages later avoids "refaults"
and thus disk reads.

Cleancache (and its sister code "frontswap") provide interfaces for
this transcendent memory (aka "tmem"), which conceptually lies between
fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
Disallowing direct kernel or userland reads/writes to tmem
is ideal when data is transformed to a different form and size (such
as with compression) or secretly moved (as might be useful for
write-balancing for some RAM-like devices).  Evicted page-cache pages
(and swap pages) are a great use for this kind of
slower-than-RAM-but-much-faster-than-disk transcendent memory, and the
cleancache (and frontswap) "page-object-oriented" specification provides
a nice way to read and write -- and indirectly "name" -- the pages.

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines.  This is really hard to do with RAM and efforts to
do it well with no kernel change have essentially failed (except in some
well-publicized special-case workloads).  Cleancache -- and frontswap --
with a fairly small impact on the kernel, provide a huge amount
of flexibility for more dynamic, flexible RAM multiplexing.
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization.  And when guest OS's are induced to surrender
underutilized RAM (e.g. with "self-ballooning"), page cache pages
are the first to go, and cleancache allows those pages to be
saved and reclaimed if overall host system memory conditions allow.

And the identical interface used for cleancache can be used in
physical systems as well.  The zcache driver acts as a memory-hungry
device that stores pages of data in a compressed state.  And
the proposed "RAMster" driver shares RAM across multiple physical
systems.

* Why does cleancache have its sticky fingers so deep inside the
  filesystems and VFS? (Andrew Morton and Christoph Hellwig)

The core hooks for cleancache in VFS are in most cases a single line
and the minimum set are placed precisely where needed to maintain
coherency (via cleancache_invalidate operations) between cleancache,
the page cache, and disk.  All hooks compile into nothingness if
cleancache is config'ed off and turn into a
function-pointer-compare-to-NULL if config'ed on but no backend claims
the ops functions, or to a compare-struct-element-to-negative if a
backend claims the ops functions but a filesystem doesn't enable
cleancache.

Some filesystems are built entirely on top of VFS and the hooks
in VFS are sufficient, so don't require an "init_fs" hook; the
initial implementation of cleancache didn't provide this hook.
But for some filesystems (such as btrfs), the VFS hooks are
incomplete and one or more hooks in fs-specific code are required.
And for some other filesystems, such as tmpfs, cleancache may
be counterproductive.  So it seemed prudent to require a filesystem
to "opt in" to use cleancache, which requires adding a hook in
each filesystem.  Not all filesystems are supported by cleancache
only because they haven't been tested.  The existing set should
be sufficient to validate the concept, the opt-in approach means
that untested filesystems are not affected, and the hooks in the
existing filesystems should make it very easy to add more
filesystems in the future.

The total impact of the hooks to existing fs and mm files is only
about 40 lines added (not counting comments and blank lines).
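
For a sense of what those hooks look like, the read-path hook is typical:
a single conditional placed where a block-based filesystem is about to
issue a disk read.  The sketch below approximates the hook used by
``do_mpage_readpage()`` (the exact placement and surrounding conditions
vary by kernel version)::

  /*
   * Ask cleancache for the page before submitting a read bio.
   * cleancache_get_page() compiles away entirely when cleancache is
   * configured off; otherwise it reduces to a NULL check on the
   * registered ops and a pool-id check on the superblock before
   * calling into the backend.
   */
  if (cleancache_get_page(page) == 0) {
          SetPageUptodate(page);
          unlock_page(page);
          return;                 /* disk read avoided */
  }
  /* ...otherwise fall through and read the page from disk... */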
* Why not make cleancache asynchronous and batched so it can more
  easily interface with real devices with DMA instead of copying each
  individual page? (Minchan Kim)

The one-page-at-a-time copy semantics simplifies the implementation
on both the frontend and backend and also allows the backend to
do fancy things on-the-fly like page compression and
page deduplication.  And since the data is "gone" (copied into/out
of the pageframe) before the cleancache get/put call returns,
many race conditions and potential coherency issues
are avoided.  While the interface seems odd for a "real device"
or for real kernel-addressable RAM, it makes perfect sense for
transcendent memory.

* Why is non-shared cleancache "exclusive"?  And where is the
  page "invalidated" after a "get"? (Minchan Kim)

The main reason is to free up space in transcendent memory and
to avoid unnecessary cleancache_invalidate calls.  If you want inclusive,
the page can be "put" immediately following the "get".  If
put-after-get for inclusive becomes common, the interface could
be easily extended to add a "get_no_invalidate" call.

The invalidate is done by the cleancache backend implementation.

* What's the performance impact?

Performance analysis has been presented at OLS'09 and LCA'10.
Briefly, performance gains can be significant on most workloads,
especially when memory pressure is high (e.g. when RAM is
overcommitted in a virtual workload); and because the hooks are
invoked primarily in place of or in addition to a disk read/write,
overhead is negligible even in worst case workloads.  Basically
cleancache replaces I/O with memory-copy-CPU-overhead; on older
single-core systems with slow memory-copy speeds, cleancache
has little value, but in newer multicore machines, especially
consolidated/virtualized machines, it has great value.

* How do I add cleancache support for filesystem X? (Boaz Harrash)

Filesystems that are well-behaved and conform to certain
restrictions can utilize cleancache simply by making a call to
cleancache_init_fs at mount time.  Unusual, misbehaving, or
poorly layered filesystems must either add additional hooks
and/or undergo extensive additional testing... or should just
not enable the optional cleancache.

Some points for a filesystem to consider (a minimal opt-in sketch
follows the list):

  - The FS should be block-device-based (e.g. a ram-based FS such
    as tmpfs should not enable cleancache)
  - To ensure coherency/correctness, the FS must ensure that all
    file removal or truncation operations either go through VFS or
    add hooks to do the equivalent cleancache "invalidate" operations
  - To ensure coherency/correctness, either inode numbers must
    be unique across the lifetime of the on-disk file OR the
    FS must provide an "encode_fh" function.
  - The FS must call the VFS superblock alloc and deactivate routines
    or add hooks to do the equivalent cleancache calls done there.
  - To maximize performance, all pages fetched from the FS should
    go through the do_mpage_readpage routine or the FS should add
    hooks to do the equivalent (cf. btrfs)
  - Currently, the FS blocksize must be the same as PAGESIZE.  This
    is not an architectural restriction, but no backends currently
    support anything different.
  - A clustered FS should invoke the "shared_init_fs" cleancache
    hook to get best performance for some backends.
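
Under those constraints, the opt-in itself is small.  The sketch below
shows a hypothetical block-device-based filesystem ("examplefs", invented
for illustration) enabling cleancache from its fill_super, using the
frontend helper declared in ``include/linux/cleancache.h``; the usual
superblock setup is elided::

  #include <linux/cleancache.h>

  static int examplefs_fill_super(struct super_block *sb, void *data,
                                  int silent)
  {
          /* ...read the on-disk superblock, set s_op, s_blocksize... */

          /*
           * Opt in: ask the backend (if one is registered) for a pool
           * id, which the frontend records in sb->cleancache_poolid.
           * A clustered filesystem would call cleancache_init_shared_fs()
           * with its 128-bit UUID instead.
           */
          cleancache_init_fs(sb);
          return 0;
  }

No explicit unmount-time call appears in the sketch: a filesystem that
uses the standard VFS superblock deactivate path gets the corresponding
invalidate_fs behavior from the generic code, as the checklist above notes.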
* Why not use the KVA of the inode as the key? (Christoph Hellwig)

If cleancache would use the inode virtual address instead of
inode/filehandle, the pool id could be eliminated.  But, this
won't work because cleancache retains pagecache data pages
persistently even when the inode has been pruned from the
inode unused list, and only invalidates the data page if the file
gets removed/truncated.  So if cleancache used the inode kva,
there would be potential coherency issues if/when the inode
kva is reused for a different file.  Alternately, if cleancache
invalidated the pages when the inode kva was freed, much of the value
of cleancache would be lost because the cache of pages in cleancache
is potentially much larger than the kernel pagecache and is most
useful if the pages survive inode cache removal.

* Why is a global variable required?

The cleancache_enabled flag is checked in all of the frequently-used
cleancache hooks.  The alternative is a function call to check a static
variable.  Since cleancache is enabled dynamically at runtime, systems
that don't enable cleancache would suffer thousands (possibly
tens-of-thousands) of unnecessary function calls per second.  So the
global variable allows cleancache to be enabled by default at compile
time, but have insignificant performance impact when cleancache remains
disabled at runtime.

* Does cleancache work with KVM?

The memory model of KVM is sufficiently different that a cleancache
backend may have less value for KVM.  This remains to be tested,
especially in an overcommitted system.

* Does cleancache work in userspace?  It sounds useful for
  memory hungry caches like web browsers.  (Jamie Lokier)

No plans yet, though we agree it sounds useful, at least for
apps that bypass the page cache (e.g. O_DIRECT).

Last updated: Dan Magenheimer, April 13 2011