.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts.)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size.
The driver links itself to frontswap by calling frontswap_register_ops to set
the frontswap_ops funcs appropriately, and the functions it provides must
conform to certain policies as follows:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type"). A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page. A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory. An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.

Once a page is successfully stored, a matching load on the page will normally
succeed. So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap. If the store returns
success, the data has been successfully saved to transcendent memory and
a disk write and, if the data is later read back, a disk read are avoided.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough().
In this mode, the reduction
in swap device writes is lost (and also a non-trivial performance advantage)
in order to allow the backend to arbitrarily "reclaim" space used to
store frontswap pages to more completely manage its memory usage.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated. This ensures stale data may
never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the `/sys/kernel/debug/frontswap` directory. The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
	how many store attempts have failed

``loads``
	how many loads were attempted (all should succeed)

``succ_stores``
	how many store attempts have succeeded

``invalidates``
	how many invalidates were attempted

A backend implementation may provide additional metrics.

FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable to the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices). Swap pages (and
evicted page-cache pages) are a great use for this kind of
slower-than-RAM-but-much-faster-than-disk "pseudo-RAM device" and the
frontswap (and cleancache) interface to transcendent memory provides a nice
way to read and write -- and indirectly "name" -- the pages.

Frontswap -- and cleancache -- with a fairly small impact on the kernel,
provide a huge amount of flexibility for more dynamic RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total anonymous pages
that can be safely kept in RAM. Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems.
Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM. This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa. RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines. This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
virtual machines, but the pages can be compressed and deduplicated to
optimize RAM utilization. And when guest OSes are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml.
And, using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'd
swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, one extra global variable is compared to zero for
every swap page read or written. If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd. This is added to the EIGHT bits (which
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd. (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.)
For very large swap disks (which are rare) on a standard
4K pagesize, this is 1MB per 32GB swap (32GB / 4KB is 8M pages, at one
bit each).

When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel. Exactly how much memory it provides is
entirely dynamic and random.

Whenever a swap-device is swapon'd, frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store is called. Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store returns -1 and the kernel swaps the page
to the swap device as normal.
Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page. But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data. In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend. If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete. If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.

So every time the frontswap backend accepts a page, a swap device write
and (potentially) a swap device read are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy.
Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes. Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable. Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable. This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend. In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O. The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time. Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem. That said, only the initial "store"
and "load" operations need be synchronous.
A separate asynchronous thread
is free to manipulate the pages stored by frontswap. For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit". For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap: Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page. Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all. This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices. For example, if NO swap device is configured on some
installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"?
  If a page has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed
to 1K. Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K. But the backend
has no more space. In this case, the store must be rejected. Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.

* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space. But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate. The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.
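
As a rough illustration of the store, load and invalidate rules discussed in
this FAQ, the following user-space C sketch models a backend with a fixed
capacity. A per-offset flag array plays the role of the frontswap_map, a load
does not remove the page, and a failed duplicate store invalidates the old
copy. Every name in it (the toy_* functions, TOY_CAPACITY, the reject-all
switch) is hypothetical and chosen only for this example; it is not the
kernel's frontswap code, and a real backend such as zcache would decide
capacity dynamically rather than with a constant.

```c
/* Toy user-space model of the frontswap store/load/invalidate contract.
 * All names are hypothetical; this is NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NR_SLOTS  8   /* swap offsets on the single modelled swap device */
#define TOY_CAPACITY  2   /* assumed fixed limit; real backends vary this */

static unsigned char toy_pages[TOY_NR_SLOTS][TOY_PAGE_SIZE]; /* pretend "transcendent memory" */
static bool toy_map[TOY_NR_SLOTS];  /* stands in for the frontswap_map bits */
static size_t toy_used;             /* pages currently held by the backend */
static bool toy_reject_all;         /* lets a caller force every store to fail */

void toy_set_reject_all(bool reject)
{
    toy_reject_all = reject;
}

/* Returns 0 if the backend accepted the page, -1 if it rejected it. */
int toy_store(size_t offset, const unsigned char *page)
{
    assert(offset < TOY_NR_SLOTS);
    bool dup = toy_map[offset];

    if (toy_reject_all || (!dup && toy_used >= TOY_CAPACITY)) {
        /* A failed duplicate store must also drop the old copy,
         * so that stale data can never be loaded later. */
        if (dup) {
            toy_map[offset] = false;
            toy_used--;
        }
        return -1;
    }
    memcpy(toy_pages[offset], page, TOY_PAGE_SIZE);
    if (!dup) {
        toy_map[offset] = true;
        toy_used++;
    }
    return 0;
}

/* Returns 0 and fills *page if the backend holds the offset, else -1.
 * The page stays in the backend after a successful load. */
int toy_load(size_t offset, unsigned char *page)
{
    assert(offset < TOY_NR_SLOTS);
    if (!toy_map[offset])
        return -1;
    memcpy(page, toy_pages[offset], TOY_PAGE_SIZE);
    return 0;
}

/* Drops one page from the backend, if present. */
void toy_invalidate_page(size_t offset)
{
    assert(offset < TOY_NR_SLOTS);
    if (toy_map[offset]) {
        toy_map[offset] = false;
        toy_used--;
    }
}

/* Drops every page of the modelled swap device (compare swapoff). */
void toy_invalidate_area(void)
{
    memset(toy_map, 0, sizeof(toy_map));
    toy_used = 0;
}
```

In this model a caller would treat -1 from toy_store() as the signal to write
the page to the real swap device, and -1 from toy_load() as the signal to read
it from there.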

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global. This seemed a reasonable compromise: Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012