.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance gains may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of on a swap disk.

(Note: frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts.)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device.  The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size.  The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops functions appropriately, and the functions it provides must
conform to the following policies:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type").  A "store" will
copy the page to transcendent memory and associate it with the page's
type and offset.  A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory.  An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., at swapoff) and notify the "device"
to refuse further stores with that swap type.

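The policies above can be illustrated with a small user-space toy model.
This is only a sketch in Python, not kernel code: the class name
``ToyBackend``, the dictionary-based storage, and the convention of
returning 0 for success and -1 for failure are assumptions made for
illustration and do not mirror the actual frontswap_ops signatures.

```python
class ToyBackend:
    """User-space toy model of a frontswap backend; not kernel code."""

    def __init__(self):
        self.areas = {}  # swap type -> {offset: page bytes}

    def init(self, swap_type):
        # Prepare to receive pages for this swap device number ("type").
        self.areas[swap_type] = {}

    def store(self, swap_type, offset, page):
        # Refuse stores for a type that was never initialised or was invalidated.
        if swap_type not in self.areas:
            return -1
        self.areas[swap_type][offset] = bytes(page)  # keep a private copy
        return 0

    def load(self, swap_type, offset):
        # Copy the page out if found; the stored copy stays in place.
        page = self.areas.get(swap_type, {}).get(offset)
        return None if page is None else bytes(page)

    def invalidate_page(self, swap_type, offset):
        # Remove a single page; unknown pages are silently ignored.
        self.areas.get(swap_type, {}).pop(offset, None)

    def invalidate_area(self, swap_type):
        # Drop every page for the type and refuse further stores to it.
        self.areas.pop(swap_type, None)
```

In this model, a store followed by two loads of the same type and offset
returns the same data both times, since a load does not remove the page;
only invalidate_page or invalidate_area does.
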
Once a page is successfully stored, a matching load on the page will normally
succeed.  So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap.  If the store returns
success, the data has been successfully saved to transcendent memory and
a disk write is avoided, as is a disk read if the data is later read back.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough().  In this mode, the reduction
in swap device writes is lost (along with a non-trivial performance advantage)
in order to allow the backend to arbitrarily "reclaim" space used to
store frontswap pages to more completely manage its memory usage.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated.  This ensures that stale data
can never be obtained from frontswap.

If properly configured, frontswap can be monitored via debugfs in
the ``/sys/kernel/debug/frontswap`` directory.  The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
	how many store attempts have failed

``loads``
	how many loads were attempted (all should succeed)

``succ_stores``
	how many store attempts have succeeded

``invalidates``
	how many invalidates were attempted

A backend implementation may provide additional metrics.

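As a rough illustration, the counters above could be collected from user
space with a few lines of Python.  This is a sketch under stated
assumptions: it assumes each counter file holds a single decimal integer,
that debugfs is mounted at the usual location, and that the caller may read
it (typically root only).  The function names ``read_frontswap_stats`` and
``store_success_ratio`` are made up for this example.

```python
from pathlib import Path

COUNTERS = ("loads", "succ_stores", "failed_stores", "invalidates")


def read_frontswap_stats(debug_dir="/sys/kernel/debug/frontswap"):
    """Return the counters listed above as a dict of ints (None if unreadable)."""
    stats = {}
    for name in COUNTERS:
        path = Path(debug_dir) / name
        try:
            stats[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            stats[name] = None  # counter missing, unreadable or not numeric
    return stats


def store_success_ratio(stats):
    """Fraction of store attempts the backend accepted, or None if unknown."""
    ok, failed = stats.get("succ_stores"), stats.get("failed_stores")
    if ok is None or failed is None or ok + failed == 0:
        return None
    return ok / (ok + failed)
```

A low success ratio suggests that the backend is rejecting most pages, in
which case most swap traffic still reaches the real swap device.
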
FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable by the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices).  Swap pages (and
evicted page-cache pages) are a great use for this kind of
slower-than-RAM-but-much-faster-than-disk "pseudo-RAM device" and the
frontswap (and cleancache) interface to transcendent memory provides a nice
way to read and write -- and indirectly "name" -- the pages.

With a fairly small impact on the kernel, frontswap -- and cleancache --
provides a great deal of flexibility for more dynamic RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total number of anonymous pages
that can be safely kept in RAM.  Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems.  Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM.  This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa.  RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines.  This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM not only to be "time-shared" between multiple
virtual machines, but also to have its pages compressed and deduplicated to
optimize RAM utilization.  And when guest OSes are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml.  And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'd
swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, one extra global variable is compared to zero for
every swap page read or written.  If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd.  This is added to the EIGHT bits (which
were sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd.  (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.)  With a standard 4K page size, this is 1MB per 32GB of swap,
which only adds up for very large swap disks (which are rare).

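The 1MB-per-32GB figure follows directly from one bit per 4K page.  The
short calculation below spells it out; the function name
``frontswap_map_bytes`` is made up for this example, and "MB" and "GB" are
taken to mean binary units (MiB and GiB).

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def frontswap_map_bytes(swap_bytes, page_size=4096):
    """Bytes needed for a bitmap holding one bit per swap page."""
    pages = swap_bytes // page_size
    return (pages + 7) // 8  # round up to whole bytes


def pages_in(swap_bytes, page_size=4096):
    """Number of swap pages on a device of the given size."""
    return swap_bytes // page_size
```

For a 32 GiB swap device, ``pages_in(32 * GiB)`` is 8,388,608 pages, so the
bitmap takes 1,048,576 bytes (1 MiB), while the existing eight bits per page
amount to 8 MiB.
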
When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages.  A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel.  Exactly how much memory it provides is
entirely dynamic and unpredictable.

Whenever a swap device is swapon'd, frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store is called.  Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store returns -1 and the kernel swaps the page
to the swap device as normal.  Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page.  But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data.  In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend.  If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete.  If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.

So every time the frontswap backend accepts a page, a swap device write
and (potentially) a swap device read are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

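The swap-out and swap-in paths described above can be sketched as a
user-space toy model.  Everything here is hypothetical and greatly
simplified: ``ToySwapDevice`` stands in for one swap device, a Python set
stands in for the one-bit-per-page frontswap_map, a dictionary stands in for
the disk, and ``make_picky_backend`` is an invented backend that accepts
only even offsets to show that acceptance is at the backend's discretion.

```python
class ToySwapDevice:
    """Toy model of one swap device with a frontswap "shadow"."""

    def __init__(self, backend_store, backend_load):
        self.backend_store = backend_store  # callable(offset, page) -> 0 or -1
        self.backend_load = backend_load    # callable(offset) -> bytes
        self.frontswap_map = set()          # offsets held by the backend
        self.disk = {}                      # offset -> page, the "real" device
        self.disk_writes = 0
        self.disk_reads = 0

    def swap_out(self, offset, page):
        if self.backend_store(offset, page) == 0:
            self.frontswap_map.add(offset)   # accepted: no disk write needed
            return
        self.frontswap_map.discard(offset)   # rejected: never read old data back
        self.disk[offset] = bytes(page)
        self.disk_writes += 1

    def swap_in(self, offset):
        if offset in self.frontswap_map:     # earlier accepted by the backend
            return self.backend_load(offset)
        self.disk_reads += 1
        return self.disk[offset]


def make_picky_backend():
    """Hypothetical backend that accepts pages at even offsets only."""
    held = {}

    def store(offset, page):
        if offset % 2:
            return -1
        held[offset] = bytes(page)
        return 0

    def load(offset):
        return held[offset]

    return store, load
```

Swapping out offsets 0 to 3 in this model causes two disk writes instead of
four, and swapping them back in causes two disk reads instead of four.
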
* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. zswap,
  or maybe swap-over-nbd/NFS)?

No.  First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes.  Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable.  Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable. This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend.  In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

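A compression-based admission policy of this kind can be sketched in a few
lines.  This is not zcache's actual policy or algorithm: zlib is used only
because it is readily available, and ``admit`` and its ``max_ratio``
parameter are hypothetical names for a threshold that a backend could
tighten or relax as memory constraints change.

```python
import zlib

PAGE_SIZE = 4096


def admit(page, max_ratio=0.5):
    """Return the compressed page if it compresses well enough, else None.

    max_ratio is a hypothetical policy knob: the largest acceptable
    compressed size as a fraction of the original page size.
    """
    compressed = zlib.compress(page)
    if len(compressed) <= max_ratio * len(page):
        return compressed
    return None  # "poorly" compressible: reject, the kernel swaps to disk
```

A page of zeroes is accepted with the default threshold, while a page of
random bytes is rejected; lowering ``max_ratio`` far enough rejects the
zero page as well, which models "poorly" being redefined under pressure.
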
Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O.  The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time.  Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem.  That said, only the initial "store"
and "load" operations need be synchronous.  A separate asynchronous thread
is free to manipulate the pages stored by frontswap.  For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit".  For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap:  Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page.  Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all.  This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices.  For example, if NO swap device is configured on some
installation, frontswap is useless.  Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"?  If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot.  Consider an example
where data is compressed and the original 4K page has been compressed
to 1K.  Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K.  But the backend
has no more space.  In this case, the store must be rejected.  Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible.  Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.

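The 1K-then-4K scenario can be reproduced with a toy backend that has a
fixed byte budget.  The sketch below is an illustration only: the class
``TinyBackend`` and its ``capacity`` parameter are invented, and the length
of the data passed to ``store`` simply stands in for the compressed size of
a page.

```python
class TinyBackend:
    """Toy backend with a fixed byte budget (sizes model compressed sizes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}  # offset -> stored bytes

    def used(self):
        return sum(len(p) for p in self.pages.values())

    def store(self, offset, data):
        # A duplicate store drops the old copy first, so that a failure
        # below can never leave stale data behind.
        self.pages.pop(offset, None)
        if self.used() + len(data) > self.capacity:
            return -1  # rejected; the old copy is already invalidated
        self.pages[offset] = bytes(data)
        return 0

    def load(self, offset):
        return self.pages.get(offset)
```

With a 4K budget holding a 1K page at offset 0 and a 3K page at offset 1, an
attempt to overwrite offset 0 with 4K of data fails, and a subsequent load of
offset 0 finds nothing rather than the old 1K of data.
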
* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space.  But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate.  The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

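The idea of shrinking can be modelled very loosely as moving pages from the
backend back into kernel-addressable memory until the backend holds no more
than a target number of pages.  The function below is a toy with invented
names and plain dictionaries; it says nothing about how the kernel chooses
which pages to bring back.

```python
def shrink(backend_pages, kernel_pages, target_pages):
    """Move pages out of the backend until it holds at most target_pages.

    backend_pages and kernel_pages are dicts mapping offset -> page bytes.
    Returns the number of pages moved.
    """
    moved = 0
    for offset in sorted(backend_pages):  # sorted() copies the keys first
        if len(backend_pages) <= target_pages:
            break
        kernel_pages[offset] = backend_pages.pop(offset)
        moved += 1
    return moved
```

Calling ``shrink`` with a target at or above the current page count moves
nothing, so it is safe to call repeatedly as memory pressure changes.
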
* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global.  This seemed a reasonable compromise:  Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012