.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance gains may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note: frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts.)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device.  The storage is assumed to be
a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size.  The driver
links itself to frontswap by calling frontswap_register_ops() to set the
frontswap_ops functions appropriately, and the functions it provides must
conform to certain policies as follows:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type").  A "store" will
copy the page to transcendent memory and associate it with the type and
offset associated with the page.  A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory.  An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., at swapoff) and notify the "device"
to refuse further stores with that swap type.

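The contract of the five operations can be modelled in user space.  The
following is a minimal Python sketch, not kernel code: the class name
``ToyBackend``, the dictionary-based storage and the 0 / -1 return values
are assumptions made for illustration only.

```python
class ToyBackend:
    """User-space model of the five backend operations (not kernel code)."""

    def __init__(self):
        self.pages = {}      # (swap type, offset) -> private copy of the page
        self.active = set()  # swap types that init() has prepared

    def init(self, swap_type):
        # Prepare to receive pages for this swap device number.
        self.active.add(swap_type)

    def store(self, swap_type, offset, page):
        # Copy the page and key it by (type, offset); 0 = accepted, -1 = rejected.
        if swap_type not in self.active:
            return -1
        self.pages[(swap_type, offset)] = bytes(page)
        return 0

    def load(self, swap_type, offset):
        # Return the data if found; the stored page is NOT removed.
        return self.pages.get((swap_type, offset))

    def invalidate_page(self, swap_type, offset):
        # Forget a single page.
        self.pages.pop((swap_type, offset), None)

    def invalidate_area(self, swap_type):
        # Drop every page of this swap type and refuse further stores to it.
        for key in [k for k in self.pages if k[0] == swap_type]:
            del self.pages[key]
        self.active.discard(swap_type)
```

In this model a second ``load`` of the same (type, offset) still returns the
data, and any ``store`` issued after ``invalidate_area`` is rejected until
``init`` is called again for that type.
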
Once a page is successfully stored, a matching load on the page will normally
succeed.  So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap.  If the store returns
success, the data has been successfully saved to transcendent memory, so
a disk write is avoided, as is a disk read if the data is later read back.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough().  In this mode, the reduction
in swap device writes (and with it a non-trivial performance advantage)
is given up so that the backend can arbitrarily "reclaim" space used to
store frontswap pages and thereby manage its memory usage more completely.

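The trade-off can be illustrated with a small user-space model.  This is
only a sketch: it assumes that writethrough mode means every stored page is
also written to the swap device, and the class and method names are invented
for the example.

```python
class WritethroughModel:
    """Toy model of a swap path with and without writethrough (not kernel code)."""

    def __init__(self, writethrough):
        self.writethrough = writethrough
        self.backend = {}  # offset -> page held in transcendent memory
        self.disk = {}     # offset -> page written to the real swap device

    def swap_out(self, offset, page):
        self.backend[offset] = bytes(page)   # assume the backend accepts the store
        if self.writethrough:
            self.disk[offset] = bytes(page)  # the device write is NOT avoided

    def backend_reclaim(self, offset):
        # Dropping a page is only safe when the device also holds a copy.
        if not self.writethrough:
            raise RuntimeError("page exists only in the backend")
        self.backend.pop(offset, None)

    def swap_in(self, offset):
        page = self.backend.get(offset)
        return page if page is not None else self.disk[offset]
```

With ``writethrough=True`` the model pays for a device write on every
swap-out, but ``swap_in`` still succeeds after ``backend_reclaim`` because it
falls back to the copy on the device.
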
Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated.  This ensures stale data can
never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the ``/sys/kernel/debug/frontswap`` directory.  The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
	how many store attempts have failed

``loads``
	how many loads were attempted (all should succeed)

``succ_stores``
	how many store attempts have succeeded

``invalidates``
	how many invalidates were attempted

A backend implementation may provide additional metrics.

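As a rough illustration, the four counters listed above can be collected
with a short user-space script.  The sketch below assumes that each counter
is a file containing a single decimal number and that the caller may read
debugfs (which normally requires root); the function names are made up for
this example.

```python
from pathlib import Path

COUNTERS = ("loads", "succ_stores", "failed_stores", "invalidates")


def read_frontswap_counters(debug_dir="/sys/kernel/debug/frontswap"):
    """Return the four counters as integers, or None for any unreadable file."""
    stats = {}
    for name in COUNTERS:
        try:
            stats[name] = int((Path(debug_dir) / name).read_text().strip())
        except (OSError, ValueError):
            stats[name] = None  # missing file, no permission, or not a number
    return stats


def store_success_ratio(stats):
    """Fraction of store attempts the backend accepted, or None if unknown."""
    ok, failed = stats.get("succ_stores"), stats.get("failed_stores")
    if ok is None or failed is None or ok + failed == 0:
        return None
    return ok / (ok + failed)
```

A ratio near zero suggests the backend is rejecting most pages, in which
case frontswap is saving little disk I/O on that system.
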
FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable by the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices).  Swap pages (and
evicted page-cache pages) are a great use for this kind of
slower-than-RAM-but-much-faster-than-disk "pseudo-RAM device" and the
frontswap (and cleancache) interface to transcendent memory provides a nice
way to read and write -- and indirectly "name" -- the pages.

Frontswap and cleancache, with a fairly small impact on the kernel,
provide a great deal of flexibility for more dynamic RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total number of anonymous pages
that can be safely kept in RAM.  Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems.  Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM.  This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa.  RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines.  This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM not only to be "time-shared" between multiple
virtual machines, but also to be compressed and deduplicated to
optimize RAM utilization.  And when guest OSes are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml.  And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, one extra global variable is compared to zero for
every swap page read or written.  If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd.  This is added to the EIGHT bits (sixteen
until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd.  (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.)  With a standard 4K page size, the extra bit costs 1MB per 32GB
of swap, so it matters only for very large swap disks (which are rare).

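The 1MB-per-32GB figure follows directly from one bit per swap page.  A
small worked calculation, with a hypothetical helper name and rounding to
whole bytes as a simplification:

```python
def frontswap_map_bytes(swap_bytes, page_size=4096):
    """Bytes of bitmap needed at one bit per swap page, rounded up to a byte."""
    pages = swap_bytes // page_size
    return (pages + 7) // 8
```

For a 32GB swap device with 4K pages this gives 8M pages, hence 8M bits,
hence 1MB of bitmap: ``frontswap_map_bytes(32 * 2**30)`` evaluates to
``2**20``.
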
When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages.  A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel.  Exactly how much memory it provides is
entirely dynamic and unpredictable.

Whenever a swap device is swapon'd, frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store() is called.  Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store() returns -1 and the kernel swaps the page
to the swap device as normal.  Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page.  But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data.  In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

When the swap subsystem needs to swap-in a page (swap_readpage()),
it first calls frontswap_load() which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend.  If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete.  If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.

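The store and load paths described above can be modelled in user space.
The sketch below is not kernel code: ``SwapDevice`` and ``EveryNinthBackend``
are invented names, the "every ninth page" policy is simplified to "offsets
divisible by nine", and a Python list of booleans stands in for the
frontswap_map bitmap.

```python
class EveryNinthBackend:
    """Hypothetical backend that accepts only offsets divisible by nine."""

    def __init__(self):
        self.pages = {}

    def store(self, offset, page):
        if offset % 9 != 0:
            return -1                     # "no room": the kernel uses the device
        self.pages[offset] = bytes(page)  # data is copied before store returns
        return 0

    def load(self, offset):
        return self.pages[offset]


class SwapDevice:
    """Toy swap device with one frontswap_map bit per slot."""

    def __init__(self, nr_slots, backend):
        self.disk = [None] * nr_slots     # stands in for the real swap device
        self.frontswap_map = [False] * nr_slots
        self.backend = backend

    def swap_out(self, offset, page):
        if self.backend.store(offset, page) == 0:
            self.frontswap_map[offset] = True   # the backend now holds the data
            return "frontswap"
        self.frontswap_map[offset] = False      # never point at a stale copy
        self.disk[offset] = bytes(page)         # swap to the device as normal
        return "disk"

    def swap_in(self, offset):
        if self.frontswap_map[offset]:
            return self.backend.load(offset)    # filled from the backend
        return self.disk[offset]                # normal swap-in from the device
```

Swapping out offsets 0 through 19 in this model sends offsets 0, 9 and 18 to
the backend and the rest to the device; ``swap_in`` consults the bit first, so
the device is never read for a page the backend accepted.
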
So every time the frontswap backend accepts a page, a swap device write
and (potentially) a swap device read are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No.  First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes.  Even if it were
rewritten, the existing swap subsystem uses the block I/O layer which
assumes a swap device is fixed size and any page in it is linearly
addressable.  Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable.  This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend.  In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

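One way to picture a dynamically defined "poorly" is a size threshold that
tightens as the backend fills up.  The policy below is purely hypothetical
(it is not how zcache decides) and uses zlib only as a convenient
stand-in compressor; all names and the half-page starting limit are
assumptions of the sketch.

```python
import zlib

PAGE_SIZE = 4096


def max_acceptable_size(free_bytes, total_bytes):
    """Largest compressed size accepted; shrinks linearly as memory fills."""
    # With everything free, accept pages that compress to half a page or less.
    return (PAGE_SIZE // 2) * free_bytes // total_bytes


def should_accept(page, free_bytes, total_bytes):
    """Decide at store time, since compressibility is not known in advance."""
    return len(zlib.compress(page)) <= max_acceptable_size(free_bytes, total_bytes)
```

Under this policy a page of zeroes is accepted while memory is plentiful, a
page of random bytes is always rejected, and even the zero page is rejected
once no free space remains.
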
Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O.  The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device including delaying
the write of some pages for a significant amount of time.  Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem.  That said, only the initial "store"
and "load" operations need be synchronous.  A separate asynchronous thread
is free to manipulate the pages stored by frontswap.  For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit".  For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap:  Since any "store" might fail, there must always be a real
slot on a real swap device to swap the page to.  Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all.  This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices.  For example, if NO swap device is configured on some
installation, frontswap is useless.  Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"?  If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot.  Consider an example
where data is compressed and the original 4K page has been compressed
to 1K.  Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K.  But the backend
has no more space.  In this case, the store must be rejected.  Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible.  Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.

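The scenario above can be reproduced with a toy compressed store.  This is
a user-space sketch only: ``CompressingBackend`` is an invented name, the
byte budget is an arbitrary assumption, and zlib stands in for whatever
compressor a real backend would use.

```python
import zlib


class CompressingBackend:
    """Toy compressed page store with a fixed byte budget (not zcache itself)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.blobs = {}  # offset -> compressed page

    def used(self):
        return sum(len(blob) for blob in self.blobs.values())

    def store(self, offset, page):
        blob = zlib.compress(page)
        self.blobs.pop(offset, None)  # a duplicate store always drops the old copy
        if self.used() + len(blob) > self.capacity:
            return -1                 # rejected, and the old data is already gone
        self.blobs[offset] = blob
        return 0

    def load(self, offset):
        blob = self.blobs.get(offset)
        return None if blob is None else zlib.decompress(blob)
```

With a 2048-byte budget, storing a page of zeroes at some offset succeeds;
overwriting the same offset with 4096 random bytes is rejected, and a
subsequent ``load`` finds nothing rather than the stale zero page.
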
* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space.  But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate.  The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global.  This seemed a reasonable compromise:  Define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012