=====
Cache
=====

Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (e.g., a spindle) by
dynamically migrating some of its data to a faster, smaller device
(e.g., an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool.  Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used by the thin-provisioning
target.

The decision as to what data to migrate and when is left to a plug-in
policy module.  Several of these have been written as we experiment,
and we hope other people will contribute others for specific I/O
scenarios (e.g., a VM image server).

Glossary
========

  Migration
               Movement of the primary copy of a logical block from one
               device to the other.
  Promotion
               Migration from slow device to fast device.
  Demotion
               Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness.  This metadata device may only
   be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size.  This block size
is configurable when you first create the cache.  Typically we've been
using block sizes of 256KB - 1024KB.  The block size must be between 64
sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
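
The block size rule above can be checked with a few lines of Python.
This is only a sketch of the stated constraints, not code taken from the
target; the function and constant names are made up for illustration,
and a 512-byte sector is assumed.

```python
SECTOR_BYTES = 512            # assumed device-mapper sector size
MIN_BLOCK_SECTORS = 64        # 32KB, smallest block size the text allows
MAX_BLOCK_SECTORS = 2097152   # 1GB, largest block size the text allows


def is_valid_block_size(sectors: int) -> bool:
    """Return True if `sectors` meets the constraints stated in the text."""
    return (
        MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS
        and sectors % MIN_BLOCK_SECTORS == 0
    )


def kb_to_sectors(kb: int) -> int:
    """Convert a size in KB (1024 bytes) to 512-byte sectors."""
    return kb * 1024 // SECTOR_BYTES


if __name__ == "__main__":
    # 16KB is too small, 48KB is not a multiple of 32KB, the rest pass.
    for kb in (16, 32, 48, 256, 1024):
        sectors = kb_to_sectors(kb)
        print(kb, sectors, is_valid_block_size(sectors))
```

The 256KB and 1024KB sizes mentioned above correspond to 512 and 2048
sectors respectively.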

Having a fixed block size simplifies the target a lot.  But it is
something of a compromise.  For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space.  And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).
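
To get a feel for the metadata side of this trade-off, the sketch below
counts how many cache blocks have to be tracked for one cache device at
several block sizes.  The 100GB cache size is a made-up figure, and the
per-block metadata cost is deliberately not modelled because the text
does not state it.

```python
SECTOR_BYTES = 512  # assumed sector size


def cache_block_count(cache_dev_bytes: int, block_sectors: int) -> int:
    """Number of whole cache blocks that fit on the cache device."""
    return cache_dev_bytes // (block_sectors * SECTOR_BYTES)


if __name__ == "__main__":
    cache_dev_bytes = 100 * 1024**3  # hypothetical 100GB cache device
    # 32KB, 256KB and 1024KB blocks, expressed in sectors.
    for block_sectors in (64, 512, 2048):
        print(block_sectors, cache_block_count(cache_dev_bytes, block_sectors))
```

Going from 1024KB blocks to 32KB blocks multiplies the number of blocks
to track by 32.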

Cache operating modes
---------------------

The cache has three operating modes: writeback, writethrough and
passthrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices.  Clean
blocks should remain clean.

If passthrough is selected, useful when the cache contents are not known
to be coherent with the origin device, then all reads are served from
the origin device (all reads miss the cache) and all writes are
forwarded to the origin device; additionally, write hits invalidate the
corresponding cache block.  To enable passthrough mode the cache must be
clean.  Passthrough mode allows a cache device to be activated without
having to worry about coherency.  Coherency that exists is maintained,
although the cache will gradually cool as writes take place.  If the
coherency of the cache can later be verified, or established through use
of the "invalidate_cblocks" message, the cache device can be
transitioned to writethrough or writeback mode while still warm.
Otherwise, the cache contents can be discarded prior to transitioning to
the desired operating mode.
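
The difference between the three modes can be illustrated with a toy
model.  This is not the kernel's logic: the class, its dictionaries and
the ``promote`` helper are invented for illustration, write misses are
simply sent to the origin, and policy decisions are ignored.

```python
class ToyCache:
    """Toy model of the three operating modes described above."""

    def __init__(self, mode: str):
        if mode not in ("writeback", "writethrough", "passthrough"):
            raise ValueError(mode)
        self.mode = mode
        self.origin = {}    # block number -> data on the slow device
        self.cache = {}     # block number -> data on the fast device
        self.dirty = set()  # cached blocks that are newer than the origin

    def promote(self, block: int) -> None:
        """Copy a block from the origin into the cache (clean copy)."""
        self.cache[block] = self.origin[block]

    def write(self, block: int, data: str) -> None:
        cached = block in self.cache
        if self.mode == "passthrough":
            self.origin[block] = data
            self.cache.pop(block, None)   # a write hit invalidates the block
        elif self.mode == "writethrough":
            self.origin[block] = data
            if cached:
                self.cache[block] = data  # both copies updated, stays clean
        elif cached:
            self.cache[block] = data      # writeback hit: cache only
            self.dirty.add(block)
        else:
            self.origin[block] = data     # writeback miss (simplified)

    def read(self, block: int):
        # In passthrough mode every read is served from the origin.
        if self.mode != "passthrough" and block in self.cache:
            return self.cache[block]
        return self.origin.get(block)
```

After a write hit, only the writeback instance ends up with a dirty
block; the writethrough instance has identical copies on both devices,
and the passthrough instance no longer holds the block in the cache.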

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache.  This is useful when decommissioning or
shrinking a cache.  Shrinking the cache's fast device requires all cache
blocks, in the area of the cache being removed, to be clean.  If the
area being removed from the cache still contains dirty blocks the resize
will fail.  Care must be taken to never reduce the volume used for the
cache's fast device until the cache is clean.  This is of particular
importance if writeback mode is used.  Writethrough and passthrough
modes already maintain a clean cache.  Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.

Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time.  Currently we're not taking any
account of normal I/O traffic going to the devices.  More work is
needed here to avoid migrating during peak I/O moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 2048 sectors (1MB).
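
Since the threshold is given in sectors, a small helper can do the unit
conversion and assemble the ``dmsetup message`` command.  This is a
sketch: the helper name is made up, ``my_cache`` is the example device
name used elsewhere in this document, and a 512-byte sector is assumed.
The command is only printed, not executed.

```python
import shlex

SECTOR_BYTES = 512  # assumed sector size


def migration_threshold_message(cache_name: str, megabytes: int) -> list:
    """Build the argv for a migration_threshold message."""
    sectors = megabytes * 1024 * 1024 // SECTOR_BYTES
    return ["dmsetup", "message", cache_name, "0",
            "migration_threshold", str(sectors)]


if __name__ == "__main__":
    # Raise the limit from the default 1MB to 4MB (8192 sectors).
    print(shlex.join(migration_threshold_message("my_cache", 4)))
```

Passing 1 for ``megabytes`` yields 2048 sectors, which matches the
default stated above.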

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second.  This
means the cache behaves like a physical disk that has a volatile write
cache.  If power is lost you may lose some recent writes.  The metadata
should always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly.  So we treat it as a hint.  In normal
operation it will be written when the dm device is suspended.  If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block.  It's up to
the policy how big this chunk is, but it should be kept small.  Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  Refer to cache-policies.rst.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded.  A prime example of this is when mkfs discards the
whole block device.  We store a bitset tracking the discard state of
blocks.  However, we allow this bitset to have a different block size
from the cache blocks.  This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

  ::

   cache <metadata dev> <cache dev> <origin dev> <block size>
         <#feature args> [<feature arg>]*
         <policy> <#policy args> [policy args]*

 ================ =======================================================
 metadata dev     fast device holding the persistent metadata
 cache dev        fast device holding cached data blocks
 origin dev       slow device holding original data blocks
 block size       cache unit size in sectors

 #feature args    number of feature arguments passed
 feature args     writethrough or passthrough (The default is writeback.)

 policy           the replacement policy to use
 #policy args     an even number of arguments corresponding to
                  key/value pairs passed to the policy
 policy args      key/value pairs passed to the policy
                  E.g. 'sequential_threshold 1024'
                  See cache-policies.rst for details.
 ================ =======================================================

Optional feature arguments are:


   ==================== ========================================================
   writethrough         write through caching that prohibits cache block
                        content from being different from origin block content.
                        Without this argument, the default behaviour is to write
                        back cache block contents later for performance reasons,
                        so they may differ from the corresponding origin blocks.

   passthrough          a degraded mode useful for various cache coherency
                        situations (e.g., rolling back snapshots of
                        underlying storage).  Reads and writes always go to
                        the origin.  If a write goes to a cached origin
                        block, then the cache block is invalidated.
                        To enable passthrough mode the cache must be clean.

   metadata2            use version 2 of the metadata.  This stores the dirty
                        bits in a separate btree, which improves speed of
                        shutting down the cache.

   no_discard_passdown  disable passing down discards from the cache
                        to the origin's data device.
   ==================== ========================================================

A policy called 'default' is always registered.  This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.
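
The argument counts in the constructor line are easy to get wrong by
hand, so the sketch below assembles a table line from its parts.  The
helper name and its keyword arguments are invented for illustration; the
leading ``0 <length> cache`` fields follow the ``dmsetup create``
examples in the Examples section of this document.

```python
def cache_table(length_sectors, metadata_dev, cache_dev, origin_dev,
                block_sectors, features=(), policy="default",
                policy_args=None):
    """Return a dm table line for a cache target starting at sector 0."""
    policy_args = policy_args or {}
    fields = [
        "0", str(length_sectors), "cache",
        metadata_dev, cache_dev, origin_dev, str(block_sectors),
        str(len(features)), *features,          # <#feature args> <feature arg>*
        policy, str(2 * len(policy_args)),      # each key/value pair is 2 args
    ]
    for key, value in policy_args.items():
        fields += [key, str(value)]
    return " ".join(fields)


if __name__ == "__main__":
    print(cache_table(41943040, "/dev/mapper/metadata", "/dev/mapper/ssd",
                      "/dev/mapper/origin", 1024, features=["writeback"],
                      policy="mq",
                      policy_args={"sequential_threshold": 1024,
                                   "random_threshold": 8}))
```

Deriving ``<#policy args>`` from the number of key/value pairs keeps it
even, as the constructor requires.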

Status
------

::

  <metadata block size> <#used metadata blocks>/<#total metadata blocks>
  <cache block size> <#used cache blocks>/<#total cache blocks>
  <#read hits> <#read misses> <#write hits> <#write misses>
  <#demotions> <#promotions> <#dirty> <#feature args> <feature args>*
  <#core args> <core args>* <policy name> <#policy args> <policy args>*
  <cache metadata mode> <needs_check>


========================= =====================================================
metadata block size       Fixed block size for each metadata block in
                          sectors
#used metadata blocks     Number of metadata blocks used
#total metadata blocks    Total number of metadata blocks
cache block size          Configurable block size for the cache device
                          in sectors
#used cache blocks        Number of blocks resident in the cache
#total cache blocks       Total number of cache blocks
#read hits                Number of times a READ bio has been mapped
                          to the cache
#read misses              Number of times a READ bio has been mapped
                          to the origin
#write hits               Number of times a WRITE bio has been mapped
                          to the cache
#write misses             Number of times a WRITE bio has been
                          mapped to the origin
#demotions                Number of times a block has been removed
                          from the cache
#promotions               Number of times a block has been moved to
                          the cache
#dirty                    Number of blocks in the cache that differ
                          from the origin
#feature args             Number of feature args to follow
feature args              'writethrough' (optional)
#core args                Number of core arguments (must be even)
core args                 Key/value pairs for tuning the core
                          e.g. migration_threshold
policy name               Name of the policy
#policy args              Number of policy arguments to follow (must be even)
policy args               Key/value pairs e.g. sequential_threshold
cache metadata mode       ro if read-only, rw if read-write

                          In serious cases where even a read-only mode is
                          deemed unsafe no further I/O will be permitted and
                          the status will just contain the string 'Fail'.
                          The userspace recovery tools should then be used.
needs_check               'needs_check' if set, '-' if not set
                          A metadata operation has failed, resulting in the
                          needs_check flag being set in the metadata's
                          superblock.  The metadata device must be
                          deactivated and checked/repaired before the
                          cache can be made fully operational again.
                          '-' indicates needs_check is not set.
========================= =====================================================
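
Because three of the fields are variable-length lists preceded by a
count, the status line is easiest to consume token by token.  The parser
below is a sketch based only on the field order documented above; the
function name and dictionary keys are made up, the sample line is
invented rather than captured from a real device, and any leading
``<start> <length> cache`` fields printed by ``dmsetup status`` are
assumed to have been stripped already.

```python
def parse_cache_status(line: str) -> dict:
    """Parse the target-specific part of a cache status line."""
    if line.strip() == "Fail":
        return {"failed": True}
    tokens = iter(line.split())

    def take_args() -> list:
        # A count followed by that many tokens.
        count = int(next(tokens))
        return [next(tokens) for _ in range(count)]

    status = {"failed": False, "metadata_block_size": int(next(tokens))}
    used, total = next(tokens).split("/")
    status["metadata_blocks"] = (int(used), int(total))
    status["cache_block_size"] = int(next(tokens))
    used, total = next(tokens).split("/")
    status["cache_blocks"] = (int(used), int(total))
    for name in ("read_hits", "read_misses", "write_hits", "write_misses",
                 "demotions", "promotions", "dirty"):
        status[name] = int(next(tokens))
    status["features"] = take_args()
    core = take_args()
    status["core_args"] = dict(zip(core[0::2], core[1::2]))
    status["policy"] = next(tokens)
    policy = take_args()
    status["policy_args"] = dict(zip(policy[0::2], policy[1::2]))
    status["metadata_mode"] = next(tokens)
    # Tolerate a missing trailing field by treating it as '-'.
    status["needs_check"] = next(tokens, "-") == "needs_check"
    return status


if __name__ == "__main__":
    sample = ("8 72/4096 512 1000/8192 50 20 30 10 0 1000 5 1 writeback "
              "2 migration_threshold 2048 mq 2 sequential_threshold 1024 rw -")
    print(parse_cache_status(sample))
```

Argument values are kept as strings, since the text does not guarantee
that every tunable is numeric.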

Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  (A sysfs interface would also be possible.)

The message format is::

   <key> <value>

E.g.::

   dmsetup message my_cache 0 sequential_threshold 1024


Invalidation is removing an entry from the cache without writing it
back.  Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges.  Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9.  Each cblock must be expressed as a decimal
value.  In the future, a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches.  The cache must be in passthrough mode
when invalidate_cblocks is used::

   invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.::

   dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
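
The "one past the end" convention is the same half-open convention that
Python's ``range`` uses, which makes it easy to check how many cache
blocks a message will touch.  The parser below is an illustrative
sketch with a made-up name; it only interprets the argument syntax
described above and does not talk to the kernel.

```python
def parse_cblock_ranges(args: str) -> list:
    """Turn invalidate_cblocks arguments into half-open (begin, end) pairs."""
    ranges = []
    for token in args.split():
        if "-" in token:
            begin, end = (int(part) for part in token.split("-"))
        else:
            begin = int(token)
            end = begin + 1   # a single cblock covers exactly one block
        if end <= begin:
            raise ValueError(f"empty or reversed range: {token}")
        ranges.append((begin, end))
    return ranges


if __name__ == "__main__":
    ranges = parse_cblock_ranges("2345 3456-4567 5678-6789")
    print(ranges)
    # Total number of cache blocks named by the message.
    print(sum(end - begin for begin, end in ranges))
```

For the range 5-10, ``list(range(5, 10))`` gives the blocks 5 to 9,
matching the description above.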

Examples
========

The test suite can be found here:

https://github.com/jthornber/device-mapper-test-suite

::

  dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
          /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
  dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
          /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
          mq 4 sequential_threshold 1024 random_threshold 8'