=====
Cache
=====

Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (e.g., a spindle) by
dynamically migrating some of its data to a faster, smaller device
(e.g., an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool. Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used in the thin-provisioning
library.

The decision as to what data to migrate and when is left to a plug-in
policy module. Several of these have been written as we experiment,
and we hope other people will contribute others for specific I/O
scenarios (e.g., a VM image server).

Glossary
========

  Migration
               Movement of the primary copy of a logical block from one
               device to the other.
  Promotion
               Migration from slow device to fast device.
  Demotion
               Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness. This metadata device may only
   be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size. This block size
is configurable when you first create the cache. Typically we've been
using block sizes of 256KB - 1024KB. The block size must be between 64
sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).

Having a fixed block size simplifies the target a lot. But it is
something of a compromise.
For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space. And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

Cache operating modes
---------------------

The cache has three operating modes: writeback, writethrough and
passthrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices. Clean
blocks should remain clean.

If passthrough is selected, useful when the cache contents are not known
to be coherent with the origin device, then all reads are served from
the origin device (all reads miss the cache) and all writes are
forwarded to the origin device; additionally, write hits cause cache
block invalidates. To enable passthrough mode the cache must be clean.
Passthrough mode allows a cache device to be activated without having to
worry about coherency. Coherency that exists is maintained, although
the cache will gradually cool as writes take place.
If the coherency of
the cache can later be verified, or established through use of the
"invalidate_cblocks" message, the cache device can be transitioned to
writethrough or writeback mode while still warm. Otherwise, the cache
contents can be discarded prior to transitioning to the desired
operating mode.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache. This is useful for decommissioning a cache or
when shrinking a cache. Shrinking the cache's fast device requires all
cache blocks, in the area of the cache being removed, to be clean. If
the area being removed from the cache still contains dirty blocks the
resize will fail. Care must be taken to never reduce the volume used for
the cache's fast device until the cache is clean. This is of particular
importance if writeback mode is used. Writethrough and passthrough
modes already maintain a clean cache. Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.

Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time. Currently we're not taking any
account of normal I/O traffic going to the devices. More work is needed
here to avoid migrating during peak I/O moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 2048 sectors (1MB).

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the cache behaves like a physical disk that has a volatile write
cache. If power is lost you may lose some recent writes. The metadata
should always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly. So we treat it as a hint. In normal
operation it will be written when the dm device is suspended. If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block. It's up to
the policy how big this chunk is, but it should be kept small. Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

Policy hints affect performance, not correctness.
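
The crash behaviour described above for dirty flags and policy hints can
be modelled in a few lines. This is an illustrative Python sketch, not
kernel code: ``restore_block_state``, ``SAFE_HINT`` and the
``clean_shutdown`` flag are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

SAFE_HINT = 0  # hypothetical fallback hint; a real policy picks its own


@dataclass
class BlockState:
    dirty: bool
    hint: int


def restore_block_state(
    n_cblocks: int,
    saved_dirty: Optional[List[bool]],
    saved_hints: Optional[List[int]],
    clean_shutdown: bool,
) -> List[BlockState]:
    """Rebuild per-cache-block state when the cache is reloaded."""
    trusted = (
        clean_shutdown
        and saved_dirty is not None
        and saved_hints is not None
        and len(saved_dirty) == n_cblocks
        and len(saved_hints) == n_cblocks
    )
    if not trusted:
        # After a crash the saved values are only hints: assume every
        # block is dirty and fall back to the safe hint value.
        return [BlockState(dirty=True, hint=SAFE_HINT) for _ in range(n_cblocks)]
    # After a clean suspend the saved values can be used as written.
    return [BlockState(dirty=d, hint=h) for d, h in zip(saved_dirty, saved_hints)]
```

Treating every block as dirty after a crash costs extra write-back work
but never loses data, which is why the hints only affect performance.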

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded. A prime example of this is when mkfs discards the
whole block device. We store a bitset tracking the discard state of
blocks. However, we allow this bitset to have a different block size
from the cache blocks. This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).
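
To see why a separate discard block size helps, the sketch below counts
the bits needed to track a whole origin device and coarsens the discard
block size until the bitset stays under a cap. The doubling loop and the
``max_discard_blocks`` default are assumptions made for illustration; the
text above only states that the two block sizes may differ.

```python
def bits_needed(device_sectors: int, block_sectors: int) -> int:
    """Bits required to track a device at one bit per block (rounded up)."""
    if device_sectors < 0 or block_sectors <= 0:
        raise ValueError("sizes must be positive")
    return -(-device_sectors // block_sectors)  # ceiling division


def pick_discard_block_size(
    cache_block_sectors: int,
    origin_sectors: int,
    max_discard_blocks: int = 1 << 14,  # assumed cap, for illustration only
) -> int:
    """Double the discard block size until the bitset fits under the cap."""
    size = cache_block_sectors
    while bits_needed(origin_sectors, size) > max_discard_blocks:
        size *= 2
    return size


# A 41943040-sector origin with 512-sector cache blocks.
origin = 41943040
print(bits_needed(origin, 512))              # bits at cache-block resolution
print(pick_discard_block_size(512, origin))  # coarser discard block size
```

A coarser discard block means a discard is only recorded when it covers
a whole discard block, so the saving in bitset size is traded against
missing some smaller discards.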

Target interface
================

Constructor
-----------

  ::

   cache <metadata dev> <cache dev> <origin dev> <block size>
         <#feature args> [<feature arg>]*
         <policy> <#policy args> [policy args]*

 ================ =======================================================
 metadata dev     fast device holding the persistent metadata
 cache dev        fast device holding cached data blocks
 origin dev       slow device holding original data blocks
 block size       cache unit size in sectors

 #feature args    number of feature arguments passed
 feature args     writethrough or passthrough (The default is writeback.)

 policy           the replacement policy to use
 #policy args     an even number of arguments corresponding to
                  key/value pairs passed to the policy
 policy args      key/value pairs passed to the policy
                  E.g. 'sequential_threshold 1024'
                  See cache-policies.txt for details.
 ================ =======================================================

Optional feature arguments are:


   ==================== ========================================================
   writethrough         write through caching that prohibits cache block
                        content from being different from origin block content.
                        Without this argument, the default behaviour is to write
                        back cache block contents later for performance reasons,
                        so they may differ from the corresponding origin blocks.

   passthrough          a degraded mode useful for various cache coherency
                        situations (e.g., rolling back snapshots of
                        underlying storage). Reads and writes always go to
                        the origin. If a write goes to a cached origin
                        block, then the cache block is invalidated.
                        To enable passthrough mode the cache must be clean.

   metadata2            use version 2 of the metadata. This stores the dirty
                        bits in a separate btree, which improves speed of
                        shutting down the cache.

   no_discard_passdown  disable passing down discards from the cache
                        to the origin's data device.
   ==================== ========================================================

A policy called 'default' is always registered. This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.
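
The constructor layout above can be assembled programmatically. The
following Python sketch builds a table line and checks the block size
limits given under "Fixed block size". ``build_cache_table`` is a
hypothetical helper, not part of any dm tooling, and the accepted feature
names are taken from this document (``writeback`` is included because the
Examples section passes it).

```python
MIN_BLOCK_SECTORS = 64        # 32KB
MAX_BLOCK_SECTORS = 2097152   # 1GB

# Feature names taken from this document.
KNOWN_FEATURES = {
    "writeback", "writethrough", "passthrough",
    "metadata2", "no_discard_passdown",
}


def build_cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                      block_size, features=(), policy="default",
                      policy_args=None):
    """Return a dm table line for the cache target as a single string."""
    if not MIN_BLOCK_SECTORS <= block_size <= MAX_BLOCK_SECTORS:
        raise ValueError("block size out of range")
    if block_size % MIN_BLOCK_SECTORS:
        raise ValueError("block size must be a multiple of 64 sectors")
    unknown = set(features) - KNOWN_FEATURES
    if unknown:
        raise ValueError("unknown feature args: %s" % sorted(unknown))

    policy_args = dict(policy_args or {})
    fields = [str(start), str(length), "cache",
              metadata_dev, cache_dev, origin_dev, str(block_size),
              str(len(features)), *features,
              policy, str(2 * len(policy_args))]  # key/value pairs: even count
    for key, value in policy_args.items():
        fields += [key, str(value)]
    return " ".join(fields)


print(build_cache_table(0, 41943040, "/dev/mapper/metadata",
                        "/dev/mapper/ssd", "/dev/mapper/origin",
                        512, ["writeback"]))
```

The resulting string could be passed to ``dmsetup create <name> --table``;
naming the policy explicitly rather than relying on 'default' follows the
advice above.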

Status
------

::

  <metadata block size> <#used metadata blocks>/<#total metadata blocks>
  <cache block size> <#used cache blocks>/<#total cache blocks>
  <#read hits> <#read misses> <#write hits> <#write misses>
  <#demotions> <#promotions> <#dirty> <#features> <features>*
  <#core args> <core args>* <policy name> <#policy args> <policy args>*
  <cache metadata mode> <needs_check>


========================= =====================================================
metadata block size       Fixed block size for each metadata block in
                          sectors
#used metadata blocks     Number of metadata blocks used
#total metadata blocks    Total number of metadata blocks
cache block size          Configurable block size for the cache device
                          in sectors
#used cache blocks        Number of blocks resident in the cache
#total cache blocks       Total number of cache blocks
#read hits                Number of times a READ bio has been mapped
                          to the cache
#read misses              Number of times a READ bio has been mapped
                          to the origin
#write hits               Number of times a WRITE bio has been mapped
                          to the cache
#write misses             Number of times a WRITE bio has been
                          mapped to the origin
#demotions                Number of times a block has been removed
                          from the cache
#promotions               Number of times a block has been moved to
                          the cache
#dirty                    Number of blocks in the cache that differ
                          from the origin
#feature args             Number of feature args to follow
feature args              'writethrough' (optional)
#core args                Number of core arguments (must be even)
core args                 Key/value pairs for tuning the core
                          e.g. migration_threshold
policy name               Name of the policy
#policy args              Number of policy arguments to follow (must be even)
policy args               Key/value pairs e.g. sequential_threshold
cache metadata mode       ro if read-only, rw if read-write

                          In serious cases where even a read-only mode is
                          deemed unsafe no further I/O will be permitted and
                          the status will just contain the string 'Fail'.
                          The userspace recovery tools should then be used.
needs_check               'needs_check' if set, '-' if not set
                          A metadata operation has failed, resulting in the
                          needs_check flag being set in the metadata's
                          superblock. The metadata device must be
                          deactivated and checked/repaired before the
                          cache can be made fully operational again.
                          '-' indicates needs_check is not set.
========================= =====================================================

Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. (A sysfs interface would also be possible.)

The message format is::

   <key> <value>

E.g.::

   dmsetup message my_cache 0 sequential_threshold 1024


Invalidation is removing an entry from the cache without writing it
back. Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges. Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9. Each cblock must be expressed as a decimal
value. In the future a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches. The cache must be in passthrough mode
when invalidate_cblocks is used::

   invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.::

   dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

Examples
========

The test suite can be found here:

https://github.com/jthornber/device-mapper-test-suite

::

  dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
          /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
  dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
          /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
          mq 4 sequential_threshold 1024 random_threshold 8'
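
The "one past the end" convention used by invalidate_cblocks is easy to
get wrong, so here is a small Python sketch that expands the arguments
into half-open ranges. ``parse_cblock_ranges`` and ``count_cblocks`` are
hypothetical helpers for illustration; they mirror the syntax described
in the Messages section and do not talk to the kernel.

```python
def parse_cblock_ranges(args):
    """Turn invalidate_cblocks arguments into (begin, end) half-open pairs."""
    ranges = []
    for arg in args:
        begin_s, sep, end_s = arg.partition("-")
        # Only plain decimal digits are accepted, as the message requires.
        if not begin_s.isdigit() or (sep and not end_s.isdigit()):
            raise ValueError("not a decimal cblock or range: %r" % arg)
        begin = int(begin_s)
        # A single cblock N is the range N-(N+1); the end is exclusive.
        end = int(end_s) if sep else begin + 1
        if begin >= end:
            raise ValueError("empty or reversed range: %r" % arg)
        ranges.append((begin, end))
    return ranges


def count_cblocks(ranges):
    """Total number of cache blocks covered by the half-open ranges."""
    return sum(end - begin for begin, end in ranges)


ranges = parse_cblock_ranges(["2345", "3456-4567", "5678-6789"])
print(ranges)
print(count_cblocks(ranges))
```

With this convention the range 5-10 covers cblocks 5, 6, 7, 8 and 9, the
same as Python's ``range(5, 10)``.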