.. SPDX-License-Identifier: GPL-2.0-only

========
dm-clone
========

Introduction
============

dm-clone is a device mapper target which produces a one-to-one copy of an
existing, read-only source device into a writable destination device: It
presents a virtual block device which makes all data appear immediately, and
redirects reads and writes accordingly.

The main use case of dm-clone is to clone a potentially remote, high-latency,
read-only, archival-type block device into a writable, fast, primary-type device
for fast, low-latency I/O. The cloned device is visible/mountable immediately
and the copy of the source device to the destination device happens in the
background, in parallel with user I/O.

For example, one could restore an application backup from a read-only copy,
accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
etc.), into a local SSD or NVMe device, and start using the device immediately,
without waiting for the restore to complete.

When the cloning completes, the dm-clone table can be removed altogether and be
replaced, e.g., by a linear table, mapping directly to the destination device.

The dm-clone target reuses the metadata library used by the thin-provisioning
target.

Glossary
========

   Hydration
     The process of filling a region of the destination device with data from
     the same region of the source device, i.e., copying the region from the
     source to the destination device.

Once a region gets hydrated we redirect all I/O regarding it to the destination
device.

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with other
parameters detailed later):

1. A source device - the read-only device that gets cloned and is the source of
   the hydration.

2. A destination device - the destination of the hydration, which will become a
   clone of the source device.

3. A small metadata device - it records which regions are already valid in the
   destination device, i.e., which regions have already been hydrated, or have
   been written to directly, via user I/O.

The size of the destination device must be at least equal to the size of the
source device.

Regions
-------

dm-clone divides the source and destination devices into fixed-size regions.
Regions are the unit of hydration, i.e., the minimum amount of data copied from
the source to the destination device.
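
As a rough illustration of how a device splits into regions, the sketch below
checks a candidate region size and computes how many regions cover a device.
The helper names `valid_region_size` and `region_count` are hypothetical and
are not part of dmsetup or dm-clone. The size limits are the ones documented in
this section, and rounding a trailing partial region up to a full region is an
assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of dmsetup or dm-clone.

# Succeeds if $1 (in 512-byte sectors) is a power of two between
# 8 sectors (4KB) and 2097152 sectors (1GB).
valid_region_size() {
    size=$1
    [ "$size" -ge 8 ] && [ "$size" -le 2097152 ] || return 1
    # A power of two has exactly one bit set, so size & (size - 1) is 0.
    [ $(( size & (size - 1) )) -eq 0 ]
}

# Prints the number of regions needed to cover a device of $1 sectors using
# regions of $2 sectors. Assumption: a trailing partial region counts as one
# region, hence the rounding up.
region_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# A 1048576000-sector device with 8-sector (4KB) regions.
region_count 1048576000 8    # prints 131072000
```

With these numbers every region is 4KB, so the metadata has to track about 131
million regions; a larger region size reduces that count at the cost of a
coarser unit of hydration.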

The region size is configurable when you first create the dm-clone device. The
recommended region size is the same as the file system block size, which usually
is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors
(1GB) and a power of two.

Reads and writes from/to hydrated regions are serviced from the destination
device.

A read from a not-yet-hydrated region is serviced directly from the source
device.

A write to a not-yet-hydrated region is delayed until the corresponding region
has been hydrated; the hydration of that region starts immediately.

Note that a write request whose size equals the region size will skip copying of
the corresponding region from the source device and overwrite the region of the
destination device directly.

Discards
--------

dm-clone interprets a discard request to a range that hasn't been hydrated yet
as a hint to skip hydration of the regions covered by the request, i.e., it
skips copying the region's data from the source to the destination device, and
only updates its metadata.

If the destination device supports discards, then by default dm-clone will pass
down discard requests to it.
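
The routing rules described in the Regions and Discards sections can be
summarised as a small decision function. The sketch below is a hypothetical
shell model, `route_io`, that only prints the action taken for a request
confined to a single region; it does not touch any device, the action names are
made up for illustration, and it is not how the kernel code is structured.

```shell
#!/bin/sh
# Hypothetical model of the routing rules; it does not touch any device.

# route_io <op> <hydrated> <io_sectors> <region_sectors>
#   op:       read, write or discard
#   hydrated: 1 if the target region is already hydrated, 0 otherwise
# Prints the action this model takes for a request confined to one region.
route_io() {
    op=$1
    hydrated=$2
    io_sectors=$3
    region_sectors=$4

    if [ "$hydrated" -eq 1 ]; then
        # Hydrated regions are serviced from the destination device
        # (discards are passed down only if the device supports them).
        echo "destination"
        return 0
    fi

    case $op in
    read)
        # Not yet hydrated: read straight from the source device.
        echo "source" ;;
    write)
        if [ "$io_sectors" -eq "$region_sectors" ]; then
            # The write replaces the whole region, so copying is skipped.
            echo "overwrite-destination"
        else
            # Partial write: hydrate the region first, then apply the write.
            echo "hydrate-then-write"
        fi ;;
    discard)
        # Hint to skip copying this region; only the metadata is updated.
        echo "skip-hydration" ;;
    *)
        echo "unknown operation: $op" >&2
        return 1 ;;
    esac
}

route_io write 0 4 8    # prints hydrate-then-write
```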

Background Hydration
--------------------

dm-clone copies continuously from the source to the destination device, until
all of the device has been copied.

Copying data from the source to the destination device uses bandwidth. The user
can set a throttle to prevent more than a certain amount of copying occurring at
any one time. Moreover, dm-clone takes into account user I/O traffic going to
the devices and pauses the background hydration when there is I/O in-flight.

A message `hydration_threshold <#regions>` can be used to set the maximum number
of regions being copied, the default being 1 region.

dm-clone employs dm-kcopyd for copying portions of the source device to the
destination device. By default, we issue copy requests of size equal to the
region size. A message `hydration_batch_size <#regions>` can be used to tune the
size of these copy requests. Increasing the hydration batch size results in
dm-clone trying to batch together contiguous regions, so we copy the data in
batches of this many regions.

When the hydration of the destination device finishes, a dm event will be sent
to user space.

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written. If no
such requests are made then commits will occur every second.
This means the dm-clone device behaves like a physical disk that has a volatile
write cache. If power is lost you may lose some recent writes. The metadata
should always be consistent in spite of any crash.

Target Interface
================

Constructor
-----------

  ::

   clone <metadata dev> <destination dev> <source dev> <region size>
         [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]]

 ================ ==============================================================
 metadata dev     Fast device holding the persistent metadata
 destination dev  The destination device, where the source will be cloned
 source dev       Read-only device containing the data that gets cloned
 region size      The size of a region in sectors

 #feature args    Number of feature arguments passed
 feature args     no_hydration or no_discard_passdown

 #core args       An even number of arguments corresponding to key/value pairs
                  passed to dm-clone
 core args        Key/value pairs passed to dm-clone, e.g. `hydration_threshold
                  256`
 ================ ==============================================================

Optional feature arguments are:

 ==================== =========================================================
 no_hydration         Create a dm-clone instance with background hydration
                      disabled
 no_discard_passdown  Disable passing down discards to the destination device
 ==================== =========================================================

Optional core arguments are:

 ================================ ==============================================
 hydration_threshold <#regions>   Maximum number of regions being copied from
                                  the source to the destination device at any
                                  one time, during background hydration.
 hydration_batch_size <#regions>  During background hydration, try to batch
                                  together contiguous regions, so we copy data
                                  from the source to the destination device in
                                  batches of this many regions.
 ================================ ==============================================

Status
------

  ::

   <metadata block size> <#used metadata blocks>/<#total metadata blocks>
   <region size> <#hydrated regions>/<#total regions> <#hydrating regions>
   <#feature args> <feature args>* <#core args> <core args>*
   <clone metadata mode>

 ======================= =======================================================
 metadata block size     Fixed block size for each metadata block in sectors
 #used metadata blocks   Number of metadata blocks used
 #total metadata blocks  Total number of metadata blocks
 region size             Configurable region size for the device in sectors
 #hydrated regions       Number of regions that have finished hydrating
 #total regions          Total number of regions to hydrate
 #hydrating regions      Number of regions currently hydrating
 #feature args           Number of feature arguments to follow
 feature args            Feature arguments, e.g. `no_hydration`
 #core args              Even number of core arguments to follow
 core args               Key/value pairs for tuning the core, e.g.
                         `hydration_threshold 256`
 clone metadata mode     ro if read-only, rw if read-write

                         In serious cases where even a read-only mode is deemed
                         unsafe no further I/O will be permitted and the status
                         will just contain the string 'Fail'. If the metadata
                         mode changes, a dm event will be sent to user space.
 ======================= =======================================================

Messages
--------

  `disable_hydration`
      Disable the background hydration of the destination device.

  `enable_hydration`
      Enable the background hydration of the destination device.

  `hydration_threshold <#regions>`
      Set background hydration threshold.

  `hydration_batch_size <#regions>`
      Set background hydration batch size.

Examples
========

Clone a device containing a file system
---------------------------------------

1. Create the dm-clone device.

   ::

    dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \
      $source_dev 8 1 no_hydration"

2. Mount the device and trim the file system. dm-clone interprets the discards
   sent by the file system and it will not hydrate the unused space.

   ::

    mount /dev/mapper/clone /mnt/cloned-fs
    fstrim /mnt/cloned-fs

3. Enable background hydration of the destination device.

   ::

    dmsetup message clone 0 enable_hydration

4. When the hydration finishes, we can replace the dm-clone table with a linear
   table.

   ::

    dmsetup suspend clone
    dmsetup load clone --table "0 1048576000 linear $dest_dev 0"
    dmsetup resume clone

   The metadata device is no longer needed and can be safely discarded or reused
   for other purposes.

Known issues
============

1. We redirect reads, to not-yet-hydrated regions, to the source device. If
   reading the source device has high latency and the user repeatedly reads from
   the same regions, this behaviour could degrade performance. We should use
   these reads as hints to hydrate the relevant regions sooner. Currently, we
   rely on the page cache to cache these regions, so we hopefully don't end up
   reading them multiple times from the source device.

2. We should release in-core resources, i.e., the bitmaps tracking which regions
   are hydrated, after the hydration has finished.

3. During background hydration, if we fail to read the source or write to the
   destination device, we print an error message, but the hydration process
   continues indefinitely, until it succeeds. We should stop the background
   hydration after a number of failures and emit a dm event for user space to
   notice.

Why not...?
===========

We explored the following alternatives before implementing dm-clone:

1. Use dm-cache with cache size equal to the source device and implement a new
   cloning policy:

   * The resulting cache device is not a one-to-one mirror of the source device
     and thus we cannot remove the cache device once cloning completes.

   * dm-cache writes to the source device, which violates our requirement that
     the source device must be treated as read-only.

   * Caching is semantically different from cloning.

2. Use dm-snapshot with a COW device equal to the source device:

   * dm-snapshot stores its metadata in the COW device, so the resulting device
     is not a one-to-one mirror of the source device.

   * No background copying mechanism.

   * dm-snapshot needs to commit its metadata whenever a pending exception
     completes, to ensure snapshot consistency. In the case of cloning, we don't
     need to be so strict and can rely on committing metadata every time a FLUSH
     or FUA bio is written, or periodically, like dm-thin and dm-cache do. This
     improves the performance significantly.

3. Use dm-mirror: The mirror target has a background copying/mirroring
   mechanism, but it writes to all mirrors, thus violating our requirement that
   the source device must be treated as read-only.

4. Use dm-thin's external snapshot functionality. This approach is the most
   promising among all alternatives, as the thinly-provisioned volume is a
   one-to-one mirror of the source device and handles reads and writes to
   un-provisioned/not-yet-cloned areas the same way as dm-clone does.

   Still:

   * There is no background copying mechanism, though one could be implemented.

   * Most importantly, we want to support arbitrary block devices as the
     destination of the cloning process and not restrict ourselves to
     thinly-provisioned volumes. Thin-provisioning has an inherent metadata
     overhead, for maintaining the thin volume mappings, which significantly
     degrades performance.

   Moreover, cloning a device shouldn't force the use of thin-provisioning. On
   the other hand, if we wish to use thin provisioning, we can just use a thin
   LV as dm-clone's destination device.