RAID arrays
===========

Boot time assembly of RAID arrays
---------------------------------

Tools that manage md devices can be found at
   https://www.kernel.org/pub/linux/utils/raid/


You can boot with your md device using one of the following kernel
command lines:

for old raid arrays without persistent superblocks::

  md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn

for raid arrays with persistent superblocks::

  md=<md device no.>,dev0,dev1,...,devn

or, to assemble a partitionable array::

  md=d<md device no.>,dev0,dev1,...,devn

``md device no.``
+++++++++++++++++

The number of the md device

================= =========
``md device no.`` device
================= =========
0                 md0
1                 md1
2                 md2
3                 md3
4                 md4
================= =========

``raid level``
++++++++++++++

level of the RAID array

=============== =============
``raid level``  level
=============== =============
-1              linear mode
0               striped mode
=============== =============

other modes are only supported with persistent super blocks

``chunk size factor``
+++++++++++++++++++++

(raid-0 and raid-1 only)

Set the chunk size as 4k << n.

``fault level``
+++++++++++++++

Totally ignored

``dev0`` to ``devn``
++++++++++++++++++++

e.g. ``/dev/hda1``, ``/dev/hdc1``, ``/dev/sda1``, ``/dev/sdb1``

A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>) looks like this::

	e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro


Boot time autodetection of RAID arrays
--------------------------------------

When md is compiled into the kernel (not as module), partitions of
type 0xfd are scanned and automatically assembled into RAID arrays.
This autodetection may be suppressed with the kernel parameter
``raid=noautodetect``.  As of kernel 2.6.9, only drives with a type 0
superblock can be autodetected and run at boot time.

The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
that all auto-detected arrays are assembled as partitionable.
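
The ``md=`` forms described under "Boot time assembly of RAID arrays" can be
illustrated with a small parser.  This is only a sketch in Python, not the
kernel's own parsing code: the names ``MdBootSpec`` and ``parse_md_param``
are made up for this example, and recognising the legacy form by checking
whether the second field is an integer is an assumption of the sketch.

.. code-block:: python

   from dataclasses import dataclass
   from typing import List, Optional


   @dataclass
   class MdBootSpec:
       """Hypothetical container for one parsed ``md=`` parameter."""
       minor: int                   # <md device no.>
       partitionable: bool          # True for the "md=d<n>,..." form
       level: Optional[int]         # -1 linear, 0 striped; None if not given
       chunk_bytes: Optional[int]   # 4k << <chunk size factor>; None if not given
       devices: List[str]


   def _is_int(text: str) -> bool:
       """Return True if ``text`` parses as a decimal integer."""
       try:
           int(text)
       except ValueError:
           return False
       return True


   def parse_md_param(value: str) -> MdBootSpec:
       """Parse the text that follows "md=" on the kernel command line."""
       fields = value.split(",")
       head, rest = fields[0], fields[1:]
       partitionable = head.startswith("d")
       minor = int(head[1:] if partitionable else head)

       level = chunk_bytes = None
       # Assumption of this sketch: a numeric second field marks the legacy
       # form, which carries three numeric fields before the device list.
       if rest and _is_int(rest[0]):
           if len(rest) < 3:
               raise ValueError("legacy form needs level, chunk factor, fault level")
           level = int(rest[0])
           chunk_bytes = 4096 << int(rest[1])
           # rest[2] is the fault level, which is ignored.
           rest = rest[3:]
       return MdBootSpec(minor, partitionable, level, chunk_bytes, rest)

Applied to the loadlin example above, ``parse_md_param("0,0,4,0,/dev/hdb2,/dev/hdc3")``
yields md device 0 in striped mode with two component devices and a chunk
size of 4k << 4, i.e. 64k.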

Boot time assembly of degraded/dirty arrays
-------------------------------------------

If a raid5 or raid6 array is both dirty and degraded, it could have
undetectable data corruption.  Because the array is ``dirty``, the
parity cannot be trusted, and because it is degraded, some data blocks
are missing and cannot reliably be reconstructed (there is no
trustworthy parity to rebuild them from).

For this reason, md will normally refuse to start such an array.  This
requires the sysadmin to take action to explicitly start the array
despite possible corruption.  This is normally done with::

   mdadm --assemble --force ....

This option is not really available if the array has the root
filesystem on it.  In order to support booting from such an
array, md supports a module parameter ``start_dirty_degraded`` which,
when set to 1, bypasses the checks and allows dirty degraded
arrays to be started.

So, to boot with a root filesystem of a dirty degraded raid 5 or 6, use::

   md-mod.start_dirty_degraded=1


Superblock formats
------------------

The md driver can support a variety of different superblock formats.
Currently, it supports superblock formats ``0.90.0`` and the ``md-1`` format
introduced in the 2.5 development series.

The kernel will autodetect which format superblock is being used.

Superblock format ``0`` is treated differently to others for legacy
reasons - it is the original superblock format.


General Rules - apply for all superblock formats
------------------------------------------------

An array is ``created`` by writing appropriate superblocks to all
devices.

It is ``assembled`` by associating each of these devices with a
particular md virtual device.  Once it is completely assembled, it can
be accessed.

An array should be created by a user-space tool.  This will write
superblocks to all devices.  It will usually mark the array as
``unclean``, or with some devices missing so that the kernel md driver
can create appropriate redundancy (copying in raid 1, parity
calculation in raid 4/5).

When an array is assembled, it is first initialized with the
SET_ARRAY_INFO ioctl.  This contains, in particular, a major and minor
version number.  The major version number selects which superblock
format is to be used.  The minor number might be used to tune handling
of the format, such as suggesting where on each device to look for the
superblock.

Then each device is added using the ADD_NEW_DISK ioctl.  This
provides, in particular, a major and minor number identifying the
device to add.
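
The major and minor numbers mentioned for ADD_NEW_DISK are the ordinary
block device numbers of the component device.  As a minimal sketch (the
helper names ``split_dev`` and ``device_numbers`` are invented for this
example and are not part of any md tool), they can be looked up from a
device node in Python as follows:

.. code-block:: python

   import os
   import stat
   from typing import Tuple


   def split_dev(rdev: int) -> Tuple[int, int]:
       """Split a combined device number into (major, minor)."""
       return os.major(rdev), os.minor(rdev)


   def device_numbers(path: str) -> Tuple[int, int]:
       """Return (major, minor) for the block device node at ``path``."""
       st = os.stat(path)
       if not stat.S_ISBLK(st.st_mode):
           raise ValueError(f"{path} is not a block device")
       return split_dev(st.st_rdev)

This sketch only obtains the numbers; it does not issue any ioctl.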

The array is started with the RUN_ARRAY ioctl.

Once started, new devices can be added.  They should have an
appropriate superblock written to them, and then be passed in with
ADD_NEW_DISK.

Devices that have failed or are not yet active can be detached from an
array using HOT_REMOVE_DISK.


Specific Rules that apply to format-0 super block arrays, and arrays with no superblock (non-persistent)
--------------------------------------------------------------------------------------------------------------

An array can be ``created`` by describing the array (level, chunksize
etc) in a SET_ARRAY_INFO ioctl.  This must have ``major_version==0`` and
``raid_disks != 0``.

Then uninitialized devices can be added with ADD_NEW_DISK.  The
structure passed to ADD_NEW_DISK must specify the state of the device
and its role in the array.

Once started with RUN_ARRAY, uninitialized spares can be added with
HOT_ADD_DISK.


MD devices in sysfs
-------------------

md devices appear in sysfs (``/sys``) as regular block devices,
e.g.::

   /sys/block/md0

Each ``md`` device will contain a subdirectory called ``md`` which
contains further md-specific information about the device.
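
Since every attribute is a small text file under the ``md`` subdirectory,
reading one takes only a few lines.  The following is a minimal sketch in
Python; the helper names ``md_attr_path`` and ``read_md_attr`` and the
``sysfs_root`` parameter are assumptions made for this example (the latter
exists only so the code can be exercised against a scratch directory).

.. code-block:: python

   from pathlib import Path


   def md_attr_path(array: str, attr: str, sysfs_root: str = "/sys") -> Path:
       """Build the path of an md attribute, e.g. /sys/block/md0/md/level."""
       return Path(sysfs_root) / "block" / array / "md" / attr


   def read_md_attr(array: str, attr: str, sysfs_root: str = "/sys") -> str:
       """Return the attribute's contents with surrounding whitespace removed."""
       return md_attr_path(array, attr, sysfs_root).read_text().strip()

On a system with an assembled ``md0``, ``read_md_attr("md0", "level")`` would
return the contents of the ``level`` file described below.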

All md devices contain:

  level
     a text file indicating the ``raid level``, e.g. raid0, raid1,
     raid5, linear, multipath, faulty.
     If no raid level has been set yet (array is still being
     assembled), the value will reflect whatever has been written
     to it, which may be a name like the above, or may be a number
     such as ``0``, ``5``, etc.

  raid_disks
     a text file with a simple number indicating the number of devices
     in a fully functional array.  If this is not yet known, the file
     will be empty.  If an array is being resized this will contain
     the new number of devices.
     Some raid levels allow this value to be set while the array is
     active.  This will reconfigure the array.  Otherwise it can only
     be set while assembling an array.
     A change to this attribute will not be permitted if it would
     reduce the size of the array.  To reduce the number of drives
     in e.g. a raid5, the array size must first be reduced by
     setting the ``array_size`` attribute.

  chunk_size
     This is the size in bytes for ``chunks`` and is only relevant to
     raid levels that involve striping (0,4,5,6,10).  The address space
     of the array is conceptually divided into chunks and consecutive
     chunks are striped onto neighbouring devices.
     The size should be at least PAGE_SIZE (4k) and should be a power
     of 2.  This can only be set while assembling an array.

  layout
     The ``layout`` for the array for the particular level.  This is
     simply a number that is interpreted differently by different
     levels.  It can be written while assembling an array.

  array_size
     This can be used to artificially constrain the available space in
     the array to be less than is actually available on the combined
     devices.  Writing a number (in kilobytes) which is less than
     the available size will set the size.  Any reconfiguration of the
     array (e.g. adding devices) will not cause the size to change.
     Writing the word ``default`` will cause the effective size of the
     array to be whatever size is actually available based on
     ``level``, ``chunk_size`` and ``component_size``.

     This can be used to reduce the size of the array before reducing
     the number of devices in a raid4/5/6, or to support external
     metadata formats which mandate such clipping.

  reshape_position
     This is either ``none`` or a sector number within the devices of
     the array where ``reshape`` is up to.  If this is set, the three
     attributes mentioned above (raid_disks, chunk_size, layout) can
     potentially have 2 values, an old and a new value.  If these
     values differ, reading the attribute returns::

          new (old)

     and writing sets the ``new`` value, leaving the ``old`` value
     unchanged.

  component_size
     For arrays with data redundancy (i.e. not raid0, linear, faulty,
     multipath), all components must be the same size - or at least
     there must be a size that they all provide space for.  This is a
     key part of the geometry of the array.  It is measured in sectors
     and can be read from here.  Writing to this value may resize
     the array if the personality supports it (raid1, raid5, raid6),
     and if the component drives are large enough.

  metadata_version
     This indicates the format that is being used to record metadata
     about the array.  It can be 0.90 (traditional format), 1.0, 1.1,
     1.2 (newer format in varying locations) or ``none`` indicating that
     the kernel isn't managing metadata at all.
     Alternatively it can be ``external:`` followed by a string which
     is set by user-space.  This indicates that metadata is managed
     by a user-space program.  Any device failure or other event that
     requires a metadata update will cause array activity to be
     suspended until the event is acknowledged.

  resync_start
     The point at which resync should start.  If no resync is needed,
     this will be a very large number (or ``none`` since 2.6.30-rc1).  At
     array creation it will default to 0, though starting the array as
     ``clean`` will set it much larger.

  new_dev
     This file can be written but not read.  The value written should
     be a block device number as major:minor, e.g. 8:0.
     This will cause that device to be attached to the array, if it is
     available.  It will then appear at md/dev-XXX (depending on the
     name of the device) and further configuration is then possible.

  safe_mode_delay
     When an md array has seen no write requests for a certain period
     of time, it will be marked as ``clean``.  When another write
     request arrives, the array is marked as ``dirty`` before the write
     commences.  This is known as ``safe_mode``.
     The ``certain period`` is controlled by this file which stores the
     period as a number of seconds.  The default is 200msec (0.200).
     Writing a value of 0 disables safemode.

  array_state
     This file contains a single word which describes the current
     state of the array.  In many cases, the state can be set by
     writing the word for the desired state, however some states
     cannot be explicitly set, and some transitions are not allowed.

     Select/poll works on this file.  All changes except between
     active-idle and active (which can be frequent and are not
     very interesting) are notified.  active->active-idle is
     reported if the metadata is externally managed.

     clear
         No devices, no size, no level

         Writing is equivalent to STOP_ARRAY ioctl

     inactive
         May have some settings, but array is not active;
         all IO results in error

         When written, doesn't tear down array, but just stops it

     suspended (not supported yet)
         All IO requests will block.  The array can be reconfigured.

         Writing this, if accepted, will block until array is quiescent

     readonly
         no resync can happen.  no superblocks get written.

         Write requests fail

     read-auto
         like readonly, but behaves like ``clean`` on a write request.

     clean
         no pending writes, but otherwise active.

         When written to inactive array, starts without resync

         If a write request arrives then:

         - if metadata is known, mark ``dirty`` and switch to ``active``;
         - if not known, block and switch to write-pending.

         If written to an active array that has pending writes, then fails.

     active
         fully active: IO and resync can be happening.
         When written to inactive array, starts with resync

     write-pending
         clean, but writes are blocked waiting for ``active`` to be written.

     active-idle
         like active, but no writes have been seen for a while (safe_mode_delay).

  bitmap/location
     This indicates where the write-intent bitmap for the array is
     stored.

     It can be one of ``none``, ``file`` or ``[+-]N``.
     ``file`` may later be extended to ``file:/file/name``.
     ``[+-]N`` means that many sectors from the start of the metadata.

     This is replicated on all devices.  For arrays with externally
     managed metadata, the offset is from the beginning of the
     device.

  bitmap/chunksize
     The size, in bytes, of the chunk which will be represented by a
     single bit.  For RAID456, it is a portion of an individual
     device.  For RAID10, it is a portion of the array.  For RAID1, it
     is both (they come to the same thing).

  bitmap/time_base
     The time, in seconds, between looking for bits in the bitmap to
     be cleared.  In the current implementation, a bit will be cleared
     between 2 and 3 times ``time_base`` after all the covered blocks
     are known to be in-sync.

  bitmap/backlog
     When write-mostly devices are active in a RAID1, write requests
     to those devices proceed in the background - the filesystem (or
     other user of the device) does not have to wait for them.
     ``backlog`` sets a limit on the number of concurrent background
     writes.  If there are more than this, new writes will be
     synchronous.

  bitmap/metadata
     This can be either ``internal`` or ``external``.

     ``internal``
       is the default and means the metadata for the bitmap
       is stored in the first 256 bytes of the allocated space and is
       managed by the md module.

     ``external``
       means that bitmap metadata is managed externally to
       the kernel (i.e. by some userspace program)

  bitmap/can_clear
     This is either ``true`` or ``false``.  If ``true``, then bits in the
     bitmap will be cleared when the corresponding blocks are thought
     to be in-sync.  If ``false``, bits will never be cleared.
     This is automatically set to ``false`` if a write happens on a
     degraded array, or if the array becomes degraded during a write.
     When metadata is managed externally, it should be set to true
     once the array becomes non-degraded, and this fact has been
     recorded in the metadata.

  consistency_policy
     This indicates how the array maintains consistency in case of unexpected
     shutdown.  It can be:

     none
       Array has no redundancy information, e.g. raid0, linear.

     resync
       Full resync is performed and all redundancy is regenerated when the
       array is started after unclean shutdown.

     bitmap
       Resync assisted by a write-intent bitmap.

     journal
       For raid4/5/6, journal device is used to log transactions and replay
       after unclean shutdown.

     ppl
       For raid5 only, Partial Parity Log is used to close the write hole and
       eliminate resync.

     The accepted values when writing to this file are ``ppl`` and ``resync``,
     used to enable and disable PPL.

  uuid
     This indicates the UUID of the array in the following format:
     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx


As component devices are added to an md array, they appear in the ``md``
directory as new directories named::

      dev-XXX

where ``XXX`` is a name that the kernel knows for the device, e.g. hdb1.
Each directory contains:

  block
     a symlink to the block device in /sys/block, e.g.::

          /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1

  super
     A file containing an image of the superblock read from, or
     written to, that device.

  state
     A file recording the current state of the device in the array
     which can be a comma separated list of:

     faulty
         device has been kicked from active use due to
         a detected fault, or it has unacknowledged bad
         blocks

     in_sync
         device is a fully in-sync member of the array

     writemostly
         device will only be subject to read
         requests if there are no other options.

         This applies only to raid1 arrays.

     blocked
         device has failed, and the failure hasn't been
         acknowledged yet by the metadata handler.

         Writes that would write to this device if
         it were not faulty are blocked.

     spare
         device is working, but not a full member.

         This includes spares that are in the process
         of being recovered to.

     write_error
         device has seen a write error at some point.

     want_replacement
         device is (mostly) working but probably
         should be replaced, either due to errors or
         due to user request.

     replacement
         device is a replacement for another active
         device with same raid_disk.


     This list may grow in future.

     This can be written to.

     Writing ``faulty`` simulates a failure on the device.

     Writing ``remove`` removes the device from the array.

     Writing ``writemostly`` sets the writemostly flag.

     Writing ``-writemostly`` clears the writemostly flag.

     Writing ``blocked`` sets the ``blocked`` flag.

     Writing ``-blocked`` clears the ``blocked`` flag and allows writes
     to complete and possibly simulates an error.

     Writing ``in_sync`` sets the in_sync flag.

     Writing ``write_error`` sets the writeerrorseen flag.

     Writing ``-write_error`` clears the writeerrorseen flag.

     Writing ``want_replacement`` is allowed at any time except to a
     replacement device or a spare.  It sets the flag.

     Writing ``-want_replacement`` is allowed at any time.  It clears
     the flag.

     Writing ``replacement`` or ``-replacement`` is only allowed before
     starting the array.  It sets or clears the flag.


     This file responds to select/poll.  Any change to ``faulty``
     or ``blocked`` causes an event.
530*4882a593Smuzhiyun 531*4882a593Smuzhiyun errors 532*4882a593Smuzhiyun An approximate count of read errors that have been detected on 533*4882a593Smuzhiyun this device but have not caused the device to be evicted from 534*4882a593Smuzhiyun the array (either because they were corrected or because they 535*4882a593Smuzhiyun happened while the array was read-only). When using version-1 536*4882a593Smuzhiyun metadata, this value persists across restarts of the array. 537*4882a593Smuzhiyun 538*4882a593Smuzhiyun This value can be written while assembling an array thus 539*4882a593Smuzhiyun providing an ongoing count for arrays with metadata managed by 540*4882a593Smuzhiyun userspace. 541*4882a593Smuzhiyun 542*4882a593Smuzhiyun slot 543*4882a593Smuzhiyun This gives the role that the device has in the array. It will 544*4882a593Smuzhiyun either be ``none`` if the device is not active in the array 545*4882a593Smuzhiyun (i.e. is a spare or has failed) or an integer less than the 546*4882a593Smuzhiyun ``raid_disks`` number for the array indicating which position 547*4882a593Smuzhiyun it currently fills. This can only be set while assembling an 548*4882a593Smuzhiyun array. A device for which this is set is assumed to be working. 549*4882a593Smuzhiyun 550*4882a593Smuzhiyun offset 551*4882a593Smuzhiyun This gives the location in the device (in sectors from the 552*4882a593Smuzhiyun start) where data from the array will be stored. Any part of 553*4882a593Smuzhiyun the device before this offset is not touched, unless it is 554*4882a593Smuzhiyun used for storing metadata (Formats 1.1 and 1.2). 555*4882a593Smuzhiyun 556*4882a593Smuzhiyun size 557*4882a593Smuzhiyun The amount of the device, after the offset, that can be used 558*4882a593Smuzhiyun for storage of data. This will normally be the same as the 559*4882a593Smuzhiyun component_size. This can be written while assembling an 560*4882a593Smuzhiyun array. 
If a value less than the current component_size is 561*4882a593Smuzhiyun written, it will be rejected. 562*4882a593Smuzhiyun 563*4882a593Smuzhiyun recovery_start 564*4882a593Smuzhiyun When the device is not ``in_sync``, this records the number of 565*4882a593Smuzhiyun sectors from the start of the device which are known to be 566*4882a593Smuzhiyun correct. This is normally zero, but during a recovery 567*4882a593Smuzhiyun operation it will steadily increase, and if the recovery is 568*4882a593Smuzhiyun interrupted, restoring this value can cause recovery to 569*4882a593Smuzhiyun avoid repeating the earlier blocks. With v1.x metadata, this 570*4882a593Smuzhiyun value is saved and restored automatically. 571*4882a593Smuzhiyun 572*4882a593Smuzhiyun This can be set whenever the device is not an active member of 573*4882a593Smuzhiyun the array, either before the array is activated, or before 574*4882a593Smuzhiyun the ``slot`` is set. 575*4882a593Smuzhiyun 576*4882a593Smuzhiyun Setting this to ``none`` is equivalent to setting ``in_sync``. 577*4882a593Smuzhiyun Setting to any other value also clears the ``in_sync`` flag. 578*4882a593Smuzhiyun 579*4882a593Smuzhiyun bad_blocks 580*4882a593Smuzhiyun This gives the list of all known bad blocks in the form of 581*4882a593Smuzhiyun start address and length (in sectors respectively). If output 582*4882a593Smuzhiyun is too big to fit in a page, it will be truncated. Writing 583*4882a593Smuzhiyun ``sector length`` to this file adds new acknowledged (i.e. 584*4882a593Smuzhiyun recorded to disk safely) bad blocks. 585*4882a593Smuzhiyun 586*4882a593Smuzhiyun unacknowledged_bad_blocks 587*4882a593Smuzhiyun This gives the list of known-but-not-yet-saved-to-disk bad 588*4882a593Smuzhiyun blocks in the same form of ``bad_blocks``. If output is too big 589*4882a593Smuzhiyun to fit in a page, it will be truncated. Writing to this file 590*4882a593Smuzhiyun adds bad blocks without acknowledging them. 
        This is largely for testing.

      ppl_sector, ppl_size
        Location and size (in sectors) of the space used for Partial Parity Log
        on this device.


An active md device will also contain an entry for each active device
in the array. These are named::

    rdNN

where ``NN`` is the position in the array, starting from 0.
So for a 3 drive array there will be rd0, rd1, rd2.
These are symbolic links to the appropriate ``dev-XXX`` entry.
Thus, for example::

    cat /sys/block/md*/md/rd*/state

will show ``in_sync`` on every line.



Active md devices for levels that support data redundancy (1,4,5,6,10)
also have

      sync_action
        a text file that can be used to monitor and control the rebuild
        process. It contains one word which can be one of:

          resync
            redundancy is being recalculated after unclean
            shutdown or creation

          recover
            a hot spare is being built to replace a
            failed/missing device

          idle
            nothing is happening

          check
            A full check of redundancy was requested and is
            happening. This reads all blocks and checks
            them.
            A repair may also happen for some raid
            levels.

          repair
            A full check and repair is happening. This is
            similar to ``resync``, but was requested by the
            user, and the write-intent bitmap is NOT used to
            optimise the process.

        This file is writable, and each of the strings that could be
        read are meaningful for writing.

        ``idle`` will stop an active resync/recovery etc. There is no
        guarantee that another resync/recovery may not be automatically
        started again, though some event will be needed to trigger
        this.

        ``resync`` or ``recover`` can be used to restart the
        corresponding operation if it was stopped with ``idle``.

        ``check`` and ``repair`` will start the appropriate process
        providing the current state is ``idle``.

        This file responds to select/poll. Any important change in the value
        triggers a poll event. Sometimes the value will briefly be
        ``recover`` if a recovery seems to be needed, but cannot be
        achieved. In that case, the transition to ``recover`` isn't
        notified, but the transition away is.

      degraded
        This contains a count of the number of devices by which the
        array is degraded. So an optimal array will show ``0``. A
        single failed/missing drive will show ``1``, etc.

        This file responds to select/poll. Any increase or decrease
        in the count of missing devices will trigger an event.

      mismatch_cnt
        When performing ``check`` and ``repair``, and possibly when
        performing ``resync``, md will count the number of errors that are
        found. The count in ``mismatch_cnt`` is the number of sectors
        that were re-written, or (for ``check``) would have been
        re-written. As most raid levels work in units of pages rather
        than sectors, this may be larger than the number of actual errors
        by a factor of the number of sectors in a page.

      bitmap_set_bits
        If the array has a write-intent bitmap, then writing to this
        attribute can set bits in the bitmap, indicating that a resync
        would need to check the corresponding blocks. Either individual
        numbers or start-end pairs can be written. Multiple numbers
        can be separated by a space.

        Note that the numbers are ``bit`` numbers, not ``block`` numbers.
        They should be scaled by the bitmap_chunksize.

      sync_speed_min, sync_speed_max
        These are similar to ``/proc/sys/dev/raid/speed_limit_{min,max}``
        however they only apply to the particular array.

        If no value has been written to these, or if the word ``system``
        is written, then the system-wide value is used. If a value,
        in kibibytes per second, is written, then it is used.

        When the files are read, they show the currently active value
        followed by ``(local)`` or ``(system)`` depending on whether it is
        a locally set or system-wide value.

      sync_completed
        This shows the number of sectors that have been completed of
        whatever the current sync_action is, followed by the number of
        sectors in total that could need to be processed. The two
        numbers are separated by a ``/`` thus effectively showing one
        value, a fraction of the process that is complete.

        A ``select`` on this attribute will return when resync completes,
        when it reaches the current sync_max (below) and possibly at
        other times.

      sync_speed
        This shows the current actual speed, in K/sec, of the current
        sync_action. It is averaged over the last 30 seconds.

      suspend_lo, suspend_hi
        The two values, given as numbers of sectors, indicate a range
        within the array where IO will be blocked. This is currently
        only supported for raid4/5/6.

      sync_min, sync_max
        The two values, given as numbers of sectors, indicate a range
        within the array where ``check``/``repair`` will operate. Must be
        a multiple of chunk_size. When it reaches ``sync_max`` it will
        pause, rather than complete.
        You can use ``select`` or ``poll`` on ``sync_completed`` to wait for
        that number to reach sync_max.
        Then you can either increase
        ``sync_max``, or can write ``idle`` to ``sync_action``.

        The value of ``max`` for ``sync_max`` effectively disables the limit.
        When a resync is active, the value can only ever be increased,
        never decreased.
        The value of ``0`` is the minimum for ``sync_min``.



Each active md device may also have attributes specific to the
personality module that manages it.
These are specific to the implementation of the module and could
change substantially if the implementation changes.

These currently include:

      stripe_cache_size (currently raid5 only)
        number of entries in the stripe cache. This is writable, but
        there are upper and lower limits (32768, 17). Default is 256.

      stripe_cache_active (currently raid5 only)
        number of active entries in the stripe cache

      preread_bypass_threshold (currently raid5 only)
        number of times a stripe requiring preread will be bypassed by
        a stripe that does not require preread. For fairness defaults
        to 1. Setting this to 0 disables bypass accounting and
        requires preread stripes to wait until all full-width stripe-
        writes are complete. Valid values are 0 to stripe_cache_size.

      journal_mode (currently raid5 only)
        The cache mode for raid5. raid5 could include an extra disk for
        caching.
        The mode can be "write-through" and "write-back". The
        default is "write-through".

      ppl_write_hint
        NVMe stream ID to be set for each PPL write request.
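
As a rough illustration of how a script might consume the attributes
described above, the following shell sketch turns the ``done / total`` text
of ``sync_completed`` into a percentage and classifies a value read from
``sync_speed_min`` or ``sync_speed_max`` as locally set or system-wide. The
helper names ``sync_percent`` and ``speed_origin``, the ``MD_DIR`` variable
and the choice of ``md0`` are assumptions made for this example; they are not
part of md. Any text that is not two numbers separated by ``/`` is simply
treated as "no progress figure".

```shell
#!/bin/sh
# Sketch only: MD_DIR and the helper names are illustrative assumptions.
MD_DIR=${MD_DIR:-/sys/block/md0/md}

# Print the integer percentage for a "done / total" string as read from
# sync_completed. Returns 1 for any other text.
sync_percent() {
    case "$1" in
        */*) ;;
        *) return 1 ;;
    esac
    # Take the text on each side of the "/" and drop spaces and newlines.
    done_s=$(printf '%s' "${1%%/*}" | tr -d ' \n')
    total_s=$(printf '%s' "${1#*/}" | tr -d ' \n')
    # Both parts must be non-empty and contain digits only.
    case "$done_s" in ''|*[!0-9]*) return 1 ;; esac
    case "$total_s" in ''|*[!0-9]*) return 1 ;; esac
    [ "$total_s" -gt 0 ] || return 1
    echo $(( done_s * 100 / total_s ))
}

# Print "local" or "system" for a value read from sync_speed_min or
# sync_speed_max, for example "1000 (system)".
speed_origin() {
    case "$1" in
        *"(local)"*)  echo local ;;
        *"(system)"*) echo system ;;
        *) return 1 ;;
    esac
}

# Report progress only when the attribute is readable on this machine.
if [ -r "$MD_DIR/sync_completed" ]; then
    action=$(cat "$MD_DIR/sync_action")
    if pct=$(sync_percent "$(cat "$MD_DIR/sync_completed")"); then
        echo "$action: $pct% complete"
    else
        echo "$action: no progress figure available"
    fi
fi
```

Setting ``MD_DIR`` in the environment points the script at a different
array. Polling the file in a loop would also work, but since
``sync_completed`` responds to ``select``, a long-running monitor would
probably be better written in a language that can wait on the file
descriptor directly.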