==========
MD Cluster
==========

The cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and holds it
 in that mode for as long as the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.

 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.

2.2 Message passing locks
-------------------------

 Each node has to communicate with other nodes when starting or ending
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message", and "ack", together
 with the Lock Value Block (LVB) of the "message" lock.

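
 The bitmap lock naming described in 2.1 can be sketched as follows.
 This is only an illustration under stated assumptions: the helper
 name bitmap_lockres_name() and its error convention are hypothetical
 and are not taken from the md-cluster source; only the "bitmapNNN"
 form and the "DLM slot minus one" rule come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the md-cluster source): build the name
 * of the bitmap lock resource for a node, given its DLM slot number.
 * DLM slots start at one, bitmap slots start at zero.
 *
 * Returns the bitmap slot number, or -1 on invalid input or if the
 * name does not fit into the buffer.
 */
static int bitmap_lockres_name(int dlm_slot, char *buf, size_t len)
{
	int bitmap_slot;
	int n;

	assert(buf != NULL);

	if (dlm_slot < 1)
		return -1;

	bitmap_slot = dlm_slot - 1;

	/* "bitmap000" for node 1, "bitmap001" for node 2, and so on */
	n = snprintf(buf, len, "bitmap%03d", bitmap_slot);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return bitmap_slot;
}
```

 With a 16-byte buffer, DLM slot 1 gives the name "bitmap000" and
 bitmap slot 0, and DLM slot 2 gives "bitmap001" and bitmap slot 1.
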

2.3 new-device management
-------------------------

 A single lock: "no-new-dev" is used to co-ordinate the addition of
 new devices - this must be synchronized across the array.
 Normally all nodes hold a concurrent-read lock on this resource.

3. Communication
================

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types
-----------------

 There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

   Informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal device
   failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^

   Informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per-node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

   Informs other nodes that a device is being added to
   the array. Message contains an identifier for that device.  See
   below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

   A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

   A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

   If a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take the ownership of
   resync.

3.2 Communication mechanism
---------------------------

 The DLM LVB is used to communicate between nodes of the cluster. There
 are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^

   The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

3.2.2 message
^^^^^^^^^^^^^

   The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

   The resource, acquiring which means the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving node that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have concurrent-reader lock on "ack"::

      sender                         receiver                 receiver
      "ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token",
    sender gets EX on "message"::

      sender                         receiver                 receiver
      "token":EX                     "ack":CR                 "ack":CR
      "message":EX
      "ack":CR

    Sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB

    sender down-converts "message" from EX to CW

    sender tries to get EX on "ack"

    ::

      [ wait until all receivers have *processed* the "message" ]

                                     [ triggered by bast of "ack" ]
                                     receiver gets CR on "message"
                                     receiver reads LVB
                                     receiver processes the message
                                     [ wait finish ]
                                     receiver releases "ack"
                                     receiver tries to get PR on "message"

      sender                         receiver                 receiver
      "token":EX                     "message":CR             "message":CR
      "message":CW
      "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed message)

    sender down-converts "ack" from EX to CR

    sender releases "message"

    sender releases "token"

    ::

                                     receiver up-converts to PR on "message"
                                     receiver gets CR on "ack"
                                     receiver releases "message"

      sender                         receiver                 receiver
      "ack":CR                       "ack":CR                 "ack":CR


4. Handling Failures
====================

4.1 Node Failure
----------------

 When a node fails, the DLM informs the cluster with the slot
 number.
 The node starts a cluster recovery thread. The cluster
 recovery thread:

	- acquires the bitmap<number> lock of the failed node
	- opens the bitmap
	- reads the bitmap of the failed node
	- copies the set bits to the bitmap of the local node
	- clears the bitmap of the failed node
	- releases the bitmap<number> lock of the failed node
	- initiates resync of the bitmap on the current node.
	  md_check_recovery is invoked within recover_bitmaps;
	  md_check_recovery then calls metadata_update_start/finish,
	  which locks the communication by lock_comm.
	  This means that while one node is resyncing it blocks all
	  other nodes from writing anywhere on the array.

 The resync process is the regular md resync. However, in a clustered
 environment when a resync is performed, it needs to tell other nodes
 of the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing resync finishes, it sends RESYNCING with an empty range to
 other nodes and other nodes remove the corresponding entry from the
 suspend_list.

 A helper function, ->area_resyncing() can be used to check if a
 particular I/O range should be suspended or not.

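
 The suspend_list bookkeeping described above can be sketched roughly
 as follows. This is a simplified illustration and not the md-cluster
 implementation: the names handle_resyncing() and range_suspended(),
 the fixed MAX_NODES table, the half-open [lo, hi) convention and the
 encoding of an empty range as lo >= hi are all assumptions made for
 the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 4	/* assumption: small fixed cluster for the sketch */

/* One suspended sector range, half-open: [lo, hi) (assumed convention) */
struct suspend_range {
	bool     active;
	uint64_t lo;
	uint64_t hi;
};

/*
 * One entry per sending node: a later RESYNCING message from the same
 * node replaces the earlier one, so a node has at most one range.
 */
static struct suspend_range suspend_list[MAX_NODES];

/* Hypothetical handler for a received RESYNCING message. */
static void handle_resyncing(int slot, uint64_t lo, uint64_t hi)
{
	assert(slot >= 0 && slot < MAX_NODES);

	if (lo >= hi) {
		/* empty range: the sender finished, drop its entry */
		suspend_list[slot].active = false;
		return;
	}

	suspend_list[slot].active = true;
	suspend_list[slot].lo = lo;
	suspend_list[slot].hi = hi;
}

/* Hypothetical check: does the I/O range [lo, hi) hit a suspended area? */
static bool range_suspended(uint64_t lo, uint64_t hi)
{
	int i;

	for (i = 0; i < MAX_NODES; i++) {
		const struct suspend_range *s = &suspend_list[i];

		/* two half-open ranges overlap if each starts before the other ends */
		if (s->active && lo < s->hi && s->lo < hi)
			return true;
	}
	return false;
}
```

 In this sketch, after handle_resyncing(1, 100, 200) an I/O to sectors
 [150, 160) is reported as suspended, while [200, 300) is not. A
 following handle_resyncing(1, 0, 0) removes the entry for node 1
 again.
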

4.2 Device Failure
------------------

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device
======================

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

   1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
   2.  Node 1 sends a NEWDISK message with uuid and slot number
   3.  Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
   4.  In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
   5.  Other nodes issue either of the following depending on whether
       the disk was found:

       - ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
         disc.number set to slot number) if the disk was found
       - ioctl(CLUSTERED_DISK_NACK) otherwise

   6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
   7.  Node 1 attempts EX lock on "no-new-dev"
   8.  If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
   9.  If node 1 does not get the "no-new-dev" lock, it fails the
       operation and sends METADATA_UPDATED.
   10. Other nodes get the information whether a disk is added or not
       by the following METADATA_UPDATED.

6. Module interface
===================

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()
---------------------------

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()
-----------------------------------

 These are called when resync/recovery/reshape starts or stops.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start reports the whole
 array as resyncing, resync_finish reports none of it.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
-------------------------------------------------------------------------------

 metadata_update_start is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes, otherwise metadata_update_cancel()
 can be used to release the lock.

6.6 area_resyncing()
--------------------

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID handling
 a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement to a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()
-----------------

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()
--------------------

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

 These are called when the bitmap is changed to none. If a node plans
 to clear the cluster raid's bitmap, it needs to make sure no other
 nodes are using the raid, which is achieved by locking all bitmap
 locks within the cluster. unlock_all_bitmaps() releases those locks
 again.

7. Unsupported features
=======================

There are some things which are not supported by cluster MD yet.

- change array_sectors.