==========
MD Cluster
==========

The cluster MD is a shared-device RAID for a cluster; it supports
two levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |

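The layout above can be read as a simple offset calculation. The Python
sketch below is illustrative only: it assumes the 4k cells and the
two-cells-per-node spacing drawn in the diagram, whereas the real size of
each node's bitmap depends on the array. The names bitmap_super_offset()
and PER_NODE_BYTES are hypothetical and do not come from the kernel source.

```python
KIB = 1024
FIRST_BITMAP_OFFSET = 8 * KIB   # after the idle block (0-4k) and the md super (4k-8k)
PER_NODE_BYTES = 8 * KIB        # assumption: two 4k cells per node, as drawn above


def bitmap_super_offset(slot: int, per_node_bytes: int = PER_NODE_BYTES) -> int:
    """Byte offset of the bitmap superblock for a zero-based bitmap slot."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return FIRST_BITMAP_OFFSET + slot * per_node_bytes


# Offsets (in KiB) of bm super[0] .. bm super[3] under the assumptions above.
offsets_kib = [bitmap_super_offset(s) // KIB for s in range(4)]
```

With these assumptions, the four bitmap superblocks land in the same cells
as "bm super[0]" to "bm super[3]" in the diagram.
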
During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will:

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

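The three steps of the write path can be sketched as a toy model in Python.
This is not the md implementation: the class NodeBitmap, the chunk size,
the clear_delay value, and the use of dictionaries as mirrors are all
assumptions made for illustration.

```python
import time


class NodeBitmap:
    """Toy write-intent bitmap for one node: a set of dirty chunk numbers."""

    def __init__(self, chunk_sectors, clear_delay=5.0):
        self.chunk_sectors = chunk_sectors
        self.clear_delay = clear_delay   # assumed timeout, in seconds
        self.dirty = set()               # chunks whose bit is currently set
        self.clear_at = {}               # chunk -> time after which the bit may clear

    def write(self, mirrors, sector, data, now=None):
        now = time.monotonic() if now is None else now
        chunk = sector // self.chunk_sectors
        self.dirty.add(chunk)                          # 1. set the bit (no-op if set)
        for mirror in mirrors:                         # 2. commit the write to all mirrors
            mirror[sector] = data
        self.clear_at[chunk] = now + self.clear_delay  # 3. schedule the bit to be cleared

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        for chunk in [c for c, t in self.clear_at.items() if t <= now]:
            self.dirty.discard(chunk)
            del self.clear_at[chunk]
```

A later write to the same chunk simply pushes the clearing time back, so a
bit stays set for as long as its chunk keeps receiving writes.
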
Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and holds it
 in that mode for as long as the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.

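The slot arithmetic and naming can be written out as a short Python sketch.
The function names are hypothetical, and the three-digit width simply
follows the names quoted in this document; the kernel source may format
the number differently.

```python
def bitmap_slot(dlm_slot: int) -> int:
    """Convert a one-based DLM slot number to a zero-based bitmap slot."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    return dlm_slot - 1


def bitmap_lock_name(dlm_slot: int) -> str:
    """Name of the bitmap lock resource for a node, as described above."""
    return "bitmap%03d" % bitmap_slot(dlm_slot)
```

For example, bitmap_lock_name(1) gives the resource name used by node 1.
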
 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.

2.2 Message passing locks
-------------------------

 Each node has to communicate with other nodes when starting or ending
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message", and "ack", together
 with the Lock Value Block (LVB) of the "message" lock.

2.3 new-device management
-------------------------

 A single lock: "no-new-dev" is used to co-ordinate the addition of
 new devices - this must be synchronized across the array.
 Normally all nodes hold this lock in concurrent-read (CR) mode.

3. Communication
================

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types
-----------------

 There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

   Informs other nodes that the metadata has
   been updated, and that each node must re-read the md superblock.
   This is performed synchronously. It is primarily used to signal
   device failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^

   Informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per-node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

   Informs other nodes that a device is being added to
   the array. The message contains an identifier for that device.  See
   below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

   A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

   A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

   If a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take the ownership of
   resync.

3.2 Communication mechanism
---------------------------

 The DLM LVB is used to communicate between nodes of the cluster. There
 are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^

   The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

3.2.2 message
^^^^^^^^^^^^^

   The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

   The resource whose acquisition means that the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving node that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have concurrent-reader lock on "ack"::

        sender                         receiver                 receiver
        "ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token",
    sender gets EX on "message"::

        sender                        receiver                 receiver
        "token":EX                    "ack":CR                 "ack":CR
        "message":EX
        "ack":CR

    Sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB

    sender down-converts "message" from EX to CW

    sender tries to get EX on "ack"

    ::

      [ wait until all receivers have *processed* the "message" ]

                                       [ triggered by bast of "ack" ]
                                       receiver gets CR on "message"
                                       receiver reads LVB
                                       receiver processes the message
                                       [ wait finish ]
                                       receiver releases "ack"
                                       receiver tries to get PR on "message"

     sender                         receiver                  receiver
     "token":EX                     "message":CR              "message":CR
     "message":CW
     "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed message)

    sender down-converts "ack" from EX to CR

    sender releases "message"

    sender releases "token"

    ::

                                 receiver upconverts to PR on "message"
                                 receiver gets CR on "ack"
                                 receiver releases "message"

     sender                      receiver                   receiver
     "ack":CR                    "ack":CR                   "ack":CR


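The waits in the algorithm above follow from the standard DLM lock-mode
compatibility rules. The Python sketch below encodes that compatibility
table and checks the grants the protocol relies on. It models only mode
compatibility, not the DLM API; the names COMPAT and can_grant() are
hypothetical.

```python
# COMPAT[held] is the set of modes that can be granted to another holder
# while some node holds the resource in mode `held`.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """True if `requested` is compatible with every mode held by other nodes."""
    return all(requested in COMPAT[held] for held in held_by_others)


# Step 3: the sender's EX request on "ack" waits while receivers hold CR.
ack_ex_blocked = not can_grant("EX", ["CR", "CR"])
# Step 3: a receiver can take CR on "message" while the sender holds CW.
message_cr_ok = can_grant("CR", ["CW"])
# Step 3: a receiver's PR request on "message" waits while the sender holds CW.
message_pr_blocked = not can_grant("PR", ["CW"])
# Step 4: once every receiver has released "ack", the sender's EX is granted.
ack_ex_granted = can_grant("EX", [])
# Step 4: after the sender releases "message", receivers' PR locks coexist.
message_pr_shared = can_grant("PR", ["PR", "CR"])
```

Seen this way, the grant of EX on "ack" acts as the acknowledgement: it
cannot be granted until every receiver has released its CR lock on "ack",
which a receiver does only after processing the message.
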
4. Handling Failures
====================

4.1 Node Failure
----------------

 When a node fails, the DLM informs the cluster with the slot
 number. A node that receives this notification starts a cluster
 recovery thread. The cluster recovery thread:

        - acquires the bitmap<number> lock of the failed node
        - opens the bitmap
        - reads the bitmap of the failed node
        - copies the set bits to the local node's bitmap
        - cleans the bitmap of the failed node
        - releases the bitmap<number> lock of the failed node
        - initiates resync of the bitmap on the current node.
          md_check_recovery is invoked within recover_bitmaps;
          md_check_recovery then calls metadata_update_start/finish,
          which locks the communication by lock_comm.
          This means that while one node is resyncing, it blocks all
          other nodes from writing anywhere on the array.

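The bitmap-handling part of the recovery thread can be sketched in Python
as follows. Bitmaps are modelled as sets of dirty chunk numbers, locking is
left to the caller, and the function name recover_bitmap() is hypothetical;
this is a model of the steps listed above, not the kernel code.

```python
def recover_bitmap(failed_bits: set, local_bits: set) -> set:
    """Fold a failed node's dirty chunks into the local bitmap.

    The caller is assumed to hold the failed node's bitmap lock for the
    duration of the call. Returns the chunks the local node must resync.
    """
    to_resync = set(failed_bits)   # read the bitmap of the failed node
    local_bits |= to_resync        # copy the set bits to the local bitmap
    failed_bits.clear()            # clean the bitmap of the failed node
    return to_resync
```

After the call, the failed node's bitmap is clean and the local bitmap
carries every chunk that may have had a write in flight on the failed node.
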
 The resync process is the regular md resync. However, in a clustered
 environment when a resync is performed, it needs to tell other nodes
 of the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing resync finishes, it sends RESYNCING with an empty range to
 other nodes and other nodes remove the corresponding entry from the
 suspend_list.

 A helper function, ->area_resyncing(), can be used to check if a
 particular I/O range should be suspended or not.

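A minimal Python sketch of the suspend_list bookkeeping is shown below. It
assumes one (lo, hi) entry per sending node, that an "empty range" is one
with hi <= lo, and that ranges are half-open; the class SuspendList and
its method names are hypothetical.

```python
class SuspendList:
    """Toy suspend_list: at most one suspended (lo, hi) range per sender."""

    def __init__(self):
        self.ranges = {}   # sender slot -> (lo, hi)

    def on_resyncing(self, slot, lo, hi):
        """Handle a RESYNCING message from the node in `slot`."""
        if hi <= lo:
            # Empty range: the sender has finished, drop its entry.
            self.ranges.pop(slot, None)
        else:
            # A new range overrides any previous one from the same sender.
            self.ranges[slot] = (lo, hi)

    def area_resyncing(self, lo, hi):
        """True if [lo, hi) overlaps any range that is currently suspended."""
        return any(hi > s_lo and lo < s_hi
                   for s_lo, s_hi in self.ranges.values())
```

Keying the entries by sender reflects the rule that each node resyncs only
one range at a time.
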
4.2 Device Failure
------------------

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device
======================

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

   1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
   2.  Node 1 sends a NEWDISK message with uuid and slot number
   3.  Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
   4.  In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
   5.  Other nodes issue one of the following, depending on whether
       the disk was found. If it was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
       disc.number set to slot number).
       If it was not found:
       ioctl(CLUSTERED_DISK_NACK)
   6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
   7.  Node 1 attempts EX lock on "no-new-dev"
   8.  If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
   9.  If node 1 does not get the "no-new-dev" lock, it fails the
       operation and sends METADATA_UPDATED.
   10. Other nodes get the information whether a disk is added or not
       by the following METADATA_UPDATED.

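The outcome of steps 6 to 9 depends only on whether every other node
dropped its CR lock on "no-new-dev". The Python sketch below models that
decision; the function add_disk_outcome() and its return values are
hypothetical, and the way the real EX request is issued (blocking or not)
is not modelled.

```python
def add_disk_outcome(found_on_other_nodes):
    """Decide the result of an add, as seen by node 1.

    found_on_other_nodes holds one bool per other node: True means the node
    found the disk and dropped its CR lock on "no-new-dev"; False means it
    sent a NACK and kept the lock.
    """
    # EX is incompatible with CR, so node 1 gets EX only if nobody kept CR.
    got_ex = all(found_on_other_nodes)
    result = "added" if got_ex else "failed"
    # Either way, METADATA_UPDATED tells the other nodes what happened.
    return result, "METADATA_UPDATED"
```

A single node that cannot see the new device is therefore enough to make
the whole operation fail.
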
6. Module interface
===================

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()
---------------------------

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.

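The range kept in the bitmap lock's LVB can be pictured as a pair of sector
numbers. The Python sketch below assumes, purely for illustration, that the
pair is stored as two little-endian 64-bit integers; the actual LVB layout
is not specified in this document, and pack_resync_info() and
read_resync_info() are hypothetical names.

```python
import struct

LVB_FORMAT = "<QQ"   # assumption: (lo, hi) as two little-endian 64-bit sectors


def pack_resync_info(progress_sector, array_sectors):
    """Build the LVB contents: lo follows the resync, hi is the array end."""
    lo, hi = progress_sector, array_sectors
    return struct.pack(LVB_FORMAT, lo, hi)


def read_resync_info(lvb):
    """Recover (lo, hi) from LVB contents written by pack_resync_info()."""
    size = struct.calcsize(LVB_FORMAT)
    return struct.unpack(LVB_FORMAT, lvb[:size])
```

As the resync progresses, only the first value changes; the second stays at
the end of the array.
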
6.4 resync_start(), resync_finish()
-----------------------------------

 These are called when resync/recovery/reshape starts or stops.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start reports the whole
 array as resyncing, resync_finish reports none of it.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
-------------------------------------------------------------------------------

 metadata_update_start is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes, otherwise metadata_update_cancel()
 can be used to release the lock.

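The calling pattern for these three call-backs can be sketched in Python as
follows. ToyCluster is a hypothetical stand-in that only tracks the lock
state and the messages sent; the real call-backs take arguments and do
considerably more work.

```python
class ToyCluster:
    """Stand-in for the cluster module: tracks the lock and sent messages."""

    def __init__(self):
        self.locked = False
        self.sent = []

    def metadata_update_start(self):
        if self.locked:
            raise RuntimeError("metadata update already in progress")
        self.locked = True                     # exclusive access gained

    def metadata_update_finish(self):
        self.sent.append("METADATA_UPDATED")   # tell all other nodes
        self.locked = False

    def metadata_update_cancel(self):
        self.locked = False                    # nothing changed, just unlock


def update_metadata(cluster, still_needed):
    """Take the lock, then re-check whether the update is still needed."""
    cluster.metadata_update_start()
    if still_needed():
        cluster.metadata_update_finish()
        return True
    cluster.metadata_update_cancel()
    return False
```

The re-check after taking the lock matters because a message received while
waiting may already have made the update redundant.
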
6.6 area_resyncing()
--------------------

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID handling
 a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement to a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()
-----------------

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()
--------------------

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

 These are called when the bitmap is changed to none. If a node plans
 to clear the cluster raid's bitmap, it needs to make sure no other
 nodes are using the raid. This is achieved by locking all bitmap
 locks within the cluster; unlock_all_bitmaps() releases those locks
 again.

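The idea can be sketched in Python as a try-lock over every other node's
bitmap lock, with a rollback if any of them is already held. The dictionary
held_by and both function signatures are assumptions made for this sketch;
they do not mirror the kernel call-backs.

```python
def lock_all_bitmaps(held_by, my_slot, total_slots):
    """Try to take every other slot's bitmap lock.

    held_by maps a bitmap slot to the slot of the node holding its lock.
    Returns the list of slots taken, or None if any lock was already held,
    which means another node is still using the array.
    """
    taken = []
    for slot in range(total_slots):
        if slot == my_slot:
            continue                            # our own lock is already held
        if slot in held_by:
            unlock_all_bitmaps(held_by, taken)  # roll back the partial set
            return None
        held_by[slot] = my_slot
        taken.append(slot)
    return taken


def unlock_all_bitmaps(held_by, taken):
    """Release the locks previously taken by lock_all_bitmaps()."""
    for slot in taken:
        del held_by[slot]
```

Because every active node holds its own bitmap lock for as long as it is in
the cluster, a failure to take any one lock is enough to show that the
array is still in use elsewhere.
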
7. Unsupported features
=======================

Some things are not yet supported by cluster MD:

- change array_sectors.