==========
MD Cluster
==========

Cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and holds it
 for as long as the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.

 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.
56*4882a593Smuzhiyun
2.2 Message passing locks
-------------------------

 Each node has to communicate with other nodes when starting or ending
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message", and "ack", together
 with the Lock Value Block (LVB) of the "message" lock.

2.3 new-device management
-------------------------

 A single lock, "no-new-dev", is used to co-ordinate the addition of
 new devices - this must be synchronized across the array.
 Normally all nodes hold a concurrent-read lock on this resource.

3. Communication
================

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types
-----------------

 There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

   informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal device
   failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^
   informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

   informs other nodes that a device is being added to
   the array. The message contains an identifier for that device.  See
   below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

   A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

   A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

   If a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take the ownership of
   resync.
124*4882a593Smuzhiyun
3.2 Communication mechanism
---------------------------

 The DLM LVB is used to communicate between the nodes of the cluster.
 There are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^
   The resource which protects the entire communication
   system. The node holding the token resource is allowed to
   communicate.

3.2.2 message
^^^^^^^^^^^^^
   The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

   Acquiring this resource indicates that the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving node that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have concurrent-reader lock on "ack"::

	sender                         receiver                 receiver
	"ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token",
    sender gets EX on "message"::

	sender                        receiver                 receiver
	"token":EX                    "ack":CR                 "ack":CR
	"message":EX
	"ack":CR

    The sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB

    sender down-converts "message" from EX to CW

    sender tries to get EX on "ack"

    ::

      [ wait until all receivers have *processed* the "message" ]

                                       [ triggered by bast of "ack" ]
                                       receiver gets CR on "message"
                                       receiver reads LVB
                                       receiver processes the message
                                       [ wait finish ]
                                       receiver releases "ack"
                                       receiver tries to get PR on "message"

     sender                         receiver                  receiver
     "token":EX                     "message":CR              "message":CR
     "message":CW
     "ack":EX
190*4882a593Smuzhiyun
 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed the message)

    sender down-converts "ack" from EX to CR

    sender releases "message"

    sender releases "token"

    ::

                                 receiver up-converts to PR on "message"
                                 receiver gets CR on "ack"
                                 receiver releases "message"

     sender                      receiver                   receiver
     "ack":CR                    "ack":CR                   "ack":CR


4. Handling Failures
====================

4.1 Node Failure
----------------

 When a node fails, the DLM informs the cluster with the slot
 number. The node starts a cluster recovery thread. The cluster
 recovery thread:

	- acquires the bitmap<number> lock of the failed node
	- opens the bitmap
	- reads the bitmap of the failed node
	- copies the set bitmap to local node
	- cleans the bitmap of the failed node
	- releases bitmap<number> lock of the failed node
	- initiates resync of the bitmap on the current node.
	  md_check_recovery is invoked within recover_bitmaps, and
	  md_check_recovery then calls metadata_update_start/finish,
	  which locks the communication by lock_comm. This means that
	  while one node is resyncing, all other nodes are blocked
	  from writing anywhere on the array.
232*4882a593Smuzhiyun
 The resync process is the regular md resync. However, in a clustered
 environment when a resync is performed, it needs to tell other nodes
 of the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing the resync finishes, it sends RESYNCING with an empty range
 to the other nodes, and the other nodes remove the corresponding entry
 from the suspend_list.

 A helper function, ->area_resyncing(), can be used to check whether a
 particular I/O range should be suspended or not.
246*4882a593Smuzhiyun
4.2 Device Failure
------------------

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device
======================

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

   1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
   2.  Node 1 sends a NEWDISK message with uuid and slot number
   3.  Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
   4.  In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
   5.  Other nodes issue either of the following depending on whether
       the disk was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
       disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
   6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
   7.  Node 1 attempts EX lock on "no-new-dev"
   8.  If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
   9.  If node 1 cannot get the "no-new-dev" lock, it fails the
       operation and sends METADATA_UPDATED.
   10. Other nodes get the information whether a disk is added or not
       by the following METADATA_UPDATED.
281*4882a593Smuzhiyun
6. Module interface
===================

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()
---------------------------

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()
-----------------------------------

 These are called when resync/recovery/reshape starts or stops.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start reports the whole
 array as resyncing, resync_finish reports none of it.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
-------------------------------------------------------------------------------

 metadata_update_start is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes, otherwise metadata_update_cancel()
 can be used to release the lock.

6.6 area_resyncing()
--------------------

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID handling
 a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement to a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()
-----------------

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()
--------------------

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

 These are called when the bitmap is being changed to none. If a node
 plans to clear the bitmap of a clustered raid, it needs to make sure
 that no other node is using the array. This is achieved by locking
 all the bitmap locks within the cluster, and those locks are unlocked
 accordingly afterwards.
379*4882a593Smuzhiyun
7. Unsupported features
=======================

There are some features which cluster MD does not support yet:

- changing array_sectors.