==================
Partial Parity Log
==================

Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, the parity of a particular
stripe may become inconsistent with the data on the other member disks. If the
array is also in a degraded state, there is no way to recalculate that parity,
because one of the disks is missing. This can lead to silent data corruption
when rebuilding the array or using it as degraded - data calculated from parity
for array blocks that have not been touched by a write request during the
unclean shutdown can be incorrect. This condition is known as the RAID5 Write
Hole. Because of this, md by default does not allow starting a dirty degraded
array.

Partial parity for a write operation is the XOR of the stripe data chunks not
modified by that write. It is just enough data to recover from the write hole.
XORing the partial parity with the modified chunks produces the parity for the
stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the unmodified data disks of this
stripe is missing, this updated parity can be used to recover its contents.
PPL recovery is also performed when starting an array after an unclean shutdown
with all disks available, eliminating the need to resync the array. Because of
this, using a write-intent bitmap together with PPL is not supported.

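As a rough illustration of the arithmetic (a user-space sketch with a
simplified stripe layout and chunk size, not the kernel implementation),
partial parity can be computed and then used to rebuild a missing, unmodified
chunk like this::

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNKS 4          /* data chunks per stripe (simplified) */
    #define CHUNK_SIZE 8      /* bytes per chunk (simplified) */

    static void xor_into(uint8_t *dst, const uint8_t *src)
    {
            int i;

            for (i = 0; i < CHUNK_SIZE; i++)
                    dst[i] ^= src[i];
    }

    int main(void)
    {
            uint8_t data[CHUNKS][CHUNK_SIZE];
            uint8_t partial[CHUNK_SIZE] = { 0 };
            uint8_t parity[CHUNK_SIZE];
            uint8_t recovered[CHUNK_SIZE];
            int modified[CHUNKS] = { 1, 1, 0, 0 };  /* write touches chunks 0, 1 */
            int d;

            /* Fill the stripe with some example data. */
            for (d = 0; d < CHUNKS; d++)
                    memset(data[d], 0x10 + d, CHUNK_SIZE);

            /* Partial parity: XOR of the chunks NOT modified by this write.
             * This is what PPL makes durable before the new data and parity
             * are dispatched to the member drives. */
            for (d = 0; d < CHUNKS; d++)
                    if (!modified[d])
                            xor_into(partial, data[d]);

            /* XORing the partial parity with the modified chunks (in the
             * real case, whatever state they were left in on disk) gives a
             * parity consistent with those chunks... */
            memcpy(parity, partial, CHUNK_SIZE);
            for (d = 0; d < CHUNKS; d++)
                    if (modified[d])
                            xor_into(parity, data[d]);

            /* ...so a missing unmodified chunk, say chunk 2, can be rebuilt
             * from that parity and the surviving chunks. */
            memcpy(recovered, parity, CHUNK_SIZE);
            for (d = 0; d < CHUNKS; d++)
                    if (d != 2)
                            xor_into(recovered, data[d]);

            assert(memcmp(recovered, data[2], CHUNK_SIZE) == 0);
            return 0;
    }
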
When handling a write request, PPL writes the partial parity before the new
data and parity are dispatched to the disks. PPL is a distributed log - it is
stored in the metadata area on the array member drives, on the parity drive of
the particular stripe. It does not require a dedicated journaling drive. Write
performance is reduced by up to 30%-40%, but it scales with the number of
drives in the array, and there is no single journaling drive to become a
bottleneck or a single point of failure.

Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect against losing in-flight data, only
against silent data corruption. If a dirty disk of a stripe is lost, no PPL
recovery is performed for that stripe (the parity is not updated). So it is
possible to have arbitrary data in the written part of a stripe if that disk
is lost. In such a case the behavior is the same as in plain RAID5.

PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using the mdadm
option --consistency-policy=ppl.

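For example (an illustrative invocation; the md device and member device names
are placeholders), an array with PPL enabled from the start could be created
with::

    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --consistency-policy=ppl /dev/sd[abcd]1
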
PPL limits the array to a maximum of 64 disks. This keeps the data structures
and the implementation simple. RAID5 arrays with that many disks are unlikely
anyway, due to the high risk of multiple disk failures, so this restriction
should not be a limitation in real life.