xref: /OK3568_Linux_fs/kernel/Documentation/driver-api/mtd/nand_ecc.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun==========================
2*4882a593SmuzhiyunNAND Error-correction Code
3*4882a593Smuzhiyun==========================
4*4882a593Smuzhiyun
5*4882a593SmuzhiyunIntroduction
6*4882a593Smuzhiyun============
7*4882a593Smuzhiyun
8*4882a593SmuzhiyunHaving looked at the linux mtd/nand driver and more specific at nand_ecc.c
9*4882a593SmuzhiyunI felt there was room for optimisation. I bashed the code for a few hours
10*4882a593Smuzhiyunperforming tricks like table lookup removing superfluous code etc.
11*4882a593SmuzhiyunAfter that the speed was increased by 35-40%.
12*4882a593SmuzhiyunStill I was not too happy as I felt there was additional room for improvement.
13*4882a593Smuzhiyun
14*4882a593SmuzhiyunBad! I was hooked.
15*4882a593SmuzhiyunI decided to annotate my steps in this file. Perhaps it is useful to someone
16*4882a593Smuzhiyunor someone learns something from it.
17*4882a593Smuzhiyun
18*4882a593Smuzhiyun
19*4882a593SmuzhiyunThe problem
20*4882a593Smuzhiyun===========
21*4882a593Smuzhiyun
22*4882a593SmuzhiyunNAND flash (at least SLC one) typically has sectors of 256 bytes.
23*4882a593SmuzhiyunHowever NAND flash is not extremely reliable so some error detection
24*4882a593Smuzhiyun(and sometimes correction) is needed.
25*4882a593Smuzhiyun
26*4882a593SmuzhiyunThis is done by means of a Hamming code. I'll try to explain it in
27*4882a593Smuzhiyunlaymans terms (and apologies to all the pro's in the field in case I do
28*4882a593Smuzhiyunnot use the right terminology, my coding theory class was almost 30
29*4882a593Smuzhiyunyears ago, and I must admit it was not one of my favourites).
30*4882a593Smuzhiyun
31*4882a593SmuzhiyunAs I said before the ecc calculation is performed on sectors of 256
32*4882a593Smuzhiyunbytes. This is done by calculating several parity bits over the rows and
33*4882a593Smuzhiyuncolumns. The parity used is even parity which means that the parity bit = 1
34*4882a593Smuzhiyunif the data over which the parity is calculated is 1 and the parity bit = 0
35*4882a593Smuzhiyunif the data over which the parity is calculated is 0. So the total
36*4882a593Smuzhiyunnumber of bits over the data over which the parity is calculated + the
37*4882a593Smuzhiyunparity bit is even. (see wikipedia if you can't follow this).
38*4882a593SmuzhiyunParity is often calculated by means of an exclusive or operation,
39*4882a593Smuzhiyunsometimes also referred to as xor. In C the operator for xor is ^
40*4882a593Smuzhiyun
41*4882a593SmuzhiyunBack to ecc.
42*4882a593SmuzhiyunLet's give a small figure:
43*4882a593Smuzhiyun
44*4882a593Smuzhiyun=========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ====
45*4882a593Smuzhiyunbyte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
46*4882a593Smuzhiyunbyte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
47*4882a593Smuzhiyunbyte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
48*4882a593Smuzhiyunbyte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
49*4882a593Smuzhiyunbyte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
50*4882a593Smuzhiyun...
51*4882a593Smuzhiyunbyte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
52*4882a593Smuzhiyunbyte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
53*4882a593Smuzhiyun           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
54*4882a593Smuzhiyun           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
55*4882a593Smuzhiyun           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
56*4882a593Smuzhiyun=========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ====
57*4882a593Smuzhiyun
58*4882a593SmuzhiyunThis figure represents a sector of 256 bytes.
59*4882a593Smuzhiyuncp is my abbreviation for column parity, rp for row parity.
60*4882a593Smuzhiyun
61*4882a593SmuzhiyunLet's start to explain column parity.
62*4882a593Smuzhiyun
63*4882a593Smuzhiyun- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
64*4882a593Smuzhiyun
65*4882a593Smuzhiyun  so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
66*4882a593Smuzhiyun
67*4882a593SmuzhiyunSimilarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
68*4882a593Smuzhiyun
69*4882a593Smuzhiyun- cp2 is the parity over bit0, bit1, bit4 and bit5
70*4882a593Smuzhiyun- cp3 is the parity over bit2, bit3, bit6 and bit7.
71*4882a593Smuzhiyun- cp4 is the parity over bit0, bit1, bit2 and bit3.
72*4882a593Smuzhiyun- cp5 is the parity over bit4, bit5, bit6 and bit7.
73*4882a593Smuzhiyun
74*4882a593SmuzhiyunNote that each of cp0 .. cp5 is exactly one bit.
75*4882a593Smuzhiyun
76*4882a593SmuzhiyunRow parity actually works almost the same.
77*4882a593Smuzhiyun
78*4882a593Smuzhiyun- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
79*4882a593Smuzhiyun- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
80*4882a593Smuzhiyun- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
81*4882a593Smuzhiyun  (so handle two bytes, then skip 2 bytes).
82*4882a593Smuzhiyun- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
83*4882a593Smuzhiyun- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
84*4882a593Smuzhiyun
85*4882a593Smuzhiyun  so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
86*4882a593Smuzhiyun- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
87*4882a593Smuzhiyun
88*4882a593SmuzhiyunThe story now becomes quite boring. I guess you get the idea.
89*4882a593Smuzhiyun
90*4882a593Smuzhiyun- rp6 covers 8 bytes then skips 8 etc
91*4882a593Smuzhiyun- rp7 skips 8 bytes then covers 8 etc
92*4882a593Smuzhiyun- rp8 covers 16 bytes then skips 16 etc
93*4882a593Smuzhiyun- rp9 skips 16 bytes then covers 16 etc
94*4882a593Smuzhiyun- rp10 covers 32 bytes then skips 32 etc
95*4882a593Smuzhiyun- rp11 skips 32 bytes then covers 32 etc
96*4882a593Smuzhiyun- rp12 covers 64 bytes then skips 64 etc
97*4882a593Smuzhiyun- rp13 skips 64 bytes then covers 64 etc
98*4882a593Smuzhiyun- rp14 covers 128 bytes then skips 128
99*4882a593Smuzhiyun- rp15 skips 128 bytes then covers 128
100*4882a593Smuzhiyun
101*4882a593SmuzhiyunIn the end the parity bits are grouped together in three bytes as
102*4882a593Smuzhiyunfollows:
103*4882a593Smuzhiyun
104*4882a593Smuzhiyun=====  ===== ===== ===== ===== ===== ===== ===== =====
105*4882a593SmuzhiyunECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
106*4882a593Smuzhiyun=====  ===== ===== ===== ===== ===== ===== ===== =====
107*4882a593SmuzhiyunECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
108*4882a593SmuzhiyunECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
109*4882a593SmuzhiyunECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
110*4882a593Smuzhiyun=====  ===== ===== ===== ===== ===== ===== ===== =====
111*4882a593Smuzhiyun
112*4882a593SmuzhiyunI detected after writing this that ST application note AN1823
113*4882a593Smuzhiyun(http://www.st.com/stonline/) gives a much
114*4882a593Smuzhiyunnicer picture.(but they use line parity as term where I use row parity)
115*4882a593SmuzhiyunOh well, I'm graphically challenged, so suffer with me for a moment :-)
116*4882a593Smuzhiyun
117*4882a593SmuzhiyunAnd I could not reuse the ST picture anyway for copyright reasons.
118*4882a593Smuzhiyun
119*4882a593Smuzhiyun
120*4882a593SmuzhiyunAttempt 0
121*4882a593Smuzhiyun=========
122*4882a593Smuzhiyun
123*4882a593SmuzhiyunImplementing the parity calculation is pretty simple.
124*4882a593SmuzhiyunIn C pseudocode::
125*4882a593Smuzhiyun
126*4882a593Smuzhiyun  for (i = 0; i < 256; i++)
127*4882a593Smuzhiyun  {
128*4882a593Smuzhiyun    if (i & 0x01)
129*4882a593Smuzhiyun       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
130*4882a593Smuzhiyun    else
131*4882a593Smuzhiyun       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0;
132*4882a593Smuzhiyun    if (i & 0x02)
133*4882a593Smuzhiyun       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
134*4882a593Smuzhiyun    else
135*4882a593Smuzhiyun       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
136*4882a593Smuzhiyun    if (i & 0x04)
137*4882a593Smuzhiyun      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
138*4882a593Smuzhiyun    else
139*4882a593Smuzhiyun      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
140*4882a593Smuzhiyun    if (i & 0x08)
141*4882a593Smuzhiyun      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
142*4882a593Smuzhiyun    else
143*4882a593Smuzhiyun      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
144*4882a593Smuzhiyun    if (i & 0x10)
145*4882a593Smuzhiyun      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
146*4882a593Smuzhiyun    else
147*4882a593Smuzhiyun      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
148*4882a593Smuzhiyun    if (i & 0x20)
149*4882a593Smuzhiyun      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
150*4882a593Smuzhiyun    else
151*4882a593Smuzhiyun      rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
152*4882a593Smuzhiyun    if (i & 0x40)
153*4882a593Smuzhiyun      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
154*4882a593Smuzhiyun    else
155*4882a593Smuzhiyun      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
156*4882a593Smuzhiyun    if (i & 0x80)
157*4882a593Smuzhiyun      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
158*4882a593Smuzhiyun    else
159*4882a593Smuzhiyun      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
160*4882a593Smuzhiyun    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
161*4882a593Smuzhiyun    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
162*4882a593Smuzhiyun    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
163*4882a593Smuzhiyun    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
164*4882a593Smuzhiyun    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
165*4882a593Smuzhiyun    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
166*4882a593Smuzhiyun  }
167*4882a593Smuzhiyun
168*4882a593Smuzhiyun
169*4882a593SmuzhiyunAnalysis 0
170*4882a593Smuzhiyun==========
171*4882a593Smuzhiyun
172*4882a593SmuzhiyunC does have bitwise operators but not really operators to do the above
173*4882a593Smuzhiyunefficiently (and most hardware has no such instructions either).
174*4882a593SmuzhiyunTherefore without implementing this it was clear that the code above was
175*4882a593Smuzhiyunnot going to bring me a Nobel prize :-)
176*4882a593Smuzhiyun
177*4882a593SmuzhiyunFortunately the exclusive or operation is commutative, so we can combine
178*4882a593Smuzhiyunthe values in any order. So instead of calculating all the bits
179*4882a593Smuzhiyunindividually, let us try to rearrange things.
180*4882a593SmuzhiyunFor the column parity this is easy. We can just xor the bytes and in the
181*4882a593Smuzhiyunend filter out the relevant bits. This is pretty nice as it will bring
182*4882a593Smuzhiyunall cp calculation out of the for loop.
183*4882a593Smuzhiyun
184*4882a593SmuzhiyunSimilarly we can first xor the bytes for the various rows.
185*4882a593SmuzhiyunThis leads to:
186*4882a593Smuzhiyun
187*4882a593Smuzhiyun
188*4882a593SmuzhiyunAttempt 1
189*4882a593Smuzhiyun=========
190*4882a593Smuzhiyun
191*4882a593Smuzhiyun::
192*4882a593Smuzhiyun
193*4882a593Smuzhiyun  const char parity[256] = {
194*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
195*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
196*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
197*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
198*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
199*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
200*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
201*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
202*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
203*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
204*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
205*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
206*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
207*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
208*4882a593Smuzhiyun      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
209*4882a593Smuzhiyun      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
210*4882a593Smuzhiyun  };
211*4882a593Smuzhiyun
212*4882a593Smuzhiyun  void ecc1(const unsigned char *buf, unsigned char *code)
213*4882a593Smuzhiyun  {
214*4882a593Smuzhiyun      int i;
215*4882a593Smuzhiyun      const unsigned char *bp = buf;
216*4882a593Smuzhiyun      unsigned char cur;
217*4882a593Smuzhiyun      unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
218*4882a593Smuzhiyun      unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
219*4882a593Smuzhiyun      unsigned char par;
220*4882a593Smuzhiyun
221*4882a593Smuzhiyun      par = 0;
222*4882a593Smuzhiyun      rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
223*4882a593Smuzhiyun      rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
224*4882a593Smuzhiyun      rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
225*4882a593Smuzhiyun      rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
226*4882a593Smuzhiyun
227*4882a593Smuzhiyun      for (i = 0; i < 256; i++)
228*4882a593Smuzhiyun      {
229*4882a593Smuzhiyun          cur = *bp++;
230*4882a593Smuzhiyun          par ^= cur;
231*4882a593Smuzhiyun          if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
232*4882a593Smuzhiyun          if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
233*4882a593Smuzhiyun          if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
234*4882a593Smuzhiyun          if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
235*4882a593Smuzhiyun          if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
236*4882a593Smuzhiyun          if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
237*4882a593Smuzhiyun          if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
238*4882a593Smuzhiyun          if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
239*4882a593Smuzhiyun      }
240*4882a593Smuzhiyun      code[0] =
241*4882a593Smuzhiyun          (parity[rp7] << 7) |
242*4882a593Smuzhiyun          (parity[rp6] << 6) |
243*4882a593Smuzhiyun          (parity[rp5] << 5) |
244*4882a593Smuzhiyun          (parity[rp4] << 4) |
245*4882a593Smuzhiyun          (parity[rp3] << 3) |
246*4882a593Smuzhiyun          (parity[rp2] << 2) |
247*4882a593Smuzhiyun          (parity[rp1] << 1) |
248*4882a593Smuzhiyun          (parity[rp0]);
249*4882a593Smuzhiyun      code[1] =
250*4882a593Smuzhiyun          (parity[rp15] << 7) |
251*4882a593Smuzhiyun          (parity[rp14] << 6) |
252*4882a593Smuzhiyun          (parity[rp13] << 5) |
253*4882a593Smuzhiyun          (parity[rp12] << 4) |
254*4882a593Smuzhiyun          (parity[rp11] << 3) |
255*4882a593Smuzhiyun          (parity[rp10] << 2) |
256*4882a593Smuzhiyun          (parity[rp9]  << 1) |
257*4882a593Smuzhiyun          (parity[rp8]);
258*4882a593Smuzhiyun      code[2] =
259*4882a593Smuzhiyun          (parity[par & 0xf0] << 7) |
260*4882a593Smuzhiyun          (parity[par & 0x0f] << 6) |
261*4882a593Smuzhiyun          (parity[par & 0xcc] << 5) |
262*4882a593Smuzhiyun          (parity[par & 0x33] << 4) |
263*4882a593Smuzhiyun          (parity[par & 0xaa] << 3) |
264*4882a593Smuzhiyun          (parity[par & 0x55] << 2);
265*4882a593Smuzhiyun      code[0] = ~code[0];
266*4882a593Smuzhiyun      code[1] = ~code[1];
267*4882a593Smuzhiyun      code[2] = ~code[2];
268*4882a593Smuzhiyun  }
269*4882a593Smuzhiyun
270*4882a593SmuzhiyunStill pretty straightforward. The last three invert statements are there to
271*4882a593Smuzhiyungive a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
272*4882a593Smuzhiyunall data is 0xff, so the checksum then matches.
273*4882a593Smuzhiyun
274*4882a593SmuzhiyunI also introduced the parity lookup. I expected this to be the fastest
275*4882a593Smuzhiyunway to calculate the parity, but I will investigate alternatives later
276*4882a593Smuzhiyunon.
277*4882a593Smuzhiyun
278*4882a593Smuzhiyun
279*4882a593SmuzhiyunAnalysis 1
280*4882a593Smuzhiyun==========
281*4882a593Smuzhiyun
282*4882a593SmuzhiyunThe code works, but is not terribly efficient. On my system it took
283*4882a593Smuzhiyunalmost 4 times as much time as the linux driver code. But hey, if it was
284*4882a593Smuzhiyun*that* easy this would have been done long before.
285*4882a593SmuzhiyunNo pain. no gain.
286*4882a593Smuzhiyun
287*4882a593SmuzhiyunFortunately there is plenty of room for improvement.
288*4882a593Smuzhiyun
289*4882a593SmuzhiyunIn step 1 we moved from bit-wise calculation to byte-wise calculation.
290*4882a593SmuzhiyunHowever in C we can also use the unsigned long data type and virtually
291*4882a593Smuzhiyunevery modern microprocessor supports 32 bit operations, so why not try
292*4882a593Smuzhiyunto write our code in such a way that we process data in 32 bit chunks.
293*4882a593Smuzhiyun
294*4882a593SmuzhiyunOf course this means some modification as the row parity is byte by
295*4882a593Smuzhiyunbyte. A quick analysis:
296*4882a593Smuzhiyunfor the column parity we use the par variable. When extending to 32 bits
297*4882a593Smuzhiyunwe can in the end easily calculate rp0 and rp1 from it.
298*4882a593Smuzhiyun(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
299*4882a593Smuzhiyunrespectively, from MSB to LSB)
300*4882a593Smuzhiyunalso rp2 and rp3 can be easily retrieved from par as rp3 covers the
301*4882a593Smuzhiyunfirst two MSBs and rp2 covers the last two LSBs.
302*4882a593Smuzhiyun
303*4882a593SmuzhiyunNote that of course now the loop is executed only 64 times (256/4).
304*4882a593SmuzhiyunAnd note that care must taken wrt byte ordering. The way bytes are
305*4882a593Smuzhiyunordered in a long is machine dependent, and might affect us.
306*4882a593SmuzhiyunAnyway, if there is an issue: this code is developed on x86 (to be
307*4882a593Smuzhiyunprecise: a DELL PC with a D920 Intel CPU)
308*4882a593Smuzhiyun
309*4882a593SmuzhiyunAnd of course the performance might depend on alignment, but I expect
310*4882a593Smuzhiyunthat the I/O buffers in the nand driver are aligned properly (and
311*4882a593Smuzhiyunotherwise that should be fixed to get maximum performance).
312*4882a593Smuzhiyun
313*4882a593SmuzhiyunLet's give it a try...
314*4882a593Smuzhiyun
315*4882a593Smuzhiyun
316*4882a593SmuzhiyunAttempt 2
317*4882a593Smuzhiyun=========
318*4882a593Smuzhiyun
319*4882a593Smuzhiyun::
320*4882a593Smuzhiyun
321*4882a593Smuzhiyun  extern const char parity[256];
322*4882a593Smuzhiyun
323*4882a593Smuzhiyun  void ecc2(const unsigned char *buf, unsigned char *code)
324*4882a593Smuzhiyun  {
325*4882a593Smuzhiyun      int i;
326*4882a593Smuzhiyun      const unsigned long *bp = (unsigned long *)buf;
327*4882a593Smuzhiyun      unsigned long cur;
328*4882a593Smuzhiyun      unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
329*4882a593Smuzhiyun      unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
330*4882a593Smuzhiyun      unsigned long par;
331*4882a593Smuzhiyun
332*4882a593Smuzhiyun      par = 0;
333*4882a593Smuzhiyun      rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
334*4882a593Smuzhiyun      rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
335*4882a593Smuzhiyun      rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
336*4882a593Smuzhiyun      rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
337*4882a593Smuzhiyun
338*4882a593Smuzhiyun      for (i = 0; i < 64; i++)
339*4882a593Smuzhiyun      {
340*4882a593Smuzhiyun          cur = *bp++;
341*4882a593Smuzhiyun          par ^= cur;
342*4882a593Smuzhiyun          if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
343*4882a593Smuzhiyun          if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
344*4882a593Smuzhiyun          if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
345*4882a593Smuzhiyun          if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
346*4882a593Smuzhiyun          if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
347*4882a593Smuzhiyun          if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
348*4882a593Smuzhiyun      }
349*4882a593Smuzhiyun      /*
350*4882a593Smuzhiyun         we need to adapt the code generation for the fact that rp vars are now
351*4882a593Smuzhiyun         long; also the column parity calculation needs to be changed.
352*4882a593Smuzhiyun         we'll bring rp4 to 15 back to single byte entities by shifting and
353*4882a593Smuzhiyun         xoring
354*4882a593Smuzhiyun      */
355*4882a593Smuzhiyun      rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
356*4882a593Smuzhiyun      rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
357*4882a593Smuzhiyun      rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
358*4882a593Smuzhiyun      rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
359*4882a593Smuzhiyun      rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
360*4882a593Smuzhiyun      rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
361*4882a593Smuzhiyun      rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
362*4882a593Smuzhiyun      rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
363*4882a593Smuzhiyun      rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
364*4882a593Smuzhiyun      rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
365*4882a593Smuzhiyun      rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
366*4882a593Smuzhiyun      rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
367*4882a593Smuzhiyun      rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
368*4882a593Smuzhiyun      rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
369*4882a593Smuzhiyun      par ^= (par >> 16);
370*4882a593Smuzhiyun      rp1 = (par >> 8); rp1 &= 0xff;
371*4882a593Smuzhiyun      rp0 = (par & 0xff);
372*4882a593Smuzhiyun      par ^= (par >> 8); par &= 0xff;
373*4882a593Smuzhiyun
374*4882a593Smuzhiyun      code[0] =
375*4882a593Smuzhiyun          (parity[rp7] << 7) |
376*4882a593Smuzhiyun          (parity[rp6] << 6) |
377*4882a593Smuzhiyun          (parity[rp5] << 5) |
378*4882a593Smuzhiyun          (parity[rp4] << 4) |
379*4882a593Smuzhiyun          (parity[rp3] << 3) |
380*4882a593Smuzhiyun          (parity[rp2] << 2) |
381*4882a593Smuzhiyun          (parity[rp1] << 1) |
382*4882a593Smuzhiyun          (parity[rp0]);
383*4882a593Smuzhiyun      code[1] =
384*4882a593Smuzhiyun          (parity[rp15] << 7) |
385*4882a593Smuzhiyun          (parity[rp14] << 6) |
386*4882a593Smuzhiyun          (parity[rp13] << 5) |
387*4882a593Smuzhiyun          (parity[rp12] << 4) |
388*4882a593Smuzhiyun          (parity[rp11] << 3) |
389*4882a593Smuzhiyun          (parity[rp10] << 2) |
390*4882a593Smuzhiyun          (parity[rp9]  << 1) |
391*4882a593Smuzhiyun          (parity[rp8]);
392*4882a593Smuzhiyun      code[2] =
393*4882a593Smuzhiyun          (parity[par & 0xf0] << 7) |
394*4882a593Smuzhiyun          (parity[par & 0x0f] << 6) |
395*4882a593Smuzhiyun          (parity[par & 0xcc] << 5) |
396*4882a593Smuzhiyun          (parity[par & 0x33] << 4) |
397*4882a593Smuzhiyun          (parity[par & 0xaa] << 3) |
398*4882a593Smuzhiyun          (parity[par & 0x55] << 2);
399*4882a593Smuzhiyun      code[0] = ~code[0];
400*4882a593Smuzhiyun      code[1] = ~code[1];
401*4882a593Smuzhiyun      code[2] = ~code[2];
402*4882a593Smuzhiyun  }
403*4882a593Smuzhiyun
404*4882a593SmuzhiyunThe parity array is not shown any more. Note also that for these
405*4882a593Smuzhiyunexamples I kinda deviated from my regular programming style by allowing
406*4882a593Smuzhiyunmultiple statements on a line, not using { } in then and else blocks
407*4882a593Smuzhiyunwith only a single statement and by using operators like ^=
408*4882a593Smuzhiyun
409*4882a593Smuzhiyun
410*4882a593SmuzhiyunAnalysis 2
411*4882a593Smuzhiyun==========
412*4882a593Smuzhiyun
413*4882a593SmuzhiyunThe code (of course) works, and hurray: we are a little bit faster than
414*4882a593Smuzhiyunthe linux driver code (about 15%). But wait, don't cheer too quickly.
415*4882a593SmuzhiyunThere is more to be gained.
416*4882a593SmuzhiyunIf we look at e.g. rp14 and rp15 we see that we either xor our data with
417*4882a593Smuzhiyunrp14 or with rp15. However we also have par which goes over all data.
418*4882a593SmuzhiyunThis means there is no need to calculate rp14 as it can be calculated from
419*4882a593Smuzhiyunrp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15;
420*4882a593Smuzhiyun(or if desired we can avoid calculating rp15 and calculate it from
421*4882a593Smuzhiyunrp14).  That is why some places refer to inverse parity.
422*4882a593SmuzhiyunOf course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
423*4882a593SmuzhiyunEffectively this means we can eliminate the else clause from the if
424*4882a593Smuzhiyunstatements. Also we can optimise the calculation in the end a little bit
425*4882a593Smuzhiyunby going from long to byte first. Actually we can even avoid the table
426*4882a593Smuzhiyunlookups
427*4882a593Smuzhiyun
428*4882a593SmuzhiyunAttempt 3
429*4882a593Smuzhiyun=========
430*4882a593Smuzhiyun
431*4882a593SmuzhiyunOdd replaced::
432*4882a593Smuzhiyun
433*4882a593Smuzhiyun          if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
434*4882a593Smuzhiyun          if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
435*4882a593Smuzhiyun          if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
436*4882a593Smuzhiyun          if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
437*4882a593Smuzhiyun          if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
438*4882a593Smuzhiyun          if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
439*4882a593Smuzhiyun
440*4882a593Smuzhiyunwith::
441*4882a593Smuzhiyun
442*4882a593Smuzhiyun          if (i & 0x01) rp5 ^= cur;
443*4882a593Smuzhiyun          if (i & 0x02) rp7 ^= cur;
444*4882a593Smuzhiyun          if (i & 0x04) rp9 ^= cur;
445*4882a593Smuzhiyun          if (i & 0x08) rp11 ^= cur;
446*4882a593Smuzhiyun          if (i & 0x10) rp13 ^= cur;
447*4882a593Smuzhiyun          if (i & 0x20) rp15 ^= cur;
448*4882a593Smuzhiyun
449*4882a593Smuzhiyunand outside the loop added::
450*4882a593Smuzhiyun
451*4882a593Smuzhiyun          rp4  = par ^ rp5;
452*4882a593Smuzhiyun          rp6  = par ^ rp7;
453*4882a593Smuzhiyun          rp8  = par ^ rp9;
454*4882a593Smuzhiyun          rp10  = par ^ rp11;
455*4882a593Smuzhiyun          rp12  = par ^ rp13;
456*4882a593Smuzhiyun          rp14  = par ^ rp15;
457*4882a593Smuzhiyun
458*4882a593SmuzhiyunAnd after that the code takes about 30% more time, although the number of
459*4882a593Smuzhiyunstatements is reduced. This is also reflected in the assembly code.
460*4882a593Smuzhiyun
461*4882a593Smuzhiyun
462*4882a593SmuzhiyunAnalysis 3
463*4882a593Smuzhiyun==========
464*4882a593Smuzhiyun
465*4882a593SmuzhiyunVery weird. Guess it has to do with caching or instruction parallellism
466*4882a593Smuzhiyunor so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
467*4882a593Smuzhiyunobservation was that this one is only 30% slower (according to time)
468*4882a593Smuzhiyunexecuting the code as my 3Ghz D920 processor.
469*4882a593Smuzhiyun
470*4882a593SmuzhiyunWell, it was expected not to be easy so maybe instead move to a
471*4882a593Smuzhiyundifferent track: let's move back to the code from attempt2 and do some
472*4882a593Smuzhiyunloop unrolling. This will eliminate a few if statements. I'll try
473*4882a593Smuzhiyundifferent amounts of unrolling to see what works best.
474*4882a593Smuzhiyun
475*4882a593Smuzhiyun
476*4882a593SmuzhiyunAttempt 4
477*4882a593Smuzhiyun=========
478*4882a593Smuzhiyun
479*4882a593SmuzhiyunUnrolled the loop 1, 2, 3 and 4 times.
480*4882a593SmuzhiyunFor 4 the code starts with::
481*4882a593Smuzhiyun
482*4882a593Smuzhiyun    for (i = 0; i < 4; i++)
483*4882a593Smuzhiyun    {
484*4882a593Smuzhiyun        cur = *bp++;
485*4882a593Smuzhiyun        par ^= cur;
486*4882a593Smuzhiyun        rp4 ^= cur;
487*4882a593Smuzhiyun        rp6 ^= cur;
488*4882a593Smuzhiyun        rp8 ^= cur;
489*4882a593Smuzhiyun        rp10 ^= cur;
490*4882a593Smuzhiyun        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
491*4882a593Smuzhiyun        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
492*4882a593Smuzhiyun        cur = *bp++;
493*4882a593Smuzhiyun        par ^= cur;
494*4882a593Smuzhiyun        rp5 ^= cur;
495*4882a593Smuzhiyun        rp6 ^= cur;
496*4882a593Smuzhiyun        ...
497*4882a593Smuzhiyun
498*4882a593Smuzhiyun
499*4882a593SmuzhiyunAnalysis 4
500*4882a593Smuzhiyun==========
501*4882a593Smuzhiyun
502*4882a593SmuzhiyunUnrolling once gains about 15%
503*4882a593Smuzhiyun
504*4882a593SmuzhiyunUnrolling twice keeps the gain at about 15%
505*4882a593Smuzhiyun
506*4882a593SmuzhiyunUnrolling three times gives a gain of 30% compared to attempt 2.
507*4882a593Smuzhiyun
508*4882a593SmuzhiyunUnrolling four times gives a marginal improvement compared to unrolling
509*4882a593Smuzhiyunthree times.
510*4882a593Smuzhiyun
511*4882a593SmuzhiyunI decided to proceed with a four time unrolled loop anyway. It was my gut
512*4882a593Smuzhiyunfeeling that in the next steps I would obtain additional gain from it.
513*4882a593Smuzhiyun
514*4882a593SmuzhiyunThe next step was triggered by the fact that par contains the xor of all
515*4882a593Smuzhiyunbytes and rp4 and rp5 each contain the xor of half of the bytes.
516*4882a593SmuzhiyunSo in effect par = rp4 ^ rp5. But as xor is commutative we can also say
517*4882a593Smuzhiyunthat rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
518*4882a593Smuzhiyuneliminate rp5 (or rp4, but I already foresaw another optimisation).
519*4882a593SmuzhiyunThe same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
520*4882a593Smuzhiyun
521*4882a593Smuzhiyun
522*4882a593SmuzhiyunAttempt 5
523*4882a593Smuzhiyun=========
524*4882a593Smuzhiyun
525*4882a593SmuzhiyunEffectively so all odd digit rp assignments in the loop were removed.
526*4882a593SmuzhiyunThis included the else clause of the if statements.
527*4882a593SmuzhiyunOf course after the loop we need to correct things by adding code like::
528*4882a593Smuzhiyun
529*4882a593Smuzhiyun    rp5 = par ^ rp4;
530*4882a593Smuzhiyun
531*4882a593SmuzhiyunAlso the initial assignments (rp5 = 0; etc) could be removed.
532*4882a593SmuzhiyunAlong the line I also removed the initialisation of rp0/1/2/3.
533*4882a593Smuzhiyun
534*4882a593Smuzhiyun
535*4882a593SmuzhiyunAnalysis 5
536*4882a593Smuzhiyun==========
537*4882a593Smuzhiyun
538*4882a593SmuzhiyunMeasurements showed this was a good move. The run-time roughly halved
539*4882a593Smuzhiyuncompared with attempt 4 with 4 times unrolled, and we only require 1/3rd
540*4882a593Smuzhiyunof the processor time compared to the current code in the linux kernel.
541*4882a593Smuzhiyun
542*4882a593SmuzhiyunHowever, still I thought there was more. I didn't like all the if
543*4882a593Smuzhiyunstatements. Why not keep a running parity and only keep the last if
544*4882a593Smuzhiyunstatement. Time for yet another version!
545*4882a593Smuzhiyun
546*4882a593Smuzhiyun
547*4882a593SmuzhiyunAttempt 6
548*4882a593Smuzhiyun=========
549*4882a593Smuzhiyun
550*4882a593SmuzhiyunTHe code within the for loop was changed to::
551*4882a593Smuzhiyun
552*4882a593Smuzhiyun    for (i = 0; i < 4; i++)
553*4882a593Smuzhiyun    {
554*4882a593Smuzhiyun        cur = *bp++; tmppar  = cur; rp4 ^= cur;
555*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
556*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
557*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
558*4882a593Smuzhiyun
559*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
560*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
561*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
562*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
563*4882a593Smuzhiyun
564*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
565*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
566*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
567*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
568*4882a593Smuzhiyun
569*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
570*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
571*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
572*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur;
573*4882a593Smuzhiyun
574*4882a593Smuzhiyun        par ^= tmppar;
575*4882a593Smuzhiyun        if ((i & 0x1) == 0) rp12 ^= tmppar;
576*4882a593Smuzhiyun        if ((i & 0x2) == 0) rp14 ^= tmppar;
577*4882a593Smuzhiyun    }
578*4882a593Smuzhiyun
579*4882a593SmuzhiyunAs you can see tmppar is used to accumulate the parity within a for
580*4882a593Smuzhiyuniteration. In the last 3 statements is added to par and, if needed,
581*4882a593Smuzhiyunto rp12 and rp14.
582*4882a593Smuzhiyun
583*4882a593SmuzhiyunWhile making the changes I also found that I could exploit that tmppar
584*4882a593Smuzhiyuncontains the running parity for this iteration. So instead of having:
585*4882a593Smuzhiyunrp4 ^= cur; rp6 ^= cur;
586*4882a593SmuzhiyunI removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next
587*4882a593Smuzhiyunstatement. A similar change was done for rp8 and rp10
588*4882a593Smuzhiyun
589*4882a593Smuzhiyun
590*4882a593SmuzhiyunAnalysis 6
591*4882a593Smuzhiyun==========
592*4882a593Smuzhiyun
593*4882a593SmuzhiyunMeasuring this code again showed big gain. When executing the original
594*4882a593Smuzhiyunlinux code 1 million times, this took about 1 second on my system.
595*4882a593Smuzhiyun(using time to measure the performance). After this iteration I was back
596*4882a593Smuzhiyunto 0.075 sec. Actually I had to decide to start measuring over 10
597*4882a593Smuzhiyunmillion iterations in order not to lose too much accuracy. This one
598*4882a593Smuzhiyundefinitely seemed to be the jackpot!
599*4882a593Smuzhiyun
600*4882a593SmuzhiyunThere is a little bit more room for improvement though. There are three
601*4882a593Smuzhiyunplaces with statements::
602*4882a593Smuzhiyun
603*4882a593Smuzhiyun	rp4 ^= cur; rp6 ^= cur;
604*4882a593Smuzhiyun
605*4882a593SmuzhiyunIt seems more efficient to also maintain a variable rp4_6 in the while
606*4882a593Smuzhiyunloop; This eliminates 3 statements per loop. Of course after the loop we
607*4882a593Smuzhiyunneed to correct by adding::
608*4882a593Smuzhiyun
609*4882a593Smuzhiyun	rp4 ^= rp4_6;
610*4882a593Smuzhiyun	rp6 ^= rp4_6
611*4882a593Smuzhiyun
612*4882a593SmuzhiyunFurthermore there are 4 sequential assignments to rp8. This can be
613*4882a593Smuzhiyunencoded slightly more efficiently by saving tmppar before those 4 lines
614*4882a593Smuzhiyunand later do rp8 = rp8 ^ tmppar ^ notrp8;
615*4882a593Smuzhiyun(where notrp8 is the value of rp8 before those 4 lines).
616*4882a593SmuzhiyunAgain a use of the commutative property of xor.
617*4882a593SmuzhiyunTime for a new test!
618*4882a593Smuzhiyun
619*4882a593Smuzhiyun
620*4882a593SmuzhiyunAttempt 7
621*4882a593Smuzhiyun=========
622*4882a593Smuzhiyun
623*4882a593SmuzhiyunThe new code now looks like::
624*4882a593Smuzhiyun
625*4882a593Smuzhiyun    for (i = 0; i < 4; i++)
626*4882a593Smuzhiyun    {
627*4882a593Smuzhiyun        cur = *bp++; tmppar  = cur; rp4 ^= cur;
628*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
629*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
630*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
631*4882a593Smuzhiyun
632*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
633*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
634*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
635*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
636*4882a593Smuzhiyun
637*4882a593Smuzhiyun        notrp8 = tmppar;
638*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
639*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
640*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
641*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur;
642*4882a593Smuzhiyun        rp8 = rp8 ^ tmppar ^ notrp8;
643*4882a593Smuzhiyun
644*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
645*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
646*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
647*4882a593Smuzhiyun        cur = *bp++; tmppar ^= cur;
648*4882a593Smuzhiyun
649*4882a593Smuzhiyun        par ^= tmppar;
650*4882a593Smuzhiyun        if ((i & 0x1) == 0) rp12 ^= tmppar;
651*4882a593Smuzhiyun        if ((i & 0x2) == 0) rp14 ^= tmppar;
652*4882a593Smuzhiyun    }
653*4882a593Smuzhiyun    rp4 ^= rp4_6;
654*4882a593Smuzhiyun    rp6 ^= rp4_6;
655*4882a593Smuzhiyun
656*4882a593Smuzhiyun
657*4882a593SmuzhiyunNot a big change, but every penny counts :-)
658*4882a593Smuzhiyun
659*4882a593Smuzhiyun
660*4882a593SmuzhiyunAnalysis 7
661*4882a593Smuzhiyun==========
662*4882a593Smuzhiyun
663*4882a593SmuzhiyunActually this made things worse. Not very much, but I don't want to move
664*4882a593Smuzhiyuninto the wrong direction. Maybe something to investigate later. Could
665*4882a593Smuzhiyunhave to do with caching again.
666*4882a593Smuzhiyun
667*4882a593SmuzhiyunGuess that is what there is to win within the loop. Maybe unrolling one
668*4882a593Smuzhiyunmore time will help. I'll keep the optimisations from 7 for now.
669*4882a593Smuzhiyun
670*4882a593Smuzhiyun
671*4882a593SmuzhiyunAttempt 8
672*4882a593Smuzhiyun=========
673*4882a593Smuzhiyun
674*4882a593SmuzhiyunUnrolled the loop one more time.
675*4882a593Smuzhiyun
676*4882a593Smuzhiyun
677*4882a593SmuzhiyunAnalysis 8
678*4882a593Smuzhiyun==========
679*4882a593Smuzhiyun
680*4882a593SmuzhiyunThis makes things worse. Let's stick with attempt 6 and continue from there.
681*4882a593SmuzhiyunAlthough it seems that the code within the loop cannot be optimised
682*4882a593Smuzhiyunfurther there is still room to optimize the generation of the ecc codes.
683*4882a593SmuzhiyunWe can simply calculate the total parity. If this is 0 then rp4 = rp5
684*4882a593Smuzhiyunetc. If the parity is 1, then rp4 = !rp5;
685*4882a593Smuzhiyun
686*4882a593SmuzhiyunBut if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
687*4882a593Smuzhiyunin the result byte and then do something like::
688*4882a593Smuzhiyun
689*4882a593Smuzhiyun    code[0] |= (code[0] << 1);
690*4882a593Smuzhiyun
691*4882a593SmuzhiyunLets test this.
692*4882a593Smuzhiyun
693*4882a593Smuzhiyun
694*4882a593SmuzhiyunAttempt 9
695*4882a593Smuzhiyun=========
696*4882a593Smuzhiyun
697*4882a593SmuzhiyunChanged the code but again this slightly degrades performance. Tried all
698*4882a593Smuzhiyunkind of other things, like having dedicated parity arrays to avoid the
699*4882a593Smuzhiyunshift after parity[rp7] << 7; No gain.
700*4882a593SmuzhiyunChange the lookup using the parity array by using shift operators (e.g.
701*4882a593Smuzhiyunreplace parity[rp7] << 7 with::
702*4882a593Smuzhiyun
703*4882a593Smuzhiyun	rp7 ^= (rp7 << 4);
704*4882a593Smuzhiyun	rp7 ^= (rp7 << 2);
705*4882a593Smuzhiyun	rp7 ^= (rp7 << 1);
706*4882a593Smuzhiyun	rp7 &= 0x80;
707*4882a593Smuzhiyun
708*4882a593SmuzhiyunNo gain.
709*4882a593Smuzhiyun
710*4882a593SmuzhiyunThe only marginal change was inverting the parity bits, so we can remove
711*4882a593Smuzhiyunthe last three invert statements.
712*4882a593Smuzhiyun
713*4882a593SmuzhiyunAh well, pity this does not deliver more. Then again 10 million
714*4882a593Smuzhiyuniterations using the linux driver code takes between 13 and 13.5
715*4882a593Smuzhiyunseconds, whereas my code now takes about 0.73 seconds for those 10
716*4882a593Smuzhiyunmillion iterations. So basically I've improved the performance by a
717*4882a593Smuzhiyunfactor 18 on my system. Not that bad. Of course on different hardware
718*4882a593Smuzhiyunyou will get different results. No warranties!
719*4882a593Smuzhiyun
720*4882a593SmuzhiyunBut of course there is no such thing as a free lunch. The codesize almost
721*4882a593Smuzhiyuntripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
722*4882a593Smuzhiyun
723*4882a593Smuzhiyun
724*4882a593SmuzhiyunCorrecting errors
725*4882a593Smuzhiyun=================
726*4882a593Smuzhiyun
727*4882a593SmuzhiyunFor correcting errors I again used the ST application note as a starter,
728*4882a593Smuzhiyunbut I also peeked at the existing code.
729*4882a593Smuzhiyun
730*4882a593SmuzhiyunThe algorithm itself is pretty straightforward. Just xor the given and
731*4882a593Smuzhiyunthe calculated ecc. If all bytes are 0 there is no problem. If 11 bits
732*4882a593Smuzhiyunare 1 we have one correctable bit error. If there is 1 bit 1, we have an
733*4882a593Smuzhiyunerror in the given ecc code.
734*4882a593Smuzhiyun
735*4882a593SmuzhiyunIt proved to be fastest to do some table lookups. Performance gain
736*4882a593Smuzhiyunintroduced by this is about a factor 2 on my system when a repair had to
737*4882a593Smuzhiyunbe done, and 1% or so if no repair had to be done.
738*4882a593Smuzhiyun
739*4882a593SmuzhiyunCode size increased from 330 bytes to 686 bytes for this function.
740*4882a593Smuzhiyun(gcc 4.2, -O3)
741*4882a593Smuzhiyun
742*4882a593Smuzhiyun
743*4882a593SmuzhiyunConclusion
744*4882a593Smuzhiyun==========
745*4882a593Smuzhiyun
746*4882a593SmuzhiyunThe gain when calculating the ecc is tremendous. Om my development hardware
747*4882a593Smuzhiyuna speedup of a factor of 18 for ecc calculation was achieved. On a test on an
748*4882a593Smuzhiyunembedded system with a MIPS core a factor 7 was obtained.
749*4882a593Smuzhiyun
750*4882a593SmuzhiyunOn a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
751*4882a593Smuzhiyun5 (big endian mode, gcc 4.1.2, -O3)
752*4882a593Smuzhiyun
753*4882a593SmuzhiyunFor correction not much gain could be obtained (as bitflips are rare). Then
754*4882a593Smuzhiyunagain there are also much less cycles spent there.
755*4882a593Smuzhiyun
756*4882a593SmuzhiyunIt seems there is not much more gain possible in this, at least when
757*4882a593Smuzhiyunprogrammed in C. Of course it might be possible to squeeze something more
758*4882a593Smuzhiyunout of it with an assembler program, but due to pipeline behaviour etc
759*4882a593Smuzhiyunthis is very tricky (at least for intel hw).
760*4882a593Smuzhiyun
761*4882a593SmuzhiyunAuthor: Frans Meulenbroeks
762*4882a593Smuzhiyun
763*4882a593SmuzhiyunCopyright (C) 2008 Koninklijke Philips Electronics NV.
764