1*4882a593Smuzhiyun========================== 2*4882a593SmuzhiyunNAND Error-correction Code 3*4882a593Smuzhiyun========================== 4*4882a593Smuzhiyun 5*4882a593SmuzhiyunIntroduction 6*4882a593Smuzhiyun============ 7*4882a593Smuzhiyun 8*4882a593SmuzhiyunHaving looked at the linux mtd/nand driver and more specific at nand_ecc.c 9*4882a593SmuzhiyunI felt there was room for optimisation. I bashed the code for a few hours 10*4882a593Smuzhiyunperforming tricks like table lookup removing superfluous code etc. 11*4882a593SmuzhiyunAfter that the speed was increased by 35-40%. 12*4882a593SmuzhiyunStill I was not too happy as I felt there was additional room for improvement. 13*4882a593Smuzhiyun 14*4882a593SmuzhiyunBad! I was hooked. 15*4882a593SmuzhiyunI decided to annotate my steps in this file. Perhaps it is useful to someone 16*4882a593Smuzhiyunor someone learns something from it. 17*4882a593Smuzhiyun 18*4882a593Smuzhiyun 19*4882a593SmuzhiyunThe problem 20*4882a593Smuzhiyun=========== 21*4882a593Smuzhiyun 22*4882a593SmuzhiyunNAND flash (at least SLC one) typically has sectors of 256 bytes. 23*4882a593SmuzhiyunHowever NAND flash is not extremely reliable so some error detection 24*4882a593Smuzhiyun(and sometimes correction) is needed. 25*4882a593Smuzhiyun 26*4882a593SmuzhiyunThis is done by means of a Hamming code. I'll try to explain it in 27*4882a593Smuzhiyunlaymans terms (and apologies to all the pro's in the field in case I do 28*4882a593Smuzhiyunnot use the right terminology, my coding theory class was almost 30 29*4882a593Smuzhiyunyears ago, and I must admit it was not one of my favourites). 30*4882a593Smuzhiyun 31*4882a593SmuzhiyunAs I said before the ecc calculation is performed on sectors of 256 32*4882a593Smuzhiyunbytes. This is done by calculating several parity bits over the rows and 33*4882a593Smuzhiyuncolumns. The parity used is even parity which means that the parity bit = 1 34*4882a593Smuzhiyunif the data over which the parity is calculated is 1 and the parity bit = 0 35*4882a593Smuzhiyunif the data over which the parity is calculated is 0. So the total 36*4882a593Smuzhiyunnumber of bits over the data over which the parity is calculated + the 37*4882a593Smuzhiyunparity bit is even. (see wikipedia if you can't follow this). 38*4882a593SmuzhiyunParity is often calculated by means of an exclusive or operation, 39*4882a593Smuzhiyunsometimes also referred to as xor. In C the operator for xor is ^ 40*4882a593Smuzhiyun 41*4882a593SmuzhiyunBack to ecc. 42*4882a593SmuzhiyunLet's give a small figure: 43*4882a593Smuzhiyun 44*4882a593Smuzhiyun========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== 45*4882a593Smuzhiyunbyte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 46*4882a593Smuzhiyunbyte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 47*4882a593Smuzhiyunbyte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 48*4882a593Smuzhiyunbyte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 49*4882a593Smuzhiyunbyte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 50*4882a593Smuzhiyun... 51*4882a593Smuzhiyunbyte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 52*4882a593Smuzhiyunbyte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 53*4882a593Smuzhiyun cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 54*4882a593Smuzhiyun cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 55*4882a593Smuzhiyun cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 56*4882a593Smuzhiyun========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== 57*4882a593Smuzhiyun 58*4882a593SmuzhiyunThis figure represents a sector of 256 bytes. 59*4882a593Smuzhiyuncp is my abbreviation for column parity, rp for row parity. 60*4882a593Smuzhiyun 61*4882a593SmuzhiyunLet's start to explain column parity. 62*4882a593Smuzhiyun 63*4882a593Smuzhiyun- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. 64*4882a593Smuzhiyun 65*4882a593Smuzhiyun so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. 66*4882a593Smuzhiyun 67*4882a593SmuzhiyunSimilarly cp1 is the sum of all bit1, bit3, bit5 and bit7. 68*4882a593Smuzhiyun 69*4882a593Smuzhiyun- cp2 is the parity over bit0, bit1, bit4 and bit5 70*4882a593Smuzhiyun- cp3 is the parity over bit2, bit3, bit6 and bit7. 71*4882a593Smuzhiyun- cp4 is the parity over bit0, bit1, bit2 and bit3. 72*4882a593Smuzhiyun- cp5 is the parity over bit4, bit5, bit6 and bit7. 73*4882a593Smuzhiyun 74*4882a593SmuzhiyunNote that each of cp0 .. cp5 is exactly one bit. 75*4882a593Smuzhiyun 76*4882a593SmuzhiyunRow parity actually works almost the same. 77*4882a593Smuzhiyun 78*4882a593Smuzhiyun- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) 79*4882a593Smuzhiyun- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) 80*4882a593Smuzhiyun- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... 81*4882a593Smuzhiyun (so handle two bytes, then skip 2 bytes). 82*4882a593Smuzhiyun- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) 83*4882a593Smuzhiyun- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. 84*4882a593Smuzhiyun 85*4882a593Smuzhiyun so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) 86*4882a593Smuzhiyun- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. 87*4882a593Smuzhiyun 88*4882a593SmuzhiyunThe story now becomes quite boring. I guess you get the idea. 89*4882a593Smuzhiyun 90*4882a593Smuzhiyun- rp6 covers 8 bytes then skips 8 etc 91*4882a593Smuzhiyun- rp7 skips 8 bytes then covers 8 etc 92*4882a593Smuzhiyun- rp8 covers 16 bytes then skips 16 etc 93*4882a593Smuzhiyun- rp9 skips 16 bytes then covers 16 etc 94*4882a593Smuzhiyun- rp10 covers 32 bytes then skips 32 etc 95*4882a593Smuzhiyun- rp11 skips 32 bytes then covers 32 etc 96*4882a593Smuzhiyun- rp12 covers 64 bytes then skips 64 etc 97*4882a593Smuzhiyun- rp13 skips 64 bytes then covers 64 etc 98*4882a593Smuzhiyun- rp14 covers 128 bytes then skips 128 99*4882a593Smuzhiyun- rp15 skips 128 bytes then covers 128 100*4882a593Smuzhiyun 101*4882a593SmuzhiyunIn the end the parity bits are grouped together in three bytes as 102*4882a593Smuzhiyunfollows: 103*4882a593Smuzhiyun 104*4882a593Smuzhiyun===== ===== ===== ===== ===== ===== ===== ===== ===== 105*4882a593SmuzhiyunECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 106*4882a593Smuzhiyun===== ===== ===== ===== ===== ===== ===== ===== ===== 107*4882a593SmuzhiyunECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 108*4882a593SmuzhiyunECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 109*4882a593SmuzhiyunECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 110*4882a593Smuzhiyun===== ===== ===== ===== ===== ===== ===== ===== ===== 111*4882a593Smuzhiyun 112*4882a593SmuzhiyunI detected after writing this that ST application note AN1823 113*4882a593Smuzhiyun(http://www.st.com/stonline/) gives a much 114*4882a593Smuzhiyunnicer picture.(but they use line parity as term where I use row parity) 115*4882a593SmuzhiyunOh well, I'm graphically challenged, so suffer with me for a moment :-) 116*4882a593Smuzhiyun 117*4882a593SmuzhiyunAnd I could not reuse the ST picture anyway for copyright reasons. 118*4882a593Smuzhiyun 119*4882a593Smuzhiyun 120*4882a593SmuzhiyunAttempt 0 121*4882a593Smuzhiyun========= 122*4882a593Smuzhiyun 123*4882a593SmuzhiyunImplementing the parity calculation is pretty simple. 124*4882a593SmuzhiyunIn C pseudocode:: 125*4882a593Smuzhiyun 126*4882a593Smuzhiyun for (i = 0; i < 256; i++) 127*4882a593Smuzhiyun { 128*4882a593Smuzhiyun if (i & 0x01) 129*4882a593Smuzhiyun rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; 130*4882a593Smuzhiyun else 131*4882a593Smuzhiyun rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; 132*4882a593Smuzhiyun if (i & 0x02) 133*4882a593Smuzhiyun rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; 134*4882a593Smuzhiyun else 135*4882a593Smuzhiyun rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; 136*4882a593Smuzhiyun if (i & 0x04) 137*4882a593Smuzhiyun rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; 138*4882a593Smuzhiyun else 139*4882a593Smuzhiyun rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; 140*4882a593Smuzhiyun if (i & 0x08) 141*4882a593Smuzhiyun rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; 142*4882a593Smuzhiyun else 143*4882a593Smuzhiyun rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; 144*4882a593Smuzhiyun if (i & 0x10) 145*4882a593Smuzhiyun rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; 146*4882a593Smuzhiyun else 147*4882a593Smuzhiyun rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; 148*4882a593Smuzhiyun if (i & 0x20) 149*4882a593Smuzhiyun rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; 150*4882a593Smuzhiyun else 151*4882a593Smuzhiyun rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; 152*4882a593Smuzhiyun if (i & 0x40) 153*4882a593Smuzhiyun rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; 154*4882a593Smuzhiyun else 155*4882a593Smuzhiyun rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; 156*4882a593Smuzhiyun if (i & 0x80) 157*4882a593Smuzhiyun rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; 158*4882a593Smuzhiyun else 159*4882a593Smuzhiyun rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; 160*4882a593Smuzhiyun cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; 161*4882a593Smuzhiyun cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; 162*4882a593Smuzhiyun cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; 163*4882a593Smuzhiyun cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 164*4882a593Smuzhiyun cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 165*4882a593Smuzhiyun cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 166*4882a593Smuzhiyun } 167*4882a593Smuzhiyun 168*4882a593Smuzhiyun 169*4882a593SmuzhiyunAnalysis 0 170*4882a593Smuzhiyun========== 171*4882a593Smuzhiyun 172*4882a593SmuzhiyunC does have bitwise operators but not really operators to do the above 173*4882a593Smuzhiyunefficiently (and most hardware has no such instructions either). 174*4882a593SmuzhiyunTherefore without implementing this it was clear that the code above was 175*4882a593Smuzhiyunnot going to bring me a Nobel prize :-) 176*4882a593Smuzhiyun 177*4882a593SmuzhiyunFortunately the exclusive or operation is commutative, so we can combine 178*4882a593Smuzhiyunthe values in any order. So instead of calculating all the bits 179*4882a593Smuzhiyunindividually, let us try to rearrange things. 180*4882a593SmuzhiyunFor the column parity this is easy. We can just xor the bytes and in the 181*4882a593Smuzhiyunend filter out the relevant bits. This is pretty nice as it will bring 182*4882a593Smuzhiyunall cp calculation out of the for loop. 183*4882a593Smuzhiyun 184*4882a593SmuzhiyunSimilarly we can first xor the bytes for the various rows. 185*4882a593SmuzhiyunThis leads to: 186*4882a593Smuzhiyun 187*4882a593Smuzhiyun 188*4882a593SmuzhiyunAttempt 1 189*4882a593Smuzhiyun========= 190*4882a593Smuzhiyun 191*4882a593Smuzhiyun:: 192*4882a593Smuzhiyun 193*4882a593Smuzhiyun const char parity[256] = { 194*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 195*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 196*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 197*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 198*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 199*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 200*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 201*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 202*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 203*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 204*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 205*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 206*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 207*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 208*4882a593Smuzhiyun 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 209*4882a593Smuzhiyun 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 210*4882a593Smuzhiyun }; 211*4882a593Smuzhiyun 212*4882a593Smuzhiyun void ecc1(const unsigned char *buf, unsigned char *code) 213*4882a593Smuzhiyun { 214*4882a593Smuzhiyun int i; 215*4882a593Smuzhiyun const unsigned char *bp = buf; 216*4882a593Smuzhiyun unsigned char cur; 217*4882a593Smuzhiyun unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; 218*4882a593Smuzhiyun unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; 219*4882a593Smuzhiyun unsigned char par; 220*4882a593Smuzhiyun 221*4882a593Smuzhiyun par = 0; 222*4882a593Smuzhiyun rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; 223*4882a593Smuzhiyun rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; 224*4882a593Smuzhiyun rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; 225*4882a593Smuzhiyun rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; 226*4882a593Smuzhiyun 227*4882a593Smuzhiyun for (i = 0; i < 256; i++) 228*4882a593Smuzhiyun { 229*4882a593Smuzhiyun cur = *bp++; 230*4882a593Smuzhiyun par ^= cur; 231*4882a593Smuzhiyun if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; 232*4882a593Smuzhiyun if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; 233*4882a593Smuzhiyun if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; 234*4882a593Smuzhiyun if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; 235*4882a593Smuzhiyun if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; 236*4882a593Smuzhiyun if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; 237*4882a593Smuzhiyun if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; 238*4882a593Smuzhiyun if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; 239*4882a593Smuzhiyun } 240*4882a593Smuzhiyun code[0] = 241*4882a593Smuzhiyun (parity[rp7] << 7) | 242*4882a593Smuzhiyun (parity[rp6] << 6) | 243*4882a593Smuzhiyun (parity[rp5] << 5) | 244*4882a593Smuzhiyun (parity[rp4] << 4) | 245*4882a593Smuzhiyun (parity[rp3] << 3) | 246*4882a593Smuzhiyun (parity[rp2] << 2) | 247*4882a593Smuzhiyun (parity[rp1] << 1) | 248*4882a593Smuzhiyun (parity[rp0]); 249*4882a593Smuzhiyun code[1] = 250*4882a593Smuzhiyun (parity[rp15] << 7) | 251*4882a593Smuzhiyun (parity[rp14] << 6) | 252*4882a593Smuzhiyun (parity[rp13] << 5) | 253*4882a593Smuzhiyun (parity[rp12] << 4) | 254*4882a593Smuzhiyun (parity[rp11] << 3) | 255*4882a593Smuzhiyun (parity[rp10] << 2) | 256*4882a593Smuzhiyun (parity[rp9] << 1) | 257*4882a593Smuzhiyun (parity[rp8]); 258*4882a593Smuzhiyun code[2] = 259*4882a593Smuzhiyun (parity[par & 0xf0] << 7) | 260*4882a593Smuzhiyun (parity[par & 0x0f] << 6) | 261*4882a593Smuzhiyun (parity[par & 0xcc] << 5) | 262*4882a593Smuzhiyun (parity[par & 0x33] << 4) | 263*4882a593Smuzhiyun (parity[par & 0xaa] << 3) | 264*4882a593Smuzhiyun (parity[par & 0x55] << 2); 265*4882a593Smuzhiyun code[0] = ~code[0]; 266*4882a593Smuzhiyun code[1] = ~code[1]; 267*4882a593Smuzhiyun code[2] = ~code[2]; 268*4882a593Smuzhiyun } 269*4882a593Smuzhiyun 270*4882a593SmuzhiyunStill pretty straightforward. The last three invert statements are there to 271*4882a593Smuzhiyungive a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash 272*4882a593Smuzhiyunall data is 0xff, so the checksum then matches. 273*4882a593Smuzhiyun 274*4882a593SmuzhiyunI also introduced the parity lookup. I expected this to be the fastest 275*4882a593Smuzhiyunway to calculate the parity, but I will investigate alternatives later 276*4882a593Smuzhiyunon. 277*4882a593Smuzhiyun 278*4882a593Smuzhiyun 279*4882a593SmuzhiyunAnalysis 1 280*4882a593Smuzhiyun========== 281*4882a593Smuzhiyun 282*4882a593SmuzhiyunThe code works, but is not terribly efficient. On my system it took 283*4882a593Smuzhiyunalmost 4 times as much time as the linux driver code. But hey, if it was 284*4882a593Smuzhiyun*that* easy this would have been done long before. 285*4882a593SmuzhiyunNo pain. no gain. 286*4882a593Smuzhiyun 287*4882a593SmuzhiyunFortunately there is plenty of room for improvement. 288*4882a593Smuzhiyun 289*4882a593SmuzhiyunIn step 1 we moved from bit-wise calculation to byte-wise calculation. 290*4882a593SmuzhiyunHowever in C we can also use the unsigned long data type and virtually 291*4882a593Smuzhiyunevery modern microprocessor supports 32 bit operations, so why not try 292*4882a593Smuzhiyunto write our code in such a way that we process data in 32 bit chunks. 293*4882a593Smuzhiyun 294*4882a593SmuzhiyunOf course this means some modification as the row parity is byte by 295*4882a593Smuzhiyunbyte. A quick analysis: 296*4882a593Smuzhiyunfor the column parity we use the par variable. When extending to 32 bits 297*4882a593Smuzhiyunwe can in the end easily calculate rp0 and rp1 from it. 298*4882a593Smuzhiyun(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 299*4882a593Smuzhiyunrespectively, from MSB to LSB) 300*4882a593Smuzhiyunalso rp2 and rp3 can be easily retrieved from par as rp3 covers the 301*4882a593Smuzhiyunfirst two MSBs and rp2 covers the last two LSBs. 302*4882a593Smuzhiyun 303*4882a593SmuzhiyunNote that of course now the loop is executed only 64 times (256/4). 304*4882a593SmuzhiyunAnd note that care must taken wrt byte ordering. The way bytes are 305*4882a593Smuzhiyunordered in a long is machine dependent, and might affect us. 306*4882a593SmuzhiyunAnyway, if there is an issue: this code is developed on x86 (to be 307*4882a593Smuzhiyunprecise: a DELL PC with a D920 Intel CPU) 308*4882a593Smuzhiyun 309*4882a593SmuzhiyunAnd of course the performance might depend on alignment, but I expect 310*4882a593Smuzhiyunthat the I/O buffers in the nand driver are aligned properly (and 311*4882a593Smuzhiyunotherwise that should be fixed to get maximum performance). 312*4882a593Smuzhiyun 313*4882a593SmuzhiyunLet's give it a try... 314*4882a593Smuzhiyun 315*4882a593Smuzhiyun 316*4882a593SmuzhiyunAttempt 2 317*4882a593Smuzhiyun========= 318*4882a593Smuzhiyun 319*4882a593Smuzhiyun:: 320*4882a593Smuzhiyun 321*4882a593Smuzhiyun extern const char parity[256]; 322*4882a593Smuzhiyun 323*4882a593Smuzhiyun void ecc2(const unsigned char *buf, unsigned char *code) 324*4882a593Smuzhiyun { 325*4882a593Smuzhiyun int i; 326*4882a593Smuzhiyun const unsigned long *bp = (unsigned long *)buf; 327*4882a593Smuzhiyun unsigned long cur; 328*4882a593Smuzhiyun unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; 329*4882a593Smuzhiyun unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; 330*4882a593Smuzhiyun unsigned long par; 331*4882a593Smuzhiyun 332*4882a593Smuzhiyun par = 0; 333*4882a593Smuzhiyun rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; 334*4882a593Smuzhiyun rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; 335*4882a593Smuzhiyun rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; 336*4882a593Smuzhiyun rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; 337*4882a593Smuzhiyun 338*4882a593Smuzhiyun for (i = 0; i < 64; i++) 339*4882a593Smuzhiyun { 340*4882a593Smuzhiyun cur = *bp++; 341*4882a593Smuzhiyun par ^= cur; 342*4882a593Smuzhiyun if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; 343*4882a593Smuzhiyun if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; 344*4882a593Smuzhiyun if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; 345*4882a593Smuzhiyun if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; 346*4882a593Smuzhiyun if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; 347*4882a593Smuzhiyun if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; 348*4882a593Smuzhiyun } 349*4882a593Smuzhiyun /* 350*4882a593Smuzhiyun we need to adapt the code generation for the fact that rp vars are now 351*4882a593Smuzhiyun long; also the column parity calculation needs to be changed. 352*4882a593Smuzhiyun we'll bring rp4 to 15 back to single byte entities by shifting and 353*4882a593Smuzhiyun xoring 354*4882a593Smuzhiyun */ 355*4882a593Smuzhiyun rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; 356*4882a593Smuzhiyun rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; 357*4882a593Smuzhiyun rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; 358*4882a593Smuzhiyun rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; 359*4882a593Smuzhiyun rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; 360*4882a593Smuzhiyun rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; 361*4882a593Smuzhiyun rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; 362*4882a593Smuzhiyun rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; 363*4882a593Smuzhiyun rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; 364*4882a593Smuzhiyun rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; 365*4882a593Smuzhiyun rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; 366*4882a593Smuzhiyun rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; 367*4882a593Smuzhiyun rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; 368*4882a593Smuzhiyun rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; 369*4882a593Smuzhiyun par ^= (par >> 16); 370*4882a593Smuzhiyun rp1 = (par >> 8); rp1 &= 0xff; 371*4882a593Smuzhiyun rp0 = (par & 0xff); 372*4882a593Smuzhiyun par ^= (par >> 8); par &= 0xff; 373*4882a593Smuzhiyun 374*4882a593Smuzhiyun code[0] = 375*4882a593Smuzhiyun (parity[rp7] << 7) | 376*4882a593Smuzhiyun (parity[rp6] << 6) | 377*4882a593Smuzhiyun (parity[rp5] << 5) | 378*4882a593Smuzhiyun (parity[rp4] << 4) | 379*4882a593Smuzhiyun (parity[rp3] << 3) | 380*4882a593Smuzhiyun (parity[rp2] << 2) | 381*4882a593Smuzhiyun (parity[rp1] << 1) | 382*4882a593Smuzhiyun (parity[rp0]); 383*4882a593Smuzhiyun code[1] = 384*4882a593Smuzhiyun (parity[rp15] << 7) | 385*4882a593Smuzhiyun (parity[rp14] << 6) | 386*4882a593Smuzhiyun (parity[rp13] << 5) | 387*4882a593Smuzhiyun (parity[rp12] << 4) | 388*4882a593Smuzhiyun (parity[rp11] << 3) | 389*4882a593Smuzhiyun (parity[rp10] << 2) | 390*4882a593Smuzhiyun (parity[rp9] << 1) | 391*4882a593Smuzhiyun (parity[rp8]); 392*4882a593Smuzhiyun code[2] = 393*4882a593Smuzhiyun (parity[par & 0xf0] << 7) | 394*4882a593Smuzhiyun (parity[par & 0x0f] << 6) | 395*4882a593Smuzhiyun (parity[par & 0xcc] << 5) | 396*4882a593Smuzhiyun (parity[par & 0x33] << 4) | 397*4882a593Smuzhiyun (parity[par & 0xaa] << 3) | 398*4882a593Smuzhiyun (parity[par & 0x55] << 2); 399*4882a593Smuzhiyun code[0] = ~code[0]; 400*4882a593Smuzhiyun code[1] = ~code[1]; 401*4882a593Smuzhiyun code[2] = ~code[2]; 402*4882a593Smuzhiyun } 403*4882a593Smuzhiyun 404*4882a593SmuzhiyunThe parity array is not shown any more. Note also that for these 405*4882a593Smuzhiyunexamples I kinda deviated from my regular programming style by allowing 406*4882a593Smuzhiyunmultiple statements on a line, not using { } in then and else blocks 407*4882a593Smuzhiyunwith only a single statement and by using operators like ^= 408*4882a593Smuzhiyun 409*4882a593Smuzhiyun 410*4882a593SmuzhiyunAnalysis 2 411*4882a593Smuzhiyun========== 412*4882a593Smuzhiyun 413*4882a593SmuzhiyunThe code (of course) works, and hurray: we are a little bit faster than 414*4882a593Smuzhiyunthe linux driver code (about 15%). But wait, don't cheer too quickly. 415*4882a593SmuzhiyunThere is more to be gained. 416*4882a593SmuzhiyunIf we look at e.g. rp14 and rp15 we see that we either xor our data with 417*4882a593Smuzhiyunrp14 or with rp15. However we also have par which goes over all data. 418*4882a593SmuzhiyunThis means there is no need to calculate rp14 as it can be calculated from 419*4882a593Smuzhiyunrp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15; 420*4882a593Smuzhiyun(or if desired we can avoid calculating rp15 and calculate it from 421*4882a593Smuzhiyunrp14). That is why some places refer to inverse parity. 422*4882a593SmuzhiyunOf course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. 423*4882a593SmuzhiyunEffectively this means we can eliminate the else clause from the if 424*4882a593Smuzhiyunstatements. Also we can optimise the calculation in the end a little bit 425*4882a593Smuzhiyunby going from long to byte first. Actually we can even avoid the table 426*4882a593Smuzhiyunlookups 427*4882a593Smuzhiyun 428*4882a593SmuzhiyunAttempt 3 429*4882a593Smuzhiyun========= 430*4882a593Smuzhiyun 431*4882a593SmuzhiyunOdd replaced:: 432*4882a593Smuzhiyun 433*4882a593Smuzhiyun if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; 434*4882a593Smuzhiyun if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; 435*4882a593Smuzhiyun if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; 436*4882a593Smuzhiyun if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; 437*4882a593Smuzhiyun if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; 438*4882a593Smuzhiyun if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; 439*4882a593Smuzhiyun 440*4882a593Smuzhiyunwith:: 441*4882a593Smuzhiyun 442*4882a593Smuzhiyun if (i & 0x01) rp5 ^= cur; 443*4882a593Smuzhiyun if (i & 0x02) rp7 ^= cur; 444*4882a593Smuzhiyun if (i & 0x04) rp9 ^= cur; 445*4882a593Smuzhiyun if (i & 0x08) rp11 ^= cur; 446*4882a593Smuzhiyun if (i & 0x10) rp13 ^= cur; 447*4882a593Smuzhiyun if (i & 0x20) rp15 ^= cur; 448*4882a593Smuzhiyun 449*4882a593Smuzhiyunand outside the loop added:: 450*4882a593Smuzhiyun 451*4882a593Smuzhiyun rp4 = par ^ rp5; 452*4882a593Smuzhiyun rp6 = par ^ rp7; 453*4882a593Smuzhiyun rp8 = par ^ rp9; 454*4882a593Smuzhiyun rp10 = par ^ rp11; 455*4882a593Smuzhiyun rp12 = par ^ rp13; 456*4882a593Smuzhiyun rp14 = par ^ rp15; 457*4882a593Smuzhiyun 458*4882a593SmuzhiyunAnd after that the code takes about 30% more time, although the number of 459*4882a593Smuzhiyunstatements is reduced. This is also reflected in the assembly code. 460*4882a593Smuzhiyun 461*4882a593Smuzhiyun 462*4882a593SmuzhiyunAnalysis 3 463*4882a593Smuzhiyun========== 464*4882a593Smuzhiyun 465*4882a593SmuzhiyunVery weird. Guess it has to do with caching or instruction parallellism 466*4882a593Smuzhiyunor so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting 467*4882a593Smuzhiyunobservation was that this one is only 30% slower (according to time) 468*4882a593Smuzhiyunexecuting the code as my 3Ghz D920 processor. 469*4882a593Smuzhiyun 470*4882a593SmuzhiyunWell, it was expected not to be easy so maybe instead move to a 471*4882a593Smuzhiyundifferent track: let's move back to the code from attempt2 and do some 472*4882a593Smuzhiyunloop unrolling. This will eliminate a few if statements. I'll try 473*4882a593Smuzhiyundifferent amounts of unrolling to see what works best. 474*4882a593Smuzhiyun 475*4882a593Smuzhiyun 476*4882a593SmuzhiyunAttempt 4 477*4882a593Smuzhiyun========= 478*4882a593Smuzhiyun 479*4882a593SmuzhiyunUnrolled the loop 1, 2, 3 and 4 times. 480*4882a593SmuzhiyunFor 4 the code starts with:: 481*4882a593Smuzhiyun 482*4882a593Smuzhiyun for (i = 0; i < 4; i++) 483*4882a593Smuzhiyun { 484*4882a593Smuzhiyun cur = *bp++; 485*4882a593Smuzhiyun par ^= cur; 486*4882a593Smuzhiyun rp4 ^= cur; 487*4882a593Smuzhiyun rp6 ^= cur; 488*4882a593Smuzhiyun rp8 ^= cur; 489*4882a593Smuzhiyun rp10 ^= cur; 490*4882a593Smuzhiyun if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; 491*4882a593Smuzhiyun if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; 492*4882a593Smuzhiyun cur = *bp++; 493*4882a593Smuzhiyun par ^= cur; 494*4882a593Smuzhiyun rp5 ^= cur; 495*4882a593Smuzhiyun rp6 ^= cur; 496*4882a593Smuzhiyun ... 497*4882a593Smuzhiyun 498*4882a593Smuzhiyun 499*4882a593SmuzhiyunAnalysis 4 500*4882a593Smuzhiyun========== 501*4882a593Smuzhiyun 502*4882a593SmuzhiyunUnrolling once gains about 15% 503*4882a593Smuzhiyun 504*4882a593SmuzhiyunUnrolling twice keeps the gain at about 15% 505*4882a593Smuzhiyun 506*4882a593SmuzhiyunUnrolling three times gives a gain of 30% compared to attempt 2. 507*4882a593Smuzhiyun 508*4882a593SmuzhiyunUnrolling four times gives a marginal improvement compared to unrolling 509*4882a593Smuzhiyunthree times. 510*4882a593Smuzhiyun 511*4882a593SmuzhiyunI decided to proceed with a four time unrolled loop anyway. It was my gut 512*4882a593Smuzhiyunfeeling that in the next steps I would obtain additional gain from it. 513*4882a593Smuzhiyun 514*4882a593SmuzhiyunThe next step was triggered by the fact that par contains the xor of all 515*4882a593Smuzhiyunbytes and rp4 and rp5 each contain the xor of half of the bytes. 516*4882a593SmuzhiyunSo in effect par = rp4 ^ rp5. But as xor is commutative we can also say 517*4882a593Smuzhiyunthat rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can 518*4882a593Smuzhiyuneliminate rp5 (or rp4, but I already foresaw another optimisation). 519*4882a593SmuzhiyunThe same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. 520*4882a593Smuzhiyun 521*4882a593Smuzhiyun 522*4882a593SmuzhiyunAttempt 5 523*4882a593Smuzhiyun========= 524*4882a593Smuzhiyun 525*4882a593SmuzhiyunEffectively so all odd digit rp assignments in the loop were removed. 526*4882a593SmuzhiyunThis included the else clause of the if statements. 527*4882a593SmuzhiyunOf course after the loop we need to correct things by adding code like:: 528*4882a593Smuzhiyun 529*4882a593Smuzhiyun rp5 = par ^ rp4; 530*4882a593Smuzhiyun 531*4882a593SmuzhiyunAlso the initial assignments (rp5 = 0; etc) could be removed. 532*4882a593SmuzhiyunAlong the line I also removed the initialisation of rp0/1/2/3. 533*4882a593Smuzhiyun 534*4882a593Smuzhiyun 535*4882a593SmuzhiyunAnalysis 5 536*4882a593Smuzhiyun========== 537*4882a593Smuzhiyun 538*4882a593SmuzhiyunMeasurements showed this was a good move. The run-time roughly halved 539*4882a593Smuzhiyuncompared with attempt 4 with 4 times unrolled, and we only require 1/3rd 540*4882a593Smuzhiyunof the processor time compared to the current code in the linux kernel. 541*4882a593Smuzhiyun 542*4882a593SmuzhiyunHowever, still I thought there was more. I didn't like all the if 543*4882a593Smuzhiyunstatements. Why not keep a running parity and only keep the last if 544*4882a593Smuzhiyunstatement. Time for yet another version! 545*4882a593Smuzhiyun 546*4882a593Smuzhiyun 547*4882a593SmuzhiyunAttempt 6 548*4882a593Smuzhiyun========= 549*4882a593Smuzhiyun 550*4882a593SmuzhiyunTHe code within the for loop was changed to:: 551*4882a593Smuzhiyun 552*4882a593Smuzhiyun for (i = 0; i < 4; i++) 553*4882a593Smuzhiyun { 554*4882a593Smuzhiyun cur = *bp++; tmppar = cur; rp4 ^= cur; 555*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; 556*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 557*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; 558*4882a593Smuzhiyun 559*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; 560*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= cur; 561*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 562*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; 563*4882a593Smuzhiyun 564*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; 565*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; 566*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; 567*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp8 ^= cur; 568*4882a593Smuzhiyun 569*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; 570*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= cur; 571*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 572*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; 573*4882a593Smuzhiyun 574*4882a593Smuzhiyun par ^= tmppar; 575*4882a593Smuzhiyun if ((i & 0x1) == 0) rp12 ^= tmppar; 576*4882a593Smuzhiyun if ((i & 0x2) == 0) rp14 ^= tmppar; 577*4882a593Smuzhiyun } 578*4882a593Smuzhiyun 579*4882a593SmuzhiyunAs you can see tmppar is used to accumulate the parity within a for 580*4882a593Smuzhiyuniteration. In the last 3 statements is added to par and, if needed, 581*4882a593Smuzhiyunto rp12 and rp14. 582*4882a593Smuzhiyun 583*4882a593SmuzhiyunWhile making the changes I also found that I could exploit that tmppar 584*4882a593Smuzhiyuncontains the running parity for this iteration. So instead of having: 585*4882a593Smuzhiyunrp4 ^= cur; rp6 ^= cur; 586*4882a593SmuzhiyunI removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next 587*4882a593Smuzhiyunstatement. A similar change was done for rp8 and rp10 588*4882a593Smuzhiyun 589*4882a593Smuzhiyun 590*4882a593SmuzhiyunAnalysis 6 591*4882a593Smuzhiyun========== 592*4882a593Smuzhiyun 593*4882a593SmuzhiyunMeasuring this code again showed big gain. When executing the original 594*4882a593Smuzhiyunlinux code 1 million times, this took about 1 second on my system. 595*4882a593Smuzhiyun(using time to measure the performance). After this iteration I was back 596*4882a593Smuzhiyunto 0.075 sec. Actually I had to decide to start measuring over 10 597*4882a593Smuzhiyunmillion iterations in order not to lose too much accuracy. This one 598*4882a593Smuzhiyundefinitely seemed to be the jackpot! 599*4882a593Smuzhiyun 600*4882a593SmuzhiyunThere is a little bit more room for improvement though. There are three 601*4882a593Smuzhiyunplaces with statements:: 602*4882a593Smuzhiyun 603*4882a593Smuzhiyun rp4 ^= cur; rp6 ^= cur; 604*4882a593Smuzhiyun 605*4882a593SmuzhiyunIt seems more efficient to also maintain a variable rp4_6 in the while 606*4882a593Smuzhiyunloop; This eliminates 3 statements per loop. Of course after the loop we 607*4882a593Smuzhiyunneed to correct by adding:: 608*4882a593Smuzhiyun 609*4882a593Smuzhiyun rp4 ^= rp4_6; 610*4882a593Smuzhiyun rp6 ^= rp4_6 611*4882a593Smuzhiyun 612*4882a593SmuzhiyunFurthermore there are 4 sequential assignments to rp8. This can be 613*4882a593Smuzhiyunencoded slightly more efficiently by saving tmppar before those 4 lines 614*4882a593Smuzhiyunand later do rp8 = rp8 ^ tmppar ^ notrp8; 615*4882a593Smuzhiyun(where notrp8 is the value of rp8 before those 4 lines). 616*4882a593SmuzhiyunAgain a use of the commutative property of xor. 617*4882a593SmuzhiyunTime for a new test! 618*4882a593Smuzhiyun 619*4882a593Smuzhiyun 620*4882a593SmuzhiyunAttempt 7 621*4882a593Smuzhiyun========= 622*4882a593Smuzhiyun 623*4882a593SmuzhiyunThe new code now looks like:: 624*4882a593Smuzhiyun 625*4882a593Smuzhiyun for (i = 0; i < 4; i++) 626*4882a593Smuzhiyun { 627*4882a593Smuzhiyun cur = *bp++; tmppar = cur; rp4 ^= cur; 628*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; 629*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 630*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; 631*4882a593Smuzhiyun 632*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; 633*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= cur; 634*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 635*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; 636*4882a593Smuzhiyun 637*4882a593Smuzhiyun notrp8 = tmppar; 638*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; 639*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= cur; 640*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 641*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; 642*4882a593Smuzhiyun rp8 = rp8 ^ tmppar ^ notrp8; 643*4882a593Smuzhiyun 644*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; 645*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp6 ^= cur; 646*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; rp4 ^= cur; 647*4882a593Smuzhiyun cur = *bp++; tmppar ^= cur; 648*4882a593Smuzhiyun 649*4882a593Smuzhiyun par ^= tmppar; 650*4882a593Smuzhiyun if ((i & 0x1) == 0) rp12 ^= tmppar; 651*4882a593Smuzhiyun if ((i & 0x2) == 0) rp14 ^= tmppar; 652*4882a593Smuzhiyun } 653*4882a593Smuzhiyun rp4 ^= rp4_6; 654*4882a593Smuzhiyun rp6 ^= rp4_6; 655*4882a593Smuzhiyun 656*4882a593Smuzhiyun 657*4882a593SmuzhiyunNot a big change, but every penny counts :-) 658*4882a593Smuzhiyun 659*4882a593Smuzhiyun 660*4882a593SmuzhiyunAnalysis 7 661*4882a593Smuzhiyun========== 662*4882a593Smuzhiyun 663*4882a593SmuzhiyunActually this made things worse. Not very much, but I don't want to move 664*4882a593Smuzhiyuninto the wrong direction. Maybe something to investigate later. Could 665*4882a593Smuzhiyunhave to do with caching again. 666*4882a593Smuzhiyun 667*4882a593SmuzhiyunGuess that is what there is to win within the loop. Maybe unrolling one 668*4882a593Smuzhiyunmore time will help. I'll keep the optimisations from 7 for now. 669*4882a593Smuzhiyun 670*4882a593Smuzhiyun 671*4882a593SmuzhiyunAttempt 8 672*4882a593Smuzhiyun========= 673*4882a593Smuzhiyun 674*4882a593SmuzhiyunUnrolled the loop one more time. 675*4882a593Smuzhiyun 676*4882a593Smuzhiyun 677*4882a593SmuzhiyunAnalysis 8 678*4882a593Smuzhiyun========== 679*4882a593Smuzhiyun 680*4882a593SmuzhiyunThis makes things worse. Let's stick with attempt 6 and continue from there. 681*4882a593SmuzhiyunAlthough it seems that the code within the loop cannot be optimised 682*4882a593Smuzhiyunfurther there is still room to optimize the generation of the ecc codes. 683*4882a593SmuzhiyunWe can simply calculate the total parity. If this is 0 then rp4 = rp5 684*4882a593Smuzhiyunetc. If the parity is 1, then rp4 = !rp5; 685*4882a593Smuzhiyun 686*4882a593SmuzhiyunBut if rp4 = rp5 we do not need rp5 etc. We can just write the even bits 687*4882a593Smuzhiyunin the result byte and then do something like:: 688*4882a593Smuzhiyun 689*4882a593Smuzhiyun code[0] |= (code[0] << 1); 690*4882a593Smuzhiyun 691*4882a593SmuzhiyunLets test this. 692*4882a593Smuzhiyun 693*4882a593Smuzhiyun 694*4882a593SmuzhiyunAttempt 9 695*4882a593Smuzhiyun========= 696*4882a593Smuzhiyun 697*4882a593SmuzhiyunChanged the code but again this slightly degrades performance. Tried all 698*4882a593Smuzhiyunkind of other things, like having dedicated parity arrays to avoid the 699*4882a593Smuzhiyunshift after parity[rp7] << 7; No gain. 700*4882a593SmuzhiyunChange the lookup using the parity array by using shift operators (e.g. 701*4882a593Smuzhiyunreplace parity[rp7] << 7 with:: 702*4882a593Smuzhiyun 703*4882a593Smuzhiyun rp7 ^= (rp7 << 4); 704*4882a593Smuzhiyun rp7 ^= (rp7 << 2); 705*4882a593Smuzhiyun rp7 ^= (rp7 << 1); 706*4882a593Smuzhiyun rp7 &= 0x80; 707*4882a593Smuzhiyun 708*4882a593SmuzhiyunNo gain. 709*4882a593Smuzhiyun 710*4882a593SmuzhiyunThe only marginal change was inverting the parity bits, so we can remove 711*4882a593Smuzhiyunthe last three invert statements. 712*4882a593Smuzhiyun 713*4882a593SmuzhiyunAh well, pity this does not deliver more. Then again 10 million 714*4882a593Smuzhiyuniterations using the linux driver code takes between 13 and 13.5 715*4882a593Smuzhiyunseconds, whereas my code now takes about 0.73 seconds for those 10 716*4882a593Smuzhiyunmillion iterations. So basically I've improved the performance by a 717*4882a593Smuzhiyunfactor 18 on my system. Not that bad. Of course on different hardware 718*4882a593Smuzhiyunyou will get different results. No warranties! 719*4882a593Smuzhiyun 720*4882a593SmuzhiyunBut of course there is no such thing as a free lunch. The codesize almost 721*4882a593Smuzhiyuntripled (from 562 bytes to 1434 bytes). Then again, it is not that much. 722*4882a593Smuzhiyun 723*4882a593Smuzhiyun 724*4882a593SmuzhiyunCorrecting errors 725*4882a593Smuzhiyun================= 726*4882a593Smuzhiyun 727*4882a593SmuzhiyunFor correcting errors I again used the ST application note as a starter, 728*4882a593Smuzhiyunbut I also peeked at the existing code. 729*4882a593Smuzhiyun 730*4882a593SmuzhiyunThe algorithm itself is pretty straightforward. Just xor the given and 731*4882a593Smuzhiyunthe calculated ecc. If all bytes are 0 there is no problem. If 11 bits 732*4882a593Smuzhiyunare 1 we have one correctable bit error. If there is 1 bit 1, we have an 733*4882a593Smuzhiyunerror in the given ecc code. 734*4882a593Smuzhiyun 735*4882a593SmuzhiyunIt proved to be fastest to do some table lookups. Performance gain 736*4882a593Smuzhiyunintroduced by this is about a factor 2 on my system when a repair had to 737*4882a593Smuzhiyunbe done, and 1% or so if no repair had to be done. 738*4882a593Smuzhiyun 739*4882a593SmuzhiyunCode size increased from 330 bytes to 686 bytes for this function. 740*4882a593Smuzhiyun(gcc 4.2, -O3) 741*4882a593Smuzhiyun 742*4882a593Smuzhiyun 743*4882a593SmuzhiyunConclusion 744*4882a593Smuzhiyun========== 745*4882a593Smuzhiyun 746*4882a593SmuzhiyunThe gain when calculating the ecc is tremendous. Om my development hardware 747*4882a593Smuzhiyuna speedup of a factor of 18 for ecc calculation was achieved. On a test on an 748*4882a593Smuzhiyunembedded system with a MIPS core a factor 7 was obtained. 749*4882a593Smuzhiyun 750*4882a593SmuzhiyunOn a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor 751*4882a593Smuzhiyun5 (big endian mode, gcc 4.1.2, -O3) 752*4882a593Smuzhiyun 753*4882a593SmuzhiyunFor correction not much gain could be obtained (as bitflips are rare). Then 754*4882a593Smuzhiyunagain there are also much less cycles spent there. 755*4882a593Smuzhiyun 756*4882a593SmuzhiyunIt seems there is not much more gain possible in this, at least when 757*4882a593Smuzhiyunprogrammed in C. Of course it might be possible to squeeze something more 758*4882a593Smuzhiyunout of it with an assembler program, but due to pipeline behaviour etc 759*4882a593Smuzhiyunthis is very tricky (at least for intel hw). 760*4882a593Smuzhiyun 761*4882a593SmuzhiyunAuthor: Frans Meulenbroeks 762*4882a593Smuzhiyun 763*4882a593SmuzhiyunCopyright (C) 2008 Koninklijke Philips Electronics NV. 764