1*4882a593Smuzhiyun====================================== 2*4882a593Smuzhiyunvlocks for Bare-Metal Mutual Exclusion 3*4882a593Smuzhiyun====================================== 4*4882a593Smuzhiyun 5*4882a593SmuzhiyunVoting Locks, or "vlocks" provide a simple low-level mutual exclusion 6*4882a593Smuzhiyunmechanism, with reasonable but minimal requirements on the memory 7*4882a593Smuzhiyunsystem. 8*4882a593Smuzhiyun 9*4882a593SmuzhiyunThese are intended to be used to coordinate critical activity among CPUs 10*4882a593Smuzhiyunwhich are otherwise non-coherent, in situations where the hardware 11*4882a593Smuzhiyunprovides no other mechanism to support this and ordinary spinlocks 12*4882a593Smuzhiyuncannot be used. 13*4882a593Smuzhiyun 14*4882a593Smuzhiyun 15*4882a593Smuzhiyunvlocks make use of the atomicity provided by the memory system for 16*4882a593Smuzhiyunwrites to a single memory location. To arbitrate, every CPU "votes for 17*4882a593Smuzhiyunitself", by storing a unique number to a common memory location. The 18*4882a593Smuzhiyunfinal value seen in that memory location when all the votes have been 19*4882a593Smuzhiyuncast identifies the winner. 20*4882a593Smuzhiyun 21*4882a593SmuzhiyunIn order to make sure that the election produces an unambiguous result 22*4882a593Smuzhiyunin finite time, a CPU will only enter the election in the first place if 23*4882a593Smuzhiyunno winner has been chosen and the election does not appear to have 24*4882a593Smuzhiyunstarted yet. 25*4882a593Smuzhiyun 26*4882a593Smuzhiyun 27*4882a593SmuzhiyunAlgorithm 28*4882a593Smuzhiyun--------- 29*4882a593Smuzhiyun 30*4882a593SmuzhiyunThe easiest way to explain the vlocks algorithm is with some pseudo-code:: 31*4882a593Smuzhiyun 32*4882a593Smuzhiyun 33*4882a593Smuzhiyun int currently_voting[NR_CPUS] = { 0, }; 34*4882a593Smuzhiyun int last_vote = -1; /* no votes yet */ 35*4882a593Smuzhiyun 36*4882a593Smuzhiyun bool vlock_trylock(int this_cpu) 37*4882a593Smuzhiyun { 38*4882a593Smuzhiyun /* signal our desire to vote */ 39*4882a593Smuzhiyun currently_voting[this_cpu] = 1; 40*4882a593Smuzhiyun if (last_vote != -1) { 41*4882a593Smuzhiyun /* someone already volunteered himself */ 42*4882a593Smuzhiyun currently_voting[this_cpu] = 0; 43*4882a593Smuzhiyun return false; /* not ourself */ 44*4882a593Smuzhiyun } 45*4882a593Smuzhiyun 46*4882a593Smuzhiyun /* let's suggest ourself */ 47*4882a593Smuzhiyun last_vote = this_cpu; 48*4882a593Smuzhiyun currently_voting[this_cpu] = 0; 49*4882a593Smuzhiyun 50*4882a593Smuzhiyun /* then wait until everyone else is done voting */ 51*4882a593Smuzhiyun for_each_cpu(i) { 52*4882a593Smuzhiyun while (currently_voting[i] != 0) 53*4882a593Smuzhiyun /* wait */; 54*4882a593Smuzhiyun } 55*4882a593Smuzhiyun 56*4882a593Smuzhiyun /* result */ 57*4882a593Smuzhiyun if (last_vote == this_cpu) 58*4882a593Smuzhiyun return true; /* we won */ 59*4882a593Smuzhiyun return false; 60*4882a593Smuzhiyun } 61*4882a593Smuzhiyun 62*4882a593Smuzhiyun bool vlock_unlock(void) 63*4882a593Smuzhiyun { 64*4882a593Smuzhiyun last_vote = -1; 65*4882a593Smuzhiyun } 66*4882a593Smuzhiyun 67*4882a593Smuzhiyun 68*4882a593SmuzhiyunThe currently_voting[] array provides a way for the CPUs to determine 69*4882a593Smuzhiyunwhether an election is in progress, and plays a role analogous to the 70*4882a593Smuzhiyun"entering" array in Lamport's bakery algorithm [1]. 71*4882a593Smuzhiyun 72*4882a593SmuzhiyunHowever, once the election has started, the underlying memory system 73*4882a593Smuzhiyunatomicity is used to pick the winner. This avoids the need for a static 74*4882a593Smuzhiyunpriority rule to act as a tie-breaker, or any counters which could 75*4882a593Smuzhiyunoverflow. 76*4882a593Smuzhiyun 77*4882a593SmuzhiyunAs long as the last_vote variable is globally visible to all CPUs, it 78*4882a593Smuzhiyunwill contain only one value that won't change once every CPU has cleared 79*4882a593Smuzhiyunits currently_voting flag. 80*4882a593Smuzhiyun 81*4882a593Smuzhiyun 82*4882a593SmuzhiyunFeatures and limitations 83*4882a593Smuzhiyun------------------------ 84*4882a593Smuzhiyun 85*4882a593Smuzhiyun * vlocks are not intended to be fair. In the contended case, it is the 86*4882a593Smuzhiyun _last_ CPU which attempts to get the lock which will be most likely 87*4882a593Smuzhiyun to win. 88*4882a593Smuzhiyun 89*4882a593Smuzhiyun vlocks are therefore best suited to situations where it is necessary 90*4882a593Smuzhiyun to pick a unique winner, but it does not matter which CPU actually 91*4882a593Smuzhiyun wins. 92*4882a593Smuzhiyun 93*4882a593Smuzhiyun * Like other similar mechanisms, vlocks will not scale well to a large 94*4882a593Smuzhiyun number of CPUs. 95*4882a593Smuzhiyun 96*4882a593Smuzhiyun vlocks can be cascaded in a voting hierarchy to permit better scaling 97*4882a593Smuzhiyun if necessary, as in the following hypothetical example for 4096 CPUs:: 98*4882a593Smuzhiyun 99*4882a593Smuzhiyun /* first level: local election */ 100*4882a593Smuzhiyun my_town = towns[(this_cpu >> 4) & 0xf]; 101*4882a593Smuzhiyun I_won = vlock_trylock(my_town, this_cpu & 0xf); 102*4882a593Smuzhiyun if (I_won) { 103*4882a593Smuzhiyun /* we won the town election, let's go for the state */ 104*4882a593Smuzhiyun my_state = states[(this_cpu >> 8) & 0xf]; 105*4882a593Smuzhiyun I_won = vlock_lock(my_state, this_cpu & 0xf)); 106*4882a593Smuzhiyun if (I_won) { 107*4882a593Smuzhiyun /* and so on */ 108*4882a593Smuzhiyun I_won = vlock_lock(the_whole_country, this_cpu & 0xf]; 109*4882a593Smuzhiyun if (I_won) { 110*4882a593Smuzhiyun /* ... */ 111*4882a593Smuzhiyun } 112*4882a593Smuzhiyun vlock_unlock(the_whole_country); 113*4882a593Smuzhiyun } 114*4882a593Smuzhiyun vlock_unlock(my_state); 115*4882a593Smuzhiyun } 116*4882a593Smuzhiyun vlock_unlock(my_town); 117*4882a593Smuzhiyun 118*4882a593Smuzhiyun 119*4882a593SmuzhiyunARM implementation 120*4882a593Smuzhiyun------------------ 121*4882a593Smuzhiyun 122*4882a593SmuzhiyunThe current ARM implementation [2] contains some optimisations beyond 123*4882a593Smuzhiyunthe basic algorithm: 124*4882a593Smuzhiyun 125*4882a593Smuzhiyun * By packing the members of the currently_voting array close together, 126*4882a593Smuzhiyun we can read the whole array in one transaction (providing the number 127*4882a593Smuzhiyun of CPUs potentially contending the lock is small enough). This 128*4882a593Smuzhiyun reduces the number of round-trips required to external memory. 129*4882a593Smuzhiyun 130*4882a593Smuzhiyun In the ARM implementation, this means that we can use a single load 131*4882a593Smuzhiyun and comparison:: 132*4882a593Smuzhiyun 133*4882a593Smuzhiyun LDR Rt, [Rn] 134*4882a593Smuzhiyun CMP Rt, #0 135*4882a593Smuzhiyun 136*4882a593Smuzhiyun ...in place of code equivalent to:: 137*4882a593Smuzhiyun 138*4882a593Smuzhiyun LDRB Rt, [Rn] 139*4882a593Smuzhiyun CMP Rt, #0 140*4882a593Smuzhiyun LDRBEQ Rt, [Rn, #1] 141*4882a593Smuzhiyun CMPEQ Rt, #0 142*4882a593Smuzhiyun LDRBEQ Rt, [Rn, #2] 143*4882a593Smuzhiyun CMPEQ Rt, #0 144*4882a593Smuzhiyun LDRBEQ Rt, [Rn, #3] 145*4882a593Smuzhiyun CMPEQ Rt, #0 146*4882a593Smuzhiyun 147*4882a593Smuzhiyun This cuts down on the fast-path latency, as well as potentially 148*4882a593Smuzhiyun reducing bus contention in contended cases. 149*4882a593Smuzhiyun 150*4882a593Smuzhiyun The optimisation relies on the fact that the ARM memory system 151*4882a593Smuzhiyun guarantees coherency between overlapping memory accesses of 152*4882a593Smuzhiyun different sizes, similarly to many other architectures. Note that 153*4882a593Smuzhiyun we do not care which element of currently_voting appears in which 154*4882a593Smuzhiyun bits of Rt, so there is no need to worry about endianness in this 155*4882a593Smuzhiyun optimisation. 156*4882a593Smuzhiyun 157*4882a593Smuzhiyun If there are too many CPUs to read the currently_voting array in 158*4882a593Smuzhiyun one transaction then multiple transations are still required. The 159*4882a593Smuzhiyun implementation uses a simple loop of word-sized loads for this 160*4882a593Smuzhiyun case. The number of transactions is still fewer than would be 161*4882a593Smuzhiyun required if bytes were loaded individually. 162*4882a593Smuzhiyun 163*4882a593Smuzhiyun 164*4882a593Smuzhiyun In principle, we could aggregate further by using LDRD or LDM, but 165*4882a593Smuzhiyun to keep the code simple this was not attempted in the initial 166*4882a593Smuzhiyun implementation. 167*4882a593Smuzhiyun 168*4882a593Smuzhiyun 169*4882a593Smuzhiyun * vlocks are currently only used to coordinate between CPUs which are 170*4882a593Smuzhiyun unable to enable their caches yet. This means that the 171*4882a593Smuzhiyun implementation removes many of the barriers which would be required 172*4882a593Smuzhiyun when executing the algorithm in cached memory. 173*4882a593Smuzhiyun 174*4882a593Smuzhiyun packing of the currently_voting array does not work with cached 175*4882a593Smuzhiyun memory unless all CPUs contending the lock are cache-coherent, due 176*4882a593Smuzhiyun to cache writebacks from one CPU clobbering values written by other 177*4882a593Smuzhiyun CPUs. (Though if all the CPUs are cache-coherent, you should be 178*4882a593Smuzhiyun probably be using proper spinlocks instead anyway). 179*4882a593Smuzhiyun 180*4882a593Smuzhiyun 181*4882a593Smuzhiyun * The "no votes yet" value used for the last_vote variable is 0 (not 182*4882a593Smuzhiyun -1 as in the pseudocode). This allows statically-allocated vlocks 183*4882a593Smuzhiyun to be implicitly initialised to an unlocked state simply by putting 184*4882a593Smuzhiyun them in .bss. 185*4882a593Smuzhiyun 186*4882a593Smuzhiyun An offset is added to each CPU's ID for the purpose of setting this 187*4882a593Smuzhiyun variable, so that no CPU uses the value 0 for its ID. 188*4882a593Smuzhiyun 189*4882a593Smuzhiyun 190*4882a593SmuzhiyunColophon 191*4882a593Smuzhiyun-------- 192*4882a593Smuzhiyun 193*4882a593SmuzhiyunOriginally created and documented by Dave Martin for Linaro Limited, for 194*4882a593Smuzhiyunuse in ARM-based big.LITTLE platforms, with review and input gratefully 195*4882a593Smuzhiyunreceived from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for 196*4882a593Smuzhiyungrabbing most of this text out of the relevant mail thread and writing 197*4882a593Smuzhiyunup the pseudocode. 198*4882a593Smuzhiyun 199*4882a593SmuzhiyunCopyright (C) 2012-2013 Linaro Limited 200*4882a593SmuzhiyunDistributed under the terms of Version 2 of the GNU General Public 201*4882a593SmuzhiyunLicense, as defined in linux/COPYING. 202*4882a593Smuzhiyun 203*4882a593Smuzhiyun 204*4882a593SmuzhiyunReferences 205*4882a593Smuzhiyun---------- 206*4882a593Smuzhiyun 207*4882a593Smuzhiyun[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming 208*4882a593Smuzhiyun Problem", Communications of the ACM 17, 8 (August 1974), 453-455. 209*4882a593Smuzhiyun 210*4882a593Smuzhiyun https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm 211*4882a593Smuzhiyun 212*4882a593Smuzhiyun[2] linux/arch/arm/common/vlock.S, www.kernel.org. 213