xref: /OK3568_Linux_fs/kernel/Documentation/arm/vlocks.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun======================================
2*4882a593Smuzhiyunvlocks for Bare-Metal Mutual Exclusion
3*4882a593Smuzhiyun======================================
4*4882a593Smuzhiyun
5*4882a593SmuzhiyunVoting Locks, or "vlocks" provide a simple low-level mutual exclusion
6*4882a593Smuzhiyunmechanism, with reasonable but minimal requirements on the memory
7*4882a593Smuzhiyunsystem.
8*4882a593Smuzhiyun
9*4882a593SmuzhiyunThese are intended to be used to coordinate critical activity among CPUs
10*4882a593Smuzhiyunwhich are otherwise non-coherent, in situations where the hardware
11*4882a593Smuzhiyunprovides no other mechanism to support this and ordinary spinlocks
12*4882a593Smuzhiyuncannot be used.
13*4882a593Smuzhiyun
14*4882a593Smuzhiyun
15*4882a593Smuzhiyunvlocks make use of the atomicity provided by the memory system for
16*4882a593Smuzhiyunwrites to a single memory location.  To arbitrate, every CPU "votes for
17*4882a593Smuzhiyunitself", by storing a unique number to a common memory location.  The
18*4882a593Smuzhiyunfinal value seen in that memory location when all the votes have been
19*4882a593Smuzhiyuncast identifies the winner.
20*4882a593Smuzhiyun
21*4882a593SmuzhiyunIn order to make sure that the election produces an unambiguous result
22*4882a593Smuzhiyunin finite time, a CPU will only enter the election in the first place if
23*4882a593Smuzhiyunno winner has been chosen and the election does not appear to have
24*4882a593Smuzhiyunstarted yet.
25*4882a593Smuzhiyun
26*4882a593Smuzhiyun
27*4882a593SmuzhiyunAlgorithm
28*4882a593Smuzhiyun---------
29*4882a593Smuzhiyun
30*4882a593SmuzhiyunThe easiest way to explain the vlocks algorithm is with some pseudo-code::
31*4882a593Smuzhiyun
32*4882a593Smuzhiyun
33*4882a593Smuzhiyun	int currently_voting[NR_CPUS] = { 0, };
34*4882a593Smuzhiyun	int last_vote = -1; /* no votes yet */
35*4882a593Smuzhiyun
36*4882a593Smuzhiyun	bool vlock_trylock(int this_cpu)
37*4882a593Smuzhiyun	{
38*4882a593Smuzhiyun		/* signal our desire to vote */
39*4882a593Smuzhiyun		currently_voting[this_cpu] = 1;
40*4882a593Smuzhiyun		if (last_vote != -1) {
41*4882a593Smuzhiyun			/* someone already volunteered himself */
42*4882a593Smuzhiyun			currently_voting[this_cpu] = 0;
43*4882a593Smuzhiyun			return false; /* not ourself */
44*4882a593Smuzhiyun		}
45*4882a593Smuzhiyun
46*4882a593Smuzhiyun		/* let's suggest ourself */
47*4882a593Smuzhiyun		last_vote = this_cpu;
48*4882a593Smuzhiyun		currently_voting[this_cpu] = 0;
49*4882a593Smuzhiyun
50*4882a593Smuzhiyun		/* then wait until everyone else is done voting */
51*4882a593Smuzhiyun		for_each_cpu(i) {
52*4882a593Smuzhiyun			while (currently_voting[i] != 0)
53*4882a593Smuzhiyun				/* wait */;
54*4882a593Smuzhiyun		}
55*4882a593Smuzhiyun
56*4882a593Smuzhiyun		/* result */
57*4882a593Smuzhiyun		if (last_vote == this_cpu)
58*4882a593Smuzhiyun			return true; /* we won */
59*4882a593Smuzhiyun		return false;
60*4882a593Smuzhiyun	}
61*4882a593Smuzhiyun
62*4882a593Smuzhiyun	bool vlock_unlock(void)
63*4882a593Smuzhiyun	{
64*4882a593Smuzhiyun		last_vote = -1;
65*4882a593Smuzhiyun	}
66*4882a593Smuzhiyun
67*4882a593Smuzhiyun
68*4882a593SmuzhiyunThe currently_voting[] array provides a way for the CPUs to determine
69*4882a593Smuzhiyunwhether an election is in progress, and plays a role analogous to the
70*4882a593Smuzhiyun"entering" array in Lamport's bakery algorithm [1].
71*4882a593Smuzhiyun
72*4882a593SmuzhiyunHowever, once the election has started, the underlying memory system
73*4882a593Smuzhiyunatomicity is used to pick the winner.  This avoids the need for a static
74*4882a593Smuzhiyunpriority rule to act as a tie-breaker, or any counters which could
75*4882a593Smuzhiyunoverflow.
76*4882a593Smuzhiyun
77*4882a593SmuzhiyunAs long as the last_vote variable is globally visible to all CPUs, it
78*4882a593Smuzhiyunwill contain only one value that won't change once every CPU has cleared
79*4882a593Smuzhiyunits currently_voting flag.
80*4882a593Smuzhiyun
81*4882a593Smuzhiyun
82*4882a593SmuzhiyunFeatures and limitations
83*4882a593Smuzhiyun------------------------
84*4882a593Smuzhiyun
85*4882a593Smuzhiyun * vlocks are not intended to be fair.  In the contended case, it is the
86*4882a593Smuzhiyun   _last_ CPU which attempts to get the lock which will be most likely
87*4882a593Smuzhiyun   to win.
88*4882a593Smuzhiyun
89*4882a593Smuzhiyun   vlocks are therefore best suited to situations where it is necessary
90*4882a593Smuzhiyun   to pick a unique winner, but it does not matter which CPU actually
91*4882a593Smuzhiyun   wins.
92*4882a593Smuzhiyun
93*4882a593Smuzhiyun * Like other similar mechanisms, vlocks will not scale well to a large
94*4882a593Smuzhiyun   number of CPUs.
95*4882a593Smuzhiyun
96*4882a593Smuzhiyun   vlocks can be cascaded in a voting hierarchy to permit better scaling
97*4882a593Smuzhiyun   if necessary, as in the following hypothetical example for 4096 CPUs::
98*4882a593Smuzhiyun
99*4882a593Smuzhiyun	/* first level: local election */
100*4882a593Smuzhiyun	my_town = towns[(this_cpu >> 4) & 0xf];
101*4882a593Smuzhiyun	I_won = vlock_trylock(my_town, this_cpu & 0xf);
102*4882a593Smuzhiyun	if (I_won) {
103*4882a593Smuzhiyun		/* we won the town election, let's go for the state */
104*4882a593Smuzhiyun		my_state = states[(this_cpu >> 8) & 0xf];
105*4882a593Smuzhiyun		I_won = vlock_lock(my_state, this_cpu & 0xf));
106*4882a593Smuzhiyun		if (I_won) {
107*4882a593Smuzhiyun			/* and so on */
108*4882a593Smuzhiyun			I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
109*4882a593Smuzhiyun			if (I_won) {
110*4882a593Smuzhiyun				/* ... */
111*4882a593Smuzhiyun			}
112*4882a593Smuzhiyun			vlock_unlock(the_whole_country);
113*4882a593Smuzhiyun		}
114*4882a593Smuzhiyun		vlock_unlock(my_state);
115*4882a593Smuzhiyun	}
116*4882a593Smuzhiyun	vlock_unlock(my_town);
117*4882a593Smuzhiyun
118*4882a593Smuzhiyun
119*4882a593SmuzhiyunARM implementation
120*4882a593Smuzhiyun------------------
121*4882a593Smuzhiyun
122*4882a593SmuzhiyunThe current ARM implementation [2] contains some optimisations beyond
123*4882a593Smuzhiyunthe basic algorithm:
124*4882a593Smuzhiyun
125*4882a593Smuzhiyun * By packing the members of the currently_voting array close together,
126*4882a593Smuzhiyun   we can read the whole array in one transaction (providing the number
127*4882a593Smuzhiyun   of CPUs potentially contending the lock is small enough).  This
128*4882a593Smuzhiyun   reduces the number of round-trips required to external memory.
129*4882a593Smuzhiyun
130*4882a593Smuzhiyun   In the ARM implementation, this means that we can use a single load
131*4882a593Smuzhiyun   and comparison::
132*4882a593Smuzhiyun
133*4882a593Smuzhiyun	LDR	Rt, [Rn]
134*4882a593Smuzhiyun	CMP	Rt, #0
135*4882a593Smuzhiyun
136*4882a593Smuzhiyun   ...in place of code equivalent to::
137*4882a593Smuzhiyun
138*4882a593Smuzhiyun	LDRB	Rt, [Rn]
139*4882a593Smuzhiyun	CMP	Rt, #0
140*4882a593Smuzhiyun	LDRBEQ	Rt, [Rn, #1]
141*4882a593Smuzhiyun	CMPEQ	Rt, #0
142*4882a593Smuzhiyun	LDRBEQ	Rt, [Rn, #2]
143*4882a593Smuzhiyun	CMPEQ	Rt, #0
144*4882a593Smuzhiyun	LDRBEQ	Rt, [Rn, #3]
145*4882a593Smuzhiyun	CMPEQ	Rt, #0
146*4882a593Smuzhiyun
147*4882a593Smuzhiyun   This cuts down on the fast-path latency, as well as potentially
148*4882a593Smuzhiyun   reducing bus contention in contended cases.
149*4882a593Smuzhiyun
150*4882a593Smuzhiyun   The optimisation relies on the fact that the ARM memory system
151*4882a593Smuzhiyun   guarantees coherency between overlapping memory accesses of
152*4882a593Smuzhiyun   different sizes, similarly to many other architectures.  Note that
153*4882a593Smuzhiyun   we do not care which element of currently_voting appears in which
154*4882a593Smuzhiyun   bits of Rt, so there is no need to worry about endianness in this
155*4882a593Smuzhiyun   optimisation.
156*4882a593Smuzhiyun
157*4882a593Smuzhiyun   If there are too many CPUs to read the currently_voting array in
158*4882a593Smuzhiyun   one transaction then multiple transations are still required.  The
159*4882a593Smuzhiyun   implementation uses a simple loop of word-sized loads for this
160*4882a593Smuzhiyun   case.  The number of transactions is still fewer than would be
161*4882a593Smuzhiyun   required if bytes were loaded individually.
162*4882a593Smuzhiyun
163*4882a593Smuzhiyun
164*4882a593Smuzhiyun   In principle, we could aggregate further by using LDRD or LDM, but
165*4882a593Smuzhiyun   to keep the code simple this was not attempted in the initial
166*4882a593Smuzhiyun   implementation.
167*4882a593Smuzhiyun
168*4882a593Smuzhiyun
169*4882a593Smuzhiyun * vlocks are currently only used to coordinate between CPUs which are
170*4882a593Smuzhiyun   unable to enable their caches yet.  This means that the
171*4882a593Smuzhiyun   implementation removes many of the barriers which would be required
172*4882a593Smuzhiyun   when executing the algorithm in cached memory.
173*4882a593Smuzhiyun
174*4882a593Smuzhiyun   packing of the currently_voting array does not work with cached
175*4882a593Smuzhiyun   memory unless all CPUs contending the lock are cache-coherent, due
176*4882a593Smuzhiyun   to cache writebacks from one CPU clobbering values written by other
177*4882a593Smuzhiyun   CPUs.  (Though if all the CPUs are cache-coherent, you should be
178*4882a593Smuzhiyun   probably be using proper spinlocks instead anyway).
179*4882a593Smuzhiyun
180*4882a593Smuzhiyun
181*4882a593Smuzhiyun * The "no votes yet" value used for the last_vote variable is 0 (not
182*4882a593Smuzhiyun   -1 as in the pseudocode).  This allows statically-allocated vlocks
183*4882a593Smuzhiyun   to be implicitly initialised to an unlocked state simply by putting
184*4882a593Smuzhiyun   them in .bss.
185*4882a593Smuzhiyun
186*4882a593Smuzhiyun   An offset is added to each CPU's ID for the purpose of setting this
187*4882a593Smuzhiyun   variable, so that no CPU uses the value 0 for its ID.
188*4882a593Smuzhiyun
189*4882a593Smuzhiyun
190*4882a593SmuzhiyunColophon
191*4882a593Smuzhiyun--------
192*4882a593Smuzhiyun
193*4882a593SmuzhiyunOriginally created and documented by Dave Martin for Linaro Limited, for
194*4882a593Smuzhiyunuse in ARM-based big.LITTLE platforms, with review and input gratefully
195*4882a593Smuzhiyunreceived from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
196*4882a593Smuzhiyungrabbing most of this text out of the relevant mail thread and writing
197*4882a593Smuzhiyunup the pseudocode.
198*4882a593Smuzhiyun
199*4882a593SmuzhiyunCopyright (C) 2012-2013  Linaro Limited
200*4882a593SmuzhiyunDistributed under the terms of Version 2 of the GNU General Public
201*4882a593SmuzhiyunLicense, as defined in linux/COPYING.
202*4882a593Smuzhiyun
203*4882a593Smuzhiyun
204*4882a593SmuzhiyunReferences
205*4882a593Smuzhiyun----------
206*4882a593Smuzhiyun
207*4882a593Smuzhiyun[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
208*4882a593Smuzhiyun    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
209*4882a593Smuzhiyun
210*4882a593Smuzhiyun    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
211*4882a593Smuzhiyun
212*4882a593Smuzhiyun[2] linux/arch/arm/common/vlock.S, www.kernel.org.
213