======================================
vlocks for Bare-Metal Mutual Exclusion
======================================

Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
mechanism, with reasonable but minimal requirements on the memory
system.

These are intended to be used to coordinate critical activity among CPUs
which are otherwise non-coherent, in situations where the hardware
provides no other mechanism to support this and ordinary spinlocks
cannot be used.


vlocks make use of the atomicity provided by the memory system for
writes to a single memory location.  To arbitrate, every CPU "votes for
itself", by storing a unique number to a common memory location.  The
final value seen in that memory location when all the votes have been
cast identifies the winner.

In order to make sure that the election produces an unambiguous result
in finite time, a CPU will only enter the election in the first place if
no winner has been chosen and the election does not appear to have
started yet.


Algorithm
---------

The easiest way to explain the vlocks algorithm is with some pseudo-code::


	int currently_voting[NR_CPUS] = { 0, };
	int last_vote = -1; /* no votes yet */

	bool vlock_trylock(int this_cpu)
	{
		int i;

		/* signal our desire to vote */
		currently_voting[this_cpu] = 1;
		if (last_vote != -1) {
			/* someone has already volunteered */
			currently_voting[this_cpu] = 0;
			return false; /* not ourself */
		}

		/* let's suggest ourself */
		last_vote = this_cpu;
		currently_voting[this_cpu] = 0;

		/* then wait until everyone else is done voting */
		for_each_cpu(i) {
			while (currently_voting[i] != 0)
				/* wait */;
		}

		/* result */
		if (last_vote == this_cpu)
			return true; /* we won */
		return false;
	}

	void vlock_unlock(void)
	{
		last_vote = -1;
	}


The currently_voting[] array provides a way for the CPUs to determine
whether an election is in progress, and plays a role analogous to the
"entering" array in Lamport's bakery algorithm [1].

However, once the election has started, the underlying memory system
atomicity is used to pick the winner.  This avoids the need for a static
priority rule to act as a tie-breaker, or any counters which could
overflow.

As long as the last_vote variable is globally visible to all CPUs, it
will contain only one value that won't change once every CPU has cleared
its currently_voting flag.

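The pseudo-code leaves memory ordering implicit.  As a minimal sketch,
assuming a cache-coherent host with C11 atomics (which is not the
environment vlocks are designed for, so this is only useful for
experimenting with the algorithm), the election could be modelled as
follows.  The names vlock_model, VLOCK_NR_CPUS and VLOCK_NO_VOTE are
hypothetical and do not appear in the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VLOCK_NR_CPUS	4	/* hypothetical number of contenders */
#define VLOCK_NO_VOTE	(-1)	/* "no votes yet", as in the pseudo-code */

struct vlock_model {
	atomic_int currently_voting[VLOCK_NR_CPUS];
	atomic_int last_vote;
};

static void vlock_model_init(struct vlock_model *v)
{
	for (int i = 0; i < VLOCK_NR_CPUS; i++)
		atomic_init(&v->currently_voting[i], 0);
	atomic_init(&v->last_vote, VLOCK_NO_VOTE);
}

static bool vlock_model_trylock(struct vlock_model *v, int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < VLOCK_NR_CPUS);

	/* signal our desire to vote; atomic_store() is sequentially consistent */
	atomic_store(&v->currently_voting[this_cpu], 1);
	if (atomic_load(&v->last_vote) != VLOCK_NO_VOTE) {
		/* an election has already started or been won */
		atomic_store(&v->currently_voting[this_cpu], 0);
		return false;
	}

	/* cast our vote, then withdraw from the "voting" set */
	atomic_store(&v->last_vote, this_cpu);
	atomic_store(&v->currently_voting[this_cpu], 0);

	/* wait until every other contender has finished voting */
	for (int i = 0; i < VLOCK_NR_CPUS; i++)
		while (atomic_load(&v->currently_voting[i]) != 0)
			; /* spin */

	/* the last vote written is the winner */
	return atomic_load(&v->last_vote) == this_cpu;
}

static void vlock_model_unlock(struct vlock_model *v)
{
	atomic_store(&v->last_vote, VLOCK_NO_VOTE);
}
```

In this model every access is sequentially consistent, which is stronger
than strictly necessary but keeps the sketch easy to reason about.  A
caller that receives false has lost the election and must not call the
unlock function.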

Features and limitations
------------------------

 * vlocks are not intended to be fair.  In the contended case, it is the
   *last* CPU to attempt to take the lock that is most likely to win.

   vlocks are therefore best suited to situations where it is necessary
   to pick a unique winner, but it does not matter which CPU actually
   wins.

 * Like other similar mechanisms, vlocks will not scale well to a large
   number of CPUs.

   vlocks can be cascaded in a voting hierarchy to permit better scaling
   if necessary, as in the following hypothetical example for 4096 CPUs::

	/* first level: local election among the 16 CPUs of a town */
	my_town = towns[this_cpu >> 4];
	I_won = vlock_trylock(my_town, this_cpu & 0xf);
	if (I_won) {
		/* we won the town election, let's go for the state */
		my_state = states[this_cpu >> 8];
		I_won = vlock_trylock(my_state, (this_cpu >> 4) & 0xf);
		if (I_won) {
			/* and so on */
			I_won = vlock_trylock(the_whole_country, (this_cpu >> 8) & 0xf);
			if (I_won) {
				/* ... */
				vlock_unlock(the_whole_country);
			}
			vlock_unlock(my_state);
		}
		vlock_unlock(my_town);
	}

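In a three-level hierarchy of 16-way vlocks covering 4096 CPUs, each CPU
needs a lock index and a voter ID at every level.  The following is a
minimal sketch of one possible decomposition of a 12-bit CPU number; the
struct vlock_path type and the vlock_path_for_cpu() helper are
hypothetical names introduced only for illustration.

```c
#include <assert.h>

/* Where a CPU votes at each level of a 16 x 16 x 16 hierarchy. */
struct vlock_path {
	unsigned int town;		/* which of the 256 town locks */
	unsigned int town_id;		/* voter ID within the town, 0..15 */
	unsigned int state;		/* which of the 16 state locks */
	unsigned int state_id;		/* voter ID within the state, 0..15 */
	unsigned int country_id;	/* voter ID in the top-level lock, 0..15 */
};

static struct vlock_path vlock_path_for_cpu(unsigned int cpu)
{
	struct vlock_path p;

	assert(cpu < 4096);

	p.town = cpu >> 4;		/* upper 8 bits select the town */
	p.town_id = cpu & 0xf;		/* low nibble: CPU within its town */
	p.state = cpu >> 8;		/* upper 4 bits select the state */
	p.state_id = (cpu >> 4) & 0xf;	/* middle nibble: town within its state */
	p.country_id = (cpu >> 8) & 0xf;/* top nibble: state within the country */
	return p;
}
```

At each level above the first, the voter ID names the group that the CPU
represents rather than the CPU itself, so two town winners from the same
state can never vote with the same ID in that state's election.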

ARM implementation
------------------

The current ARM implementation [2] contains some optimisations beyond
the basic algorithm:

 * By packing the members of the currently_voting array close together,
   we can read the whole array in one transaction (providing the number
   of CPUs potentially contending the lock is small enough).  This
   reduces the number of round-trips required to external memory.

   In the ARM implementation, this means that we can use a single load
   and comparison::

	LDR	Rt, [Rn]
	CMP	Rt, #0

   ...in place of code equivalent to::

	LDRB	Rt, [Rn]
	CMP	Rt, #0
	LDRBEQ	Rt, [Rn, #1]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #2]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #3]
	CMPEQ	Rt, #0

   This cuts down on the fast-path latency, as well as potentially
   reducing bus contention in contended cases.

   The optimisation relies on the fact that the ARM memory system
   guarantees coherency between overlapping memory accesses of
   different sizes, similarly to many other architectures.  Note that
   we do not care which element of currently_voting appears in which
   bits of Rt, so there is no need to worry about endianness in this
   optimisation.

   If there are too many CPUs to read the currently_voting array in
   one transaction then multiple transactions are still required.  The
   implementation uses a simple loop of word-sized loads for this
   case.  The number of transactions is still fewer than would be
   required if bytes were loaded individually.


   In principle, we could aggregate further by using LDRD or LDM, but
   to keep the code simple this was not attempted in the initial
   implementation.


 * vlocks are currently only used to coordinate between CPUs which are
   unable to enable their caches yet.  This means that the
   implementation removes many of the barriers which would be required
   when executing the algorithm in cached memory.

   Packing of the currently_voting array does not work with cached
   memory unless all CPUs contending the lock are cache-coherent, due
   to cache writebacks from one CPU clobbering values written by other
   CPUs.  (Though if all the CPUs are cache-coherent, you should
   probably be using proper spinlocks instead anyway.)


 * The "no votes yet" value used for the last_vote variable is 0 (not
   -1 as in the pseudocode).  This allows statically-allocated vlocks
   to be implicitly initialised to an unlocked state simply by putting
   them in .bss.

   An offset is added to each CPU's ID for the purpose of setting this
   variable, so that no CPU uses the value 0 for its ID.

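The packing and zero-initialisation points above can be illustrated in C.
This is only a sketch under stated assumptions and does not reproduce the
layout or constants of the assembly implementation [2]: the vlock_sketch
structure, the SKETCH_* constants, the offset of 1 and both helper
functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_VOTERS	8	/* hypothetical; must be a multiple of 4 */
#define SKETCH_OWNER_NONE	0	/* all-zero .bss reads as "unlocked" */

struct vlock_sketch {
	uint32_t owner;				/* 0, or voter ID plus an offset */
	uint8_t voting[SKETCH_NR_VOTERS];	/* one flag byte per voter, packed */
};

/* Encode a voter ID so that no voter ever stores the "unlocked" value 0. */
static uint32_t sketch_owner_encode(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_VOTERS);
	return cpu + 1;	/* assumed offset of 1 */
}

/* Return true if any voter's flag is set, testing four flags per load. */
static bool sketch_anyone_voting(const struct vlock_sketch *v)
{
	for (size_t i = 0; i < SKETCH_NR_VOTERS; i += 4) {
		uint32_t word;

		/* memcpy keeps the word-sized read well-defined in portable C */
		memcpy(&word, &v->voting[i], sizeof(word));

		/* which byte lands in which bits is irrelevant: only zero matters */
		if (word != 0)
			return true;
	}
	return false;
}
```

The sketch ignores everything that makes the real code delicate: on the
target hardware the flags live in uncached memory, the accesses must not
be cached or elided by the compiler, and the word-sized read relies on
the architecture's guarantee about overlapping accesses of different
sizes.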

Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, for
use in ARM-based big.LITTLE platforms, with review and input gratefully
received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
grabbing most of this text out of the relevant mail thread and writing
up the pseudocode.

Copyright (C) 2012-2013  Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.


References
----------

[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.

    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.