=========================================================
Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed. "Basic model" explains general concepts using a simplified view
of the system. The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run. This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up. Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

The mechanism presented in this document describes a coherent memory
based protocol for performing the needed coordination. It aims to be as
lightweight as possible, while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

    - DOWN
    - COMING_UP
    - UP
    - GOING_DOWN

::

        +---------> UP ----------+
        |                        v

    COMING_UP                GOING_DOWN

        ^                        |
        +--------- DOWN <--------+


DOWN:
    The CPU or cluster is not coherent, and is either powered off or
    suspended, or is ready to be powered off or suspended.

COMING_UP:
    The CPU or cluster has committed to moving to the UP state.
    It may be part way through the process of initialisation and
    enabling coherency.

UP:
    The CPU or cluster is active and coherent at the hardware
    level. A CPU in this state is not necessarily being used
    actively by the kernel.

GOING_DOWN:
    The CPU or cluster has committed to moving to the DOWN
    state. It may be part way through the process of teardown and
    coherency exit.


Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state. The cluster-
level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a `CPU_` prefix for the CPU states,
and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.


CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU". CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

    - CPU_DOWN
    - CPU_COMING_UP
    - CPU_UP
    - CPU_GOING_DOWN

::

            cluster setup and
           CPU setup complete          policy decision
          +-----------> CPU_UP ------------+
          |                                v

    CPU_COMING_UP                   CPU_GOING_DOWN

          ^                                |
          +----------- CPU_DOWN <----------+
           policy decision           CPU teardown complete
          or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the CPU can transition to the
next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:
    A CPU reaches the CPU_DOWN state when it is ready for
    power-down. On reaching this state, the CPU will typically
    power itself down or suspend itself, via a WFI instruction or a
    firmware call.

    Next state:
        CPU_COMING_UP
    Conditions:
        none

    Trigger events:
        a) an explicit hardware power-up operation, resulting
           from a policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CPU_COMING_UP:
    A CPU cannot start participating in hardware coherency until the
    cluster is set up and coherent. If the cluster is not ready,
    then the CPU will wait in the CPU_COMING_UP state until the
    cluster has been set up.

    Next state:
        CPU_UP
    Conditions:
        The CPU's parent cluster must be in CLUSTER_UP.
    Trigger events:
        Transition of the parent cluster to CLUSTER_UP.

    Refer to the "Cluster state" section for a description of the
    CLUSTER_UP state.


CPU_UP:
    When a CPU reaches the CPU_UP state, it is safe for the CPU to
    start participating in local coherency.

    This is done by jumping to the kernel's CPU resume code.

    Note that the definition of this state is slightly different
    from the basic model definition: CPU_UP does not mean that the
    CPU is coherent yet, but it does mean that it is safe to resume
    the kernel.
    The kernel handles the rest of the resume
    procedure, so the remaining steps are not visible as part of the
    race avoidance algorithm.

    The CPU remains in this state until an explicit policy decision
    is made to shut down or suspend the CPU.

    Next state:
        CPU_GOING_DOWN
    Conditions:
        none
    Trigger events:
        explicit policy decision


CPU_GOING_DOWN:
    While in this state, the CPU exits coherency, including any
    operations required to achieve this (such as cleaning data
    caches).

    Next state:
        CPU_DOWN
    Conditions:
        local CPU teardown complete
    Trigger events:
        (spontaneous)


Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time. This has some implications. In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down. The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster. For this
reason, the cluster state is split into two parts:

    "cluster" state: The global state of the cluster; or the state
    on the outbound side:

        - CLUSTER_DOWN
        - CLUSTER_UP
        - CLUSTER_GOING_DOWN

    "inbound" state: The state of the cluster on the inbound side.

        - INBOUND_NOT_COMING_UP
        - INBOUND_COMING_UP


    The different pairings of these states result in six possible
    states for the cluster as a whole::

                                  CLUSTER_UP
                +==========> INBOUND_NOT_COMING_UP -------------+
                #                                               |
                                                                |
           CLUSTER_UP     <----+                                |
        INBOUND_COMING_UP      |                                v

                ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
                #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP

          CLUSTER_DOWN         |                                |
        INBOUND_COMING_UP <----+                                |
                                                                |
                ^                                               |
                +===========     CLUSTER_DOWN      <------------+
                             INBOUND_NOT_COMING_UP

    Transitions -----> can only be made by the outbound CPU, and
    only involve changes to the "cluster" state.

    Transitions ===##> can only be made by the inbound CPU, and only
    involve changes to the "inbound" state, except where there is no
    further transition possible on the outbound side (i.e., the
    outbound CPU has put the cluster into the CLUSTER_DOWN state).

    The race avoidance algorithm does not provide a way to determine
    which exact CPUs within the cluster play these roles. This must
    be decided in advance by some other means. Refer to the section
    "Last man and first man selection" for more explanation.


    CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
    cluster can actually be powered down.

    The parallelism of the inbound and outbound CPUs is observed by
    the existence of two different paths from CLUSTER_GOING_DOWN/
    INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
    model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
    COMING_UP in the basic model). The second path avoids cluster
    teardown completely.

    CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
    model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
    is trivial and merely resets the state machine ready for the
    next cycle.

    Details of the allowable transitions follow.
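
The six combined states and the ownership rule for the two halves of the
cluster state can be written down as a small table. The following is a
minimal Python sketch for illustration only; it is not the kernel
implementation. The names ``Cluster``, ``Inbound``, ``TRANSITIONS``,
``follows_ownership_rule`` and ``can_power_off`` are hypothetical, and the
table is transcribed from the state diagram in this section.

```python
from enum import Enum


class Cluster(Enum):
    DOWN = "CLUSTER_DOWN"
    UP = "CLUSTER_UP"
    GOING_DOWN = "CLUSTER_GOING_DOWN"


class Inbound(Enum):
    NOT_COMING_UP = "INBOUND_NOT_COMING_UP"
    COMING_UP = "INBOUND_COMING_UP"


# (cluster, inbound) -> {next (cluster, inbound): side making the transition}
# Hypothetical table, transcribed from the state diagram.
TRANSITIONS = {
    (Cluster.DOWN, Inbound.NOT_COMING_UP): {
        (Cluster.DOWN, Inbound.COMING_UP): "inbound",
    },
    (Cluster.DOWN, Inbound.COMING_UP): {
        (Cluster.UP, Inbound.COMING_UP): "inbound",
    },
    (Cluster.UP, Inbound.COMING_UP): {
        (Cluster.UP, Inbound.NOT_COMING_UP): "inbound",
    },
    (Cluster.UP, Inbound.NOT_COMING_UP): {
        (Cluster.GOING_DOWN, Inbound.NOT_COMING_UP): "outbound",
    },
    (Cluster.GOING_DOWN, Inbound.NOT_COMING_UP): {
        (Cluster.DOWN, Inbound.NOT_COMING_UP): "outbound",
        (Cluster.GOING_DOWN, Inbound.COMING_UP): "inbound",
    },
    (Cluster.GOING_DOWN, Inbound.COMING_UP): {
        (Cluster.UP, Inbound.COMING_UP): "outbound",
        (Cluster.DOWN, Inbound.COMING_UP): "outbound",
    },
}


def follows_ownership_rule(src, dst, side):
    """Check one transition against the rule for who may change what."""
    (c0, i0), (c1, i1) = src, dst
    if side == "outbound":
        # The outbound CPU only ever changes the "cluster" half.
        return i0 == i1 and c0 != c1
    if c0 == c1:
        # The inbound CPU normally only changes the "inbound" half.
        return i0 != i1
    # Exception: once the outbound side has reached CLUSTER_DOWN, the
    # inbound CPU may change the "cluster" half itself.
    return c0 is Cluster.DOWN and i0 == i1


def can_power_off(state):
    """Only CLUSTER_DOWN/INBOUND_NOT_COMING_UP permits actual power-down."""
    return state == (Cluster.DOWN, Inbound.NOT_COMING_UP)
```

Every entry in the table satisfies the rule, which is one way to see that
the two sides never write the same half of the state concurrently except
in the one case where the outbound side has no further transition to make.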

    The next state in each case is notated

        <cluster state>/<inbound state> (<transitioner>)

    where the <transitioner> is the side on which the transition
    can occur; either the inbound or the outbound side.


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
    Next state:
        CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
    Conditions:
        none

    Trigger events:
        a) an explicit hardware power-up operation, resulting
           from a policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CLUSTER_DOWN/INBOUND_COMING_UP:

    In this state, an inbound CPU sets up the cluster, including
    enabling of hardware coherency at the cluster level and any
    other operations (such as cache invalidation) which are required
    in order to achieve this.

    The purpose of this state is to do sufficient cluster-level
    setup to enable other CPUs in the cluster to enter coherency
    safely.

    Next state:
        CLUSTER_UP/INBOUND_COMING_UP (inbound)
    Conditions:
        cluster-level setup and hardware coherency complete
    Trigger events:
        (spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

    Cluster-level setup is complete and hardware coherency is
    enabled for the cluster. Other CPUs in the cluster can safely
    enter coherency.

    This is a transient state, leading immediately to
    CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster
    should treat these two states as equivalent.

    Next state:
        CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
    Conditions:
        none
    Trigger events:
        (spontaneous)


CLUSTER_UP/INBOUND_NOT_COMING_UP:

    Cluster-level setup is complete and hardware coherency is
    enabled for the cluster. Other CPUs in the cluster can safely
    enter coherency.

    The cluster will remain in this state until a policy decision is
    made to power the cluster down.

    Next state:
        CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
    Conditions:
        none
    Trigger events:
        policy decision to power down the cluster


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

    An outbound CPU is tearing the cluster down. The selected CPU
    must wait in this state until all CPUs in the cluster are in the
    CPU_DOWN state.

    When all CPUs are in the CPU_DOWN state, the cluster can be torn
    down, for example by cleaning data caches and exiting
    cluster-level coherency.

    To avoid wasteful unnecessary teardown operations, the outbound
    CPU should check the inbound cluster state for asynchronous
    transitions to INBOUND_COMING_UP. Alternatively, individual
    CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.


    Next states:

    CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
        Conditions:
            cluster torn down and ready to power off
        Trigger events:
            (spontaneous)

    CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
        Conditions:
            none

        Trigger events:
            a) an explicit hardware power-up operation,
               resulting from a policy decision on another
               CPU;

            b) a hardware event, such as an interrupt.


CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

    The cluster is (or was) being torn down, but another CPU has
    come online in the meantime and is trying to set up the cluster
    again.

    If the outbound CPU observes this state, it has two choices:

        a) back out of teardown, restoring the cluster to the
           CLUSTER_UP state;

        b) finish tearing the cluster down and put the cluster
           in the CLUSTER_DOWN state; the inbound CPU will
           set up the cluster again from there.

    Choice (a) permits the removal of some latency by avoiding
    unnecessary teardown and setup operations in situations where
    the cluster is not really going to be powered down.


    Next states:

    CLUSTER_UP/INBOUND_COMING_UP (outbound)
        Conditions:
            cluster-level setup and hardware
            coherency complete

        Trigger events:
            (spontaneous)

    CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
        Conditions:
            cluster torn down and ready to power off

        Trigger events:
            (spontaneous)


Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent. Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.
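
One possible way to pick a last man under an ordinary lock is to keep a
per-cluster count of CPUs that are still up and let the CPU which drops
the count to zero take the role. The sketch below illustrates that idea in
Python, with ``threading.Lock`` standing in for a spinlock and threads
standing in for CPUs. ``ClusterUseCount``, ``cpu_going_down`` and
``simulate`` are hypothetical names, and this is an assumption about how
such a selection could be done, not a description of the kernel code.

```python
import threading


class ClusterUseCount:
    """Hypothetical per-cluster count of CPUs that are still up."""

    def __init__(self, cpus_up):
        self._lock = threading.Lock()  # stands in for a spinlock
        self._cpus_up = cpus_up

    def cpu_going_down(self):
        """Record that the calling CPU is leaving.

        Returns True only for the caller that takes the count to zero,
        i.e. the last man for this cluster.
        """
        with self._lock:
            self._cpus_up -= 1
            return self._cpus_up == 0


def simulate(n_cpus):
    """Power down n_cpus concurrently and report who became last man."""
    cluster = ClusterUseCount(n_cpus)
    last_men = []

    def power_down(cpu):
        if cluster.cpu_going_down():
            last_men.append(cpu)

    threads = [
        threading.Thread(target=power_down, args=(cpu,))
        for cpu in range(n_cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return last_men
```

Because the decrement and the comparison happen under the same lock,
exactly one caller sees the count reach zero, whatever order the CPUs
arrive in. This only works while the CPUs are still coherent, which is why
the same approach cannot be used for first man selection.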


First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual exclusion
mechanism to do this arbitration. This mechanism is documented in
detail in vlocks.txt.


Features and Limitations
------------------------

Implementation:

    The current ARM-based implementation is split between
    arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
    arch/arm/common/mcpm_entry.c (everything else):

    __mcpm_cpu_going_down() signals the transition of a CPU to the
    CPU_GOING_DOWN state.

    __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
    state.

    A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
    low-level power-up code in mcpm_head.S. This could
    involve CPU-specific setup code, but in the current
    implementation it does not.

    __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
    handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
    and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
    the case of an aborted cluster power-down).

    These functions are more complex than the __mcpm_cpu_*()
    functions due to the extra inter-CPU coordination which
    is needed for safe transitions at the cluster level.

    A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
    the low-level power-up code in mcpm_head.S. This
    typically involves platform-specific setup code,
    provided by the platform-specific power_up_setup
    function registered via mcpm_sync_init.

Deep topologies:

    As currently described and implemented, the algorithm does not
    support CPU topologies involving more than two levels (i.e.,
    clusters of clusters are not supported). The algorithm could be
    extended by replicating the cluster-level states for the
    additional topological levels, and modifying the transition
    rules for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.