====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
=====================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focuses only on the
interactions involving the freezer and CPU hotplug. It also tries to explain
the locking involved and outlines the notifications involved.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

At a high level, the suspend-resume cycle goes like this::

    |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
    |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                      Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                  notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                             freeze_secondary_cpus()
                                   /* start */
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                            ----------
                                        v                                     | L
                             ======> _cpu_down()                              |
                            |              [This takes cpuhotplug.lock        |
                  Common    |               before taking down the CPU        |
                   code     |               and releases it when done]        | O
                            |            While it is at it, notifications     |
                            |            are sent when notable events occur,  |
                             ======>     by running all registered callbacks. |
                                        |                                     | O
                                        |                                     |
                                        |                                     |
                                        v                                     |
                             Note down these cpus in                          | P
                                frozen_cpus mask                     ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                       by increasing cpu_hotplug_disabled
                                        |
                                        v
                           Release cpu_add_remove_lock
                                        |
                                        v
                     /* freeze_secondary_cpus() complete */
                                        |
                                        v
                                   Do suspend



Resume mirrors this sequence, with the counterparts being (in the order of
execution during resume):

* thaw_secondary_cpus() which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* Thaw tasks
* Send PM_POST_SUSPEND notifications
* Release the system_transition_mutex lock


Note that the system_transition_mutex lock is acquired at the very beginning,
when we are just starting out to suspend, and is released only after the
entire cycle is complete (i.e., suspend + resume).
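
The bookkeeping in the suspend call path above can be modelled in a few lines
of user-space C. This is only a sketch under stated assumptions, not the code
in kernel/cpu.c: every ``model_*`` name and ``MODEL_NR_CPUS`` is hypothetical,
CPU 0 is assumed to be the boot CPU, the locks appear only as comments, and
``-EBUSY`` is an assumed stand-in for "regular hotplug is currently refused".
The variables ``frozen_cpus`` and ``cpu_hotplug_disabled`` are named after the
ones in the diagram.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_NR_CPUS 4u                  /* hypothetical CPU count */

static unsigned int online_mask = 0xfu;   /* bit n set: CPU n is online */
static unsigned int frozen_cpus;          /* CPUs taken down by suspend */
static int cpu_hotplug_disabled;          /* > 0: regular hotplug refused */

/* Stand-in for _cpu_down(), the common code both paths converge on. */
static int model_cpu_down_common(unsigned int cpu, int tasks_frozen)
{
    (void)tasks_frozen;                   /* real callbacks would inspect this */
    assert(cpu < MODEL_NR_CPUS);
    online_mask &= ~(1u << cpu);
    return 0;
}

/* Stand-in for _cpu_up(). */
static int model_cpu_up_common(unsigned int cpu, int tasks_frozen)
{
    (void)tasks_frozen;
    assert(cpu < MODEL_NR_CPUS);
    online_mask |= 1u << cpu;
    return 0;
}

/* Suspend side: offline every currently online CPU except the boot CPU. */
int model_freeze_secondary_cpus(void)
{
    unsigned int cpu;
    int err;

    /* cpu_add_remove_lock would be acquired here */
    frozen_cpus = 0;
    for (cpu = 1; cpu < MODEL_NR_CPUS; cpu++) {
        if (!(online_mask & (1u << cpu)))
            continue;                     /* skip CPUs that are already offline */
        err = model_cpu_down_common(cpu, 1);   /* tasks_frozen = 1 */
        if (err)
            return err;
        frozen_cpus |= 1u << cpu;         /* note it down for resume */
    }
    cpu_hotplug_disabled++;               /* disable regular CPU hotplug */
    /* cpu_add_remove_lock would be released here */
    return 0;
}

/* Resume side: re-enable regular hotplug, then online the noted CPUs. */
void model_thaw_secondary_cpus(void)
{
    unsigned int cpu;

    cpu_hotplug_disabled--;
    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if (frozen_cpus & (1u << cpu))
            model_cpu_up_common(cpu, 1);  /* tasks are still frozen */
    frozen_cpus = 0;
}

/* A regular hotplug request, refused while suspend has hotplug disabled. */
int model_cpu_down(unsigned int cpu)
{
    if (cpu_hotplug_disabled > 0)
        return -EBUSY;
    return model_cpu_down_common(cpu, 0); /* tasks_frozen = 0 */
}
```

In this model, calling ``model_freeze_secondary_cpus()`` leaves only CPU 0
online and makes ``model_cpu_down()`` fail until
``model_thaw_secondary_cpus()`` has run, which is the effect the
``cpu_hotplug_disabled`` counter is meant to have.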

The regular CPU hotplug call path, for comparison::

                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                   cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                           If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
                             ======> _cpu_down()
                            |              [This takes cpuhotplug.lock
                  Common    |               before taking down the CPU
                   code     |               and releases it when done]
                            |            While it is at it, notifications
                            |            are sent when notable events occur,
                             ======>     by running all registered callbacks.
                                        |
                                        |
                                        v
                           Release cpu_add_remove_lock
                                 [That's it for
                              regular CPU hotplug]



So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions:
during regular CPU hotplug, 0 is passed for the 'tasks_frozen' argument. But
during suspend, since the tasks are already frozen by the time the non-boot
CPUs are offlined or onlined, the _cpu_*() functions are called with the
'tasks_frozen' argument set to 1.
[See Section III for some known issues regarding this.]


Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](),
  [freeze|thaw]_secondary_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To give an example of x86, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different from the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then requests the appropriate microcode image for that CPU
   from userspace, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu(), which in this case calls microcode_init_cpu()
   instead of microcode_resume_cpu(), because it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put into the lowest C-states
   possible. Hence, in such a case, it is not really necessary to re-apply
   microcode when the CPUs are brought back online, since they wouldn't have
   lost the image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel, however, during a CPU offline
   operation that is part of the suspend/hibernate cycle (cpuhp_tasks_frozen
   is set), the existing copy of the microcode image in the kernel is not
   freed. And during the CPU online operations (during resume/restore), since
   the kernel finds that it already has copies of the microcode images for all
   the CPUs, it just applies them to the CPUs. This avoids any re-discovery of
   CPU type/model and any need to validate whether the microcode revisions are
   right for the CPUs (due to the above assumption that physical CPU hotplug
   will not be done in-between suspend/resume or hibernate/restore cycles).


III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes. They are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

   * A regular cpu online operation continues its journey from userspace
     into the kernel, since the freezing has not yet begun.
   * Then the freezer gets to work and freezes userspace.
   * If cpu online has not yet completed the microcode update by now,
     it will now start waiting on the frozen userspace in the
     TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
   * Now the freezer continues and tries to freeze the remaining tasks. But
     due to the wait mentioned above, the freezer won't be able to freeze
     the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
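
To make problem 1 concrete, the following user-space C sketch models how a
hotplug callback that branches on the frozen state behaves, using the
microcode handling from Section II as the example. It is a simplified model
under stated assumptions, not the microcode driver: all ``model_*`` and
``UCODE_*`` names are hypothetical, the cached image is reduced to a boolean,
and the request to userspace is only represented by a return value.

```c
#include <assert.h>
#include <stdbool.h>

/* What the (modelled) online callback decides to do. */
enum ucode_action {
    UCODE_RESUME_CACHED,        /* re-apply the image the kernel still holds */
    UCODE_INIT_FROM_USERSPACE   /* rediscover the CPU and ask userspace */
};

struct model_cpu {
    bool have_image;            /* kernel holds a microcode image for this CPU */
};

/* Offline callback: the image is freed only when tasks are NOT frozen. */
void model_ucode_offline(struct model_cpu *c, bool tasks_frozen)
{
    if (!tasks_frozen)
        c->have_image = false;  /* regular hotplug: CPU might be replaced */
}

/* Would bringing this CPU online have to wait on a frozen userspace? */
bool model_online_would_block(const struct model_cpu *c, bool userspace_frozen)
{
    return !c->have_image && userspace_frozen;
}

/* Online callback: reuse the cached image if there is one. */
enum ucode_action model_ucode_online(struct model_cpu *c)
{
    if (c->have_image)
        return UCODE_RESUME_CACHED;
    c->have_image = true;       /* pretend userspace delivered the image */
    return UCODE_INIT_FROM_USERSPACE;
}
```

In this model, an offline with ``tasks_frozen`` set keeps the image, so the
next online takes the cached path. An offline that passes 0 while userspace is
in fact frozen drops the image, and ``model_online_would_block()`` then
reports that the following online would need the frozen userspace, which is
the kind of wrong code path problem 1 warns about.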