====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
======================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focusses only on the
interactions involving the freezer and CPU hotplug and also tries to explain
the locking involved. It outlines the notifications involved as well.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this::

  |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
  |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                               Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                   notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                              freeze_secondary_cpus()
                                   /* start */
                                        |
                                        v
                            Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by increasing cpu_hotplug_disabled
                                        |
                                        v
                            Release cpu_add_remove_lock
                                        |
                                        v
                       /* freeze_secondary_cpus() complete */
                                        |
                                        v
                                   Do suspend


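The locking and bookkeeping in the freeze_secondary_cpus() part of the diagram
can be modelled in plain userspace C. The sketch below is only an illustration
under stated assumptions: the ``model_`` function names, the fixed ``NR_CPUS``
value, and the boolean arrays and flag standing in for the CPU masks and for
cpu_add_remove_lock are all hypothetical and are not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4		/* hypothetical CPU count for the model */

bool cpu_online[NR_CPUS];	/* stand-in for the online CPU mask */
bool frozen_cpus[NR_CPUS];	/* stand-in for the frozen_cpus mask */
int cpu_hotplug_disabled;	/* > 0: regular CPU hotplug is refused */
bool cpu_add_remove_locked;	/* stand-in for cpu_add_remove_lock */

/* Common code: take one CPU down; tasks_frozen is 1 on the suspend path. */
int model_cpu_down(int cpu, int tasks_frozen)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(cpu_add_remove_locked);	/* callers hold cpu_add_remove_lock */
	(void)tasks_frozen;		/* callbacks would branch on this flag */
	cpu_online[cpu] = false;
	return 0;
}

/* Offline every online CPU except 'primary' and remember which ones. */
int model_freeze_secondary_cpus(int primary)
{
	int cpu, error = 0;

	cpu_add_remove_locked = true;		/* acquire cpu_add_remove_lock */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {	/* CURRENTLY online CPUs only */
		if (cpu == primary || !cpu_online[cpu])
			continue;
		error = model_cpu_down(cpu, 1);
		if (error)
			break;
		frozen_cpus[cpu] = true;	/* note it down for resume */
	}
	cpu_hotplug_disabled++;			/* disable regular CPU hotplug */
	cpu_add_remove_locked = false;		/* release cpu_add_remove_lock */
	return error;
}
```

CPUs that were already offline before suspend are never added to the
frozen_cpus stand-in, so the resume path will not bring them online by mistake.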

Resume follows the same pattern in reverse. The counterparts, in the order of
execution during resume, are:

* thaw_secondary_cpus(), which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* Thaw tasks
* Send PM_POST_SUSPEND notifications
* Release the system_transition_mutex lock

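The thaw_secondary_cpus() steps listed above can be modelled in the same way.
This is a minimal userspace sketch, not the kernel's implementation: the
``model_`` names, ``NR_CPUS``, and the boolean stand-ins for the masks and the
lock are hypothetical, and clearing each frozen_cpus entry once its CPU has
been handled is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4		/* hypothetical CPU count for the model */

bool cpu_online[NR_CPUS];	/* stand-in for the online CPU mask */
bool frozen_cpus[NR_CPUS];	/* CPUs that were offlined for suspend */
int cpu_hotplug_disabled = 1;	/* left at 1 by the suspend path */
bool cpu_add_remove_locked;	/* stand-in for cpu_add_remove_lock */

/* Common code: bring one CPU up; tasks_frozen is 1 on the resume path. */
int model_cpu_up(int cpu, int tasks_frozen)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(cpu_add_remove_locked);	/* callers hold cpu_add_remove_lock */
	(void)tasks_frozen;		/* callbacks would branch on this flag */
	cpu_online[cpu] = true;
	return 0;
}

/* Bring back the CPUs noted during suspend; returns how many came up. */
int model_thaw_secondary_cpus(void)
{
	int cpu, brought_up = 0;

	cpu_add_remove_locked = true;		/* acquire cpu_add_remove_lock */
	cpu_hotplug_disabled--;			/* re-enable regular CPU hotplug */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!frozen_cpus[cpu])
			continue;
		if (model_cpu_up(cpu, 1) == 0)
			brought_up++;
		frozen_cpus[cpu] = false;	/* assumption: mask ends up empty */
	}
	cpu_add_remove_locked = false;		/* release cpu_add_remove_lock */
	return brought_up;
}
```

Only the CPUs recorded in the frozen_cpus stand-in are onlined, which mirrors
the "for all those cpus in the frozen_cpus mask" step.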

Note that the system_transition_mutex lock is acquired at the very beginning,
when suspend is just starting, and is released only after the entire cycle
(i.e., suspend + resume) is complete.

::



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                    cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                          If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
             ======>                _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |           are sent when notable events occur,
             ======>    by running all registered callbacks.
                                        |
                                        |
                                        v
                          Release cpu_add_remove_lock
                               [That's it!, for
                              regular CPU hotplug]


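The regular hotplug path can be sketched in the same userspace style. The
``model_`` names, ``NR_CPUS``, and the ``last_tasks_frozen`` recorder are
hypothetical. Returning ``-EBUSY`` for the "return gracefully" branch is an
assumption of this sketch, since the diagram does not name an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 4		/* hypothetical CPU count for the model */

bool cpu_online[NR_CPUS] = { true, true, true, true };
int cpu_hotplug_disabled;	/* raised by the suspend path */
bool cpu_add_remove_locked;	/* stand-in for cpu_add_remove_lock */
int last_tasks_frozen = -1;	/* flag value last seen by the common code */

/* Common code shared with the suspend path (stand-in for _cpu_down()). */
int model_cpu_down_common(int cpu, int tasks_frozen)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(cpu_add_remove_locked);	/* callers hold cpu_add_remove_lock */
	last_tasks_frozen = tasks_frozen;
	cpu_online[cpu] = false;
	return 0;
}

/* Entry point reached by writing 0 to the CPU's 'online' sysfs file. */
int model_cpu_down(int cpu)
{
	int err;

	cpu_add_remove_locked = true;		/* acquire cpu_add_remove_lock */
	if (cpu_hotplug_disabled > 0)
		err = -EBUSY;			/* assumed "graceful" return */
	else
		err = model_cpu_down_common(cpu, 0);	/* tasks_frozen = 0 */
	cpu_add_remove_locked = false;		/* release cpu_add_remove_lock */
	return err;
}
```

While the counter is raised, for example during a suspend in progress, the
request is refused and the CPU stays online. Otherwise the common code runs
with tasks_frozen set to 0.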

So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions,
in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
argument. But during suspend, since the tasks are already frozen by the time
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]


Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](),
  freeze_secondary_cpus(), thaw_secondary_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To take x86 as an example, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   Note, however, that the kernel does not maintain a single common microcode
   image for all the CPUs, so that it can handle case 'b' described below.


b. When some of the CPUs are different from the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then requests the appropriate microcode image for that CPU from
   userspace, which is subsequently applied.

   For example, on x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu(). When that function finds that the kernel doesn't
   have a valid microcode image, it calls microcode_init_cpu() instead of
   microcode_resume_cpu(). This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put into the lowest C-states
   possible. Hence, in such a case, it is not really necessary to re-apply
   microcode when the CPUs are brought back online, since they wouldn't have
   lost the image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel, however, during a CPU offline operation
   as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set),
   the existing copy of the microcode image in the kernel is not freed up.
   And during the CPU online operations (during resume/restore), since the
   kernel finds that it already has copies of the microcode images for all the
   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
   type/model and the need for validating whether the microcode revisions are
   right for the CPUs or not (due to the above assumption that physical CPU
   hotplug will not be done in-between suspend/resume or hibernate/restore
   cycles).

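The difference between cases 'c' and 'd' comes down to whether the cached
image survives the offline operation. The sketch below models that decision in
userspace C. All names in it (``ucode_cache``, ``model_ucode_cpu_offline()``,
``model_ucode_cpu_online()``, ``NR_CPUS``) are hypothetical and only loosely
mirror the init/resume split described above. It does not load any microcode.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for the model */

enum ucode_action {
	UCODE_RESUME,	/* re-apply the cached image, no rediscovery */
	UCODE_INIT,	/* rediscover CPU type/model, ask userspace */
};

/* Per-CPU cache; 'valid' says whether the kernel still holds an image. */
struct ucode_cache {
	bool valid;
};

struct ucode_cache ucode[NR_CPUS];

/* Offline callback: drop the image unless tasks are frozen (suspend path). */
void model_ucode_cpu_offline(int cpu, bool tasks_frozen)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!tasks_frozen)
		ucode[cpu].valid = false;	/* regular hotplug: free the copy */
}

/* Online callback: reuse the cached image or start from scratch. */
enum ucode_action model_ucode_cpu_online(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (ucode[cpu].valid)
		return UCODE_RESUME;

	/* No image cached: discovery plus a request to userspace go here. */
	ucode[cpu].valid = true;
	return UCODE_INIT;
}
```

An offline with tasks frozen followed by an online takes the resume branch,
whereas a regular offline followed by an online takes the init branch.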

III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then the freezer gets to work and freezes userspace.
    * If cpu online has not yet completed the microcode update by now,
      it will now start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to the wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.