=================
Freezing of tasks
=================

(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

I. What is the freezing of tasks?
=================================

The freezing of tasks is a mechanism by which user space processes and some
kernel threads are controlled during hibernation or system-wide suspend (on some
architectures).

II. How does it work?
=====================

There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN
and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have
PF_NOFREEZE unset (all user space processes and some kernel threads) are
regarded as 'freezable' and treated in a special way before the system enters a
suspend state as well as before a hibernation image is created (in what follows
we only consider hibernation, but the description also applies to suspend).

Namely, as the first step of the hibernation procedure the function
freeze_processes() (defined in kernel/power/process.c) is called. A system-wide
variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate
whether the system is to undergo a freezing operation. And freeze_processes()
sets this variable. After this, it executes try_to_freeze_tasks() that sends a
fake signal to all user space processes, and wakes up all the kernel threads.
All freezable tasks must react to that by calling try_to_freeze(), which
results in a call to __refrigerator() (defined in kernel/freezer.c), which sets
the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes
it loop until PF_FROZEN is cleared for it. Then, we say that the task is
'frozen' and therefore the set of functions handling this mechanism is referred
to as 'the freezer' (these functions are defined in kernel/power/process.c,
kernel/freezer.c & include/linux/freezer.h). User space processes are generally
frozen before kernel threads.

__refrigerator() must not be called directly. Instead, use the
try_to_freeze() function (defined in include/linux/freezer.h), that checks
if the task is to be frozen and makes the task enter __refrigerator().

For user space processes try_to_freeze() is called automatically from the
signal-handling code, but the freezable kernel threads need to call it
explicitly in suitable places or use the wait_event_freezable() or
wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
that combine interruptible sleep with checking if the task is to be frozen and
calling try_to_freeze().
The main loop of a freezable kernel thread may look
like the following one::

    set_freezable();
    do {
        hub_events();
        wait_event_freezable(khubd_wait,
                !list_empty(&hub_event_list) ||
                kthread_should_stop());
    } while (!kthread_should_stop() || !list_empty(&hub_event_list));

(from drivers/usb/core/hub.c::hub_thread()).

If a freezable kernel thread fails to call try_to_freeze() after the freezer has
initiated a freezing operation, the freezing of tasks will fail and the entire
hibernation operation will be cancelled. For this reason, freezable kernel
threads must call try_to_freeze() somewhere or use one of the
wait_event_freezable() and wait_event_freezable_timeout() macros.

After the system memory state has been restored from a hibernation image and
devices have been reinitialized, the function thaw_processes() is called in
order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that
have been frozen leave __refrigerator() and continue running.
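
The freeze/thaw handshake described above can be illustrated with a small
user space model. This is only a sketch under simplifying assumptions, not
kernel code: it is single-threaded, a "round" stands in for the freezer's
timeout, and every identifier prefixed with ``model_`` or ``MODEL_`` is
hypothetical. It shows the two outcomes discussed in the text: a task that
reaches its try_to_freeze() call stops making progress until thawed, while a
freezable task that never reaches it makes the whole freezing operation fail.

```c
#include <stdbool.h>
#include <stddef.h>

/* User space model only; none of these names are kernel symbols. */
enum model_flags {
    MODEL_NOFREEZE = 1 << 0,    /* stands in for PF_NOFREEZE */
    MODEL_FROZEN   = 1 << 1,    /* stands in for PF_FROZEN */
};

struct model_task {
    unsigned int flags;
    bool cooperative;           /* does its loop ever reach try_to_freeze()? */
    unsigned long work_done;
};

/* Stands in for the system-wide "freezing in progress" indicator. */
static bool model_freezing;

/* What a well-behaved freezable task calls at a safe point in its loop. */
static bool model_try_to_freeze(struct model_task *t)
{
    if (!model_freezing || (t->flags & MODEL_NOFREEZE))
        return false;
    t->flags |= MODEL_FROZEN;
    return true;
}

/* One scheduling slice: a frozen task makes no progress at all. */
static void model_run_slice(struct model_task *t)
{
    if (t->flags & MODEL_FROZEN)
        return;
    if (t->cooperative && model_try_to_freeze(t))
        return;
    t->work_done++;
}

/* Clear the frozen flag of every task so that all of them run again. */
static void model_thaw_tasks(struct model_task *tasks, size_t n)
{
    model_freezing = false;
    for (size_t i = 0; i < n; i++)
        tasks[i].flags &= ~(unsigned int)MODEL_FROZEN;
}

/*
 * Ask all tasks to freeze. Returns true once every freezable task is
 * frozen; if that has not happened after max_rounds slices per task,
 * undo the partial freeze and report failure.
 */
static bool model_freeze_tasks(struct model_task *tasks, size_t n,
                               int max_rounds)
{
    model_freezing = true;
    for (int round = 0; round < max_rounds; round++) {
        size_t todo = 0;

        for (size_t i = 0; i < n; i++) {
            model_run_slice(&tasks[i]);
            if (!(tasks[i].flags & (MODEL_NOFREEZE | MODEL_FROZEN)))
                todo++;
        }
        if (todo == 0)
            return true;
    }
    model_thaw_tasks(tasks, n);
    return false;
}
```

In this model, calling model_freeze_tasks() on a set that contains a
non-cooperative freezable task returns false and leaves every task thawed,
which mirrors the cancelled hibernation described above.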


Rationale behind the functions dealing with freezing and thawing of tasks
-------------------------------------------------------------------------

freeze_processes():
  - freezes only userspace tasks

freeze_kernel_threads():
  - freezes all tasks (including kernel threads) because we can't freeze
    kernel threads without freezing userspace tasks

thaw_kernel_threads():
  - thaws only kernel threads; this is particularly useful if we need to do
    anything special in between thawing of kernel threads and thawing of
    userspace tasks, or if we want to postpone the thawing of userspace tasks

thaw_processes():
  - thaws all tasks (including kernel threads) because we can't thaw userspace
    tasks without thawing kernel threads


III. Which kernel threads are freezable?
========================================

Kernel threads are not freezable by default. However, a kernel thread may clear
PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
directly is not allowed). From this point it is regarded as freezable
and must call try_to_freeze() in a suitable place.

IV. Why do we do that?
======================

Generally speaking, there are several reasons to use the freezing of tasks:

1. The principal reason is to prevent filesystems from being damaged after
   hibernation. At the moment we have no simple means of checkpointing
   filesystems, so if there are any modifications made to filesystem data and/or
   metadata on disks, we cannot bring them back to the state from before the
   modifications. At the same time each hibernation image contains some
   filesystem-related information that must be consistent with the state of the
   on-disk data and metadata after the system memory state has been restored
   from the image (otherwise the filesystems will be damaged in a nasty way,
   usually making them almost impossible to repair). We therefore freeze
   tasks that might cause the on-disk filesystems' data and metadata to be
   modified after the hibernation image has been created and before the
   system is finally powered off. The majority of these are user space
   processes, but if any of the kernel threads may cause something like this
   to happen, they have to be freezable.

2. Next, to create the hibernation image we need to free a sufficient amount of
   memory (approximately 50% of available RAM) and we need to do that before
   devices are deactivated, because we generally need them for swapping out.
   Then, after the memory for the image has been freed, we don't want tasks
   to allocate additional memory and we prevent them from doing that by
   freezing them earlier.
   [Of course, this also means that device drivers
   should not allocate substantial amounts of memory from their .suspend()
   callbacks before hibernation, but this is a separate issue.]

3. The third reason is to prevent user space processes and some kernel threads
   from interfering with the suspending and resuming of devices. A user space
   process running on a second CPU while we are suspending devices may, for
   example, be troublesome and without the freezing of tasks we would need some
   safeguards against race conditions that might occur in such a case.

Although Linus Torvalds doesn't like the freezing of tasks, he said this in one
of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608):

"RJW:> Why we freeze tasks at all or why we freeze kernel threads?

Linus: In many ways, 'at all'.

I **do** realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
ways to do so.

So in practice, the 'at all' may become a 'why freeze kernel threads?' and
freezing user threads I don't find really objectionable."

Still, there are kernel threads that may want to be freezable.
For example, if
a kernel thread that belongs to a device driver accesses the device directly, it
in principle needs to know when the device is suspended, so that it doesn't try
to access it at that time. However, if the kernel thread is freezable, it will
be frozen before the driver's .suspend() callback is executed and it will be
thawed after the driver's .resume() callback has run, so it won't be accessing
the device while it's suspended.

4. Another reason for freezing tasks is to prevent user space processes from
   realizing that a hibernation (or suspend) operation is taking place. Ideally,
   user space processes should not notice that such a system-wide operation has
   occurred and should continue running without any problems after the restore
   (or resume from suspend). Unfortunately, in the most general case this
   is quite difficult to achieve without the freezing of tasks. Consider,
   for example, a process that depends on all CPUs being online while it's
   running. Since we need to disable nonboot CPUs during the hibernation,
   if this process is not frozen, it may notice that the number of CPUs has
   changed and may start to work incorrectly because of that.

V. Are there any problems related to the freezing of tasks?
===========================================================

Yes, there are.

First of all, the freezing of kernel threads may be tricky if they depend on
one another.
For example, if kernel thread A waits for a completion (in the
TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B
and B is frozen in the meantime, then A will be blocked until B is thawed, which
may be undesirable. That's why kernel threads are not freezable by default.

Second, there are the following two problems related to the freezing of user
space processes:

1. Putting processes into an uninterruptible sleep distorts the load average.
2. Now that we have FUSE, plus the framework for doing device drivers in
   userspace, it gets even more complicated because some userspace processes are
   now doing the sorts of things that kernel threads do
   (https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html).

Problem 1 seems to be fixable, although it hasn't been fixed so far. The
other one is more serious, but it seems that we can work around it by using
hibernation (and suspend) notifiers (in that case, though, we won't be able to
avoid the realization by the user space processes that the hibernation is taking
place).

There are also problems that the freezing of tasks tends to expose, although
they are not directly related to it. For example, if request_firmware() is
called from a device driver's .resume() routine, it will time out and eventually
fail, because the user space process that should respond to the request is
frozen at this point. So, seemingly, the failure is due to the freezing of tasks.
Suppose, however, that the firmware file is located on a filesystem accessible
only through another device that hasn't been resumed yet. In that case,
request_firmware() will fail regardless of whether or not the freezing of tasks
is used. Consequently, the problem is not really related to the freezing of
tasks, since it generally exists anyway.

A driver must have all the firmware it may need in RAM before suspend() is
called. If keeping it there is not practical, for example due to its size, it
must be requested early enough using the suspend notifier API described in
Documentation/driver-api/pm/notifiers.rst.

VI. Are there any precautions to be taken to prevent freezing failures?
=======================================================================

Yes, there are.

First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a
piece of code from system-wide sleep such as suspend/hibernation is not
encouraged. If possible, that piece of code must instead hook onto the
suspend/hibernation notifiers to achieve mutual exclusion. Look at the
CPU-Hotplug code (kernel/cpu.c) for an example.
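
The notifier-based exclusion recommended above can be pictured with the
following user space model. It is a deliberately simplified sketch, not the
kernel API: all ``model_`` and ``MODEL_`` names are hypothetical, there is no
locking because the model is single-threaded, and the two events merely stand
in for the "prepare" and "post" notifications that a real notifier callback
would receive. The idea it illustrates is that the protected code never
blocks on a lock; it is told that a transition is coming and refuses to start
new work until it is told the transition is over.

```c
#include <stdbool.h>

/* User space model only; none of these names are kernel symbols. */
enum model_pm_event {
    MODEL_PM_PREPARE,   /* a system-wide sleep transition is about to start */
    MODEL_PM_POST,      /* the transition has finished (or was aborted) */
};

struct model_subsys {
    bool sleep_pending; /* true between the PREPARE and POST events */
    int ops_started;
};

/* Notifier callback of the subsystem: only records the transition state. */
static void model_pm_notify(struct model_subsys *s, enum model_pm_event ev)
{
    s->sleep_pending = (ev == MODEL_PM_PREPARE);
}

/*
 * Entry point of the code that must not run during suspend/hibernation.
 * Instead of sleeping on a mutex, it fails fast and lets the caller retry.
 */
static bool model_start_op(struct model_subsys *s)
{
    if (s->sleep_pending)
        return false;
    s->ops_started++;
    return true;
}
```

Because model_start_op() returns immediately rather than sleeping
uninterruptibly, a task calling it during a transition stays freezable.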

However, if that is not feasible, and grabbing 'system_transition_mutex' is
deemed necessary, it is strongly discouraged to directly call
mutex_[un]lock(&system_transition_mutex) since that could lead to freezing
failures, because if the suspend/hibernate code successfully acquired the
'system_transition_mutex' lock, and hence that other entity failed to acquire
the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE state. As a
consequence, the freezer would not be able to freeze that task, leading to
freezing failure.

However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
since they ask the freezer to skip freezing this task, since it is anyway
"frozen enough" as it is blocked on 'system_transition_mutex', which will be
released only after the entire suspend/hibernation sequence is complete. So, to
summarize, use [un]lock_system_sleep() instead of directly using
mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures.

VII. Miscellaneous
==================

/sys/power/pm_freeze_timeout controls the maximum time, in milliseconds, that
freezing all user space processes or all freezable kernel threads may take.
The default value is 20000, and any unsigned integer value is accepted.
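
As an illustration, the timeout can be read from user space with a few lines
of C. This is a minimal sketch: the helper name read_freeze_timeout_ms() is
made up for this example, and the only thing taken from the text above is the
sysfs path and the fact that the file holds an unsigned value in milliseconds.

```c
#include <stdio.h>

/*
 * Read an unsigned millisecond value from a sysfs-style text file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 * (Hypothetical helper, written for this example only.)
 */
static int read_freeze_timeout_ms(const char *path, unsigned int *ms)
{
    FILE *fp = fopen(path, "r");
    int parsed;

    if (!fp)
        return -1;
    parsed = (fscanf(fp, "%u", ms) == 1);
    fclose(fp);
    return parsed ? 0 : -1;
}
```

A caller would pass "/sys/power/pm_freeze_timeout" as the path. On a system
where that file is missing (for example, a kernel built without the relevant
power management support), the helper simply returns -1.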