==============
Cgroup Freezer
==============

The cgroup freezer is useful to batch job management systems which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.

The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.

Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored, it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells::

    $ echo $$
    16644
    $ bash
    $ echo $$
    16690

    From a second, unrelated bash shell:
    $ kill -SIGSTOP 16690
    $ kill -SIGCONT 16690

    <at this point 16690 exits and causes 16644 to exit too>

This happens because bash can observe both signals and choose how it
responds to them.

Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.

In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.

The cgroup freezer is hierarchical. Freezing a cgroup freezes all
tasks belonging to the cgroup and all its descendant cgroups. Each
cgroup has its own state (self-state) and the state inherited from the
parent (parent-state). Iff both states are THAWED, the cgroup is
THAWED.
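
The rule for combining the two states can be modelled in a few lines of
shell. This is only an illustrative sketch of the rule stated above, not
kernel code: ``effective_state`` is a hypothetical helper name, and it
assumes each state is passed in as the string "THAWED" or "FREEZING".

```shell
# Illustrative model only; effective_state is a hypothetical helper,
# not a kernel interface.
# Usage: effective_state SELF_STATE PARENT_STATE
# Each argument is assumed to be "THAWED" or "FREEZING".
effective_state() {
    if [ "$1" = "THAWED" ] && [ "$2" = "THAWED" ]; then
        # Both the self-state and the parent-state are THAWED.
        echo "THAWED"
    else
        # Either state freezing makes the cgroup freezing.
        echo "FREEZING"
    fi
}
```

For example, ``effective_state THAWED FREEZING`` prints ``FREEZING``: a
cgroup whose own state is THAWED is still freezing while its parent is.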

The following cgroupfs files are created by cgroup freezer.

* freezer.state: Read-write.

  When read, returns the effective state of the cgroup - "THAWED",
  "FREEZING" or "FROZEN". This is the combined self and parent-states.
  If any is freezing, the cgroup is freezing (FREEZING or FROZEN).

  FREEZING cgroup transitions into FROZEN state when all tasks
  belonging to the cgroup and its descendants become frozen. Note that
  a cgroup reverts to FREEZING from FROZEN after a new task is added
  to the cgroup or one of its descendant cgroups until the new task is
  frozen.

  When written, sets the self-state of the cgroup. Two values are
  allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
  if not already freezing, enters FREEZING state along with all its
  descendant cgroups.

  If THAWED is written, the self-state of the cgroup is changed to
  THAWED. Note that the effective state may not change to THAWED if
  the parent-state is still freezing. If a cgroup's effective state
  becomes THAWED, all its descendants which are freezing because of
  the cgroup also leave the freezing state.

* freezer.self_freezing: Read only.

  Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
  This value is 1 iff the last write to freezer.state was "FROZEN".

* freezer.parent_freezing: Read only.

  Shows the parent-state. 0 if none of the cgroup's ancestors is
  frozen; otherwise, 1.

The root cgroup is non-freezable and the above interface files don't
exist for it.

* Examples of usage::

   # mkdir /sys/fs/cgroup/freezer
   # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
   # mkdir /sys/fs/cgroup/freezer/0
   # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks

to get status of the freezer subsystem::

   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

to freeze all tasks in the container::

   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FREEZING
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FROZEN

to unfreeze all tasks in the container::

   # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

This is the basic mechanism which should do the right thing for user space tasks
in a simple scenario.
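
Because a read of freezer.state may return FREEZING before it returns
FROZEN, a script that needs the tasks to be fully frozen has to re-read
the file. The following is a minimal sketch of one way to do that, not a
recommended or official tool: ``freeze_and_wait`` is a hypothetical
helper name, and the one-second polling interval and default retry count
are arbitrary assumptions.

```shell
# Hypothetical helper (not part of the kernel interface): request a
# freeze, then poll freezer.state until it reads FROZEN or the retry
# budget runs out.
# Usage: freeze_and_wait CGROUP_DIR [MAX_TRIES]
# Returns 0 once the state is FROZEN, 1 on error or timeout.
freeze_and_wait() {
    cg_dir=$1
    tries=${2:-10}    # assumed default; pick what suits the workload

    # Set the self-state of the cgroup to FROZEN.
    echo FROZEN > "$cg_dir/freezer.state" || return 1

    while [ "$tries" -gt 0 ]; do
        state=$(cat "$cg_dir/freezer.state") || return 1
        if [ "$state" = "FROZEN" ]; then
            return 0
        fi
        # Still FREEZING: some task has not been frozen yet.
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

With the cgroup created in the examples above, this could be called as
``freeze_and_wait /sys/fs/cgroup/freezer/0``. A non-zero return status
means the write failed or the cgroup had not reached FROZEN within the
retry budget.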