=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex too. This document describes memcg's internal behavior.
Please note that implementation details can change.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage?
=======================

   Two objects are used.

   page_cgroup ... an object per page.

	Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.

	Allocated at swapon(). Freed at swapoff().

   The page_cgroup has a USED bit, so a page_cgroup is never counted twice.
   swap_cgroup is used only when a charged page is swapped out.

1. Charge
=========

   A page/swp_entry may be charged (usage += PAGE_SIZE) at

	mem_cgroup_try_charge()

2. Uncharge
===========

   A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

	mem_cgroup_uncharge()
	  Called when a page's refcount goes down to 0.

	mem_cgroup_uncharge_swap()
	  Called when swp_entry's refcnt goes down to 0. A charge against swap
	  disappears.

3. charge-commit-cancel
=======================

	Memcg pages are charged in two steps:

		- mem_cgroup_try_charge()
		- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

	At try_charge(), there are no flags to say "this page is charged".
	At this point, usage += PAGE_SIZE.

	At commit(), the page is associated with the memcg.

	At cancel(), simply usage -= PAGE_SIZE.
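
	The two steps can be followed with a toy model. This is only a sketch
	of the accounting described above, not kernel code: USAGE, LIMIT, OWNER,
	the 4096-byte page size and the function names are all assumptions made
	for illustration.

```shell
#!/bin/sh
# Toy model of two-step charging. NOT kernel code; every name is made up.

PAGE_SIZE=4096                  # assumed page size
LIMIT=$((2 * PAGE_SIZE))        # hypothetical limit: two pages
USAGE=0                         # bytes charged so far
OWNER=""                        # group the page belongs to (set at commit)

# Step 1: reserve one page of usage. No per-page state is touched yet.
try_charge() {
	if [ $((USAGE + PAGE_SIZE)) -gt "$LIMIT" ]; then
		return 1        # over limit (a real kernel tries reclaim first)
	fi
	USAGE=$((USAGE + PAGE_SIZE))
}

# Step 2a: associate the page with the group. Usage does not change.
commit_charge() {
	OWNER=$1
}

# Step 2b: give the reservation back.
cancel_charge() {
	USAGE=$((USAGE - PAGE_SIZE))
}
```

	In this model, try_charge followed by cancel_charge leaves USAGE
	unchanged, and try_charge followed by commit_charge leaves the page
	both counted and owned.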

In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
============

	An anonymous page is newly allocated at
		  - page fault into a MAP_ANONYMOUS mapping.
		  - Copy-On-Write.

	4.1 Swap-in.
	At swap-in, the page is taken from the swap cache. There are 2 cases.

	(a) If the SwapCache is newly allocated and read, it has no charges.
	(b) If the SwapCache has been mapped by processes, it has been
	    charged already.

	4.2 Swap-out.
	At swap-out, the typical state transition is as follows.

	(a) add to swap cache. (marked as SwapCache)
	    swp_entry's refcnt += 1.
	(b) fully unmapped.
	    swp_entry's refcnt += # of ptes.
	(c) write back to swap.
	(d) delete from swap cache. (remove from SwapCache)
	    swp_entry's refcnt -= 1.


	Finally, at task exit,
	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
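
	The refcnt arithmetic of steps (a) to (e) can be written out as below.
	This is a worked example only. The function name is hypothetical, and
	handling more than one pte by subtracting all of them in step (e) is an
	assumption; the text above covers the single-pte case.

```shell
#!/bin/sh
# swap_refcnt_walk [NR_PTES]: print swp_entry's refcnt after steps
# (a), (b), (d) and (e) of the swap-out sequence.
swap_refcnt_walk() (
	nr_ptes=${1:-1}                 # assumption: one pte by default
	refcnt=1                        # (a) added to swap cache
	trace="$refcnt"
	refcnt=$((refcnt + nr_ptes))    # (b) fully unmapped
	trace="$trace $refcnt"
	# (c) writeback to swap does not change the count
	refcnt=$((refcnt - 1))          # (d) deleted from swap cache
	trace="$trace $refcnt"
	refcnt=$((refcnt - nr_ptes))    # (e) zap_pte() at task exit
	trace="$trace $refcnt"
	echo "$trace"
)
```

	With one pte the count goes 1, 2, 1, 0. At 0 the charge against swap
	disappears, as described for mem_cgroup_uncharge_swap().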

5. Page Cache
=============

	Page Cache is charged at
	- add_to_page_cache_locked().

	The logic is very clear. (About migration, see below)

	Note:
	  __remove_from_page_cache() is called by remove_from_page_cache()
	  and __remove_mapping().

6. Shmem(tmpfs) Page Cache
===========================

	The best way to understand shmem's page state transition is to read
	mm/shmem.c.

	But a brief explanation of memcg's behavior around shmem will be
	helpful to understand the logic.

	Shmem's page (just leaf page, not direct/indirect block) can be on

		- radix-tree of shmem's inode.
		- SwapCache.
		- Both on radix-tree and SwapCache. This happens at swap-in
		  and swap-out.

	It's charged when...

	- A new page is added to shmem's radix-tree.
	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)

7. Page Migration
=================

	mem_cgroup_migrate()

8. LRU
======

	Each memcg has its own private LRU. Now, its handling is under global
	VM's control (meaning that it's handled under the global pgdat->lru_lock).
	Almost all routines around memcg's LRU are called by the global LRU's
	list management functions under pgdat->lru_lock.

	A special function is mem_cgroup_isolate_pages(). This scans
	memcg's private LRU and calls __isolate_lru_page() to extract a page
	from the LRU.

	(By __isolate_lru_page(), the page is removed from both the global and
	private LRU.)


9. Typical Tests.
=================

 Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

	When you test racy cases, it's a good idea to set memcg's limit
	to be very small rather than GB. Many races have been found in tests
	under xKB or xxMB limits.

	(Memory behavior under a GB limit and under an MB limit is very
	different.)
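
	A small helper keeps such runs repeatable. This is a minimal sketch:
	set_limit is a hypothetical name, and the group path in the usage
	comment is an assumption.

```shell
#!/bin/sh
# set_limit DIR VALUE: write VALUE to DIR/memory.limit_in_bytes and print
# what the file reports afterwards.
set_limit() (
	dir=$1
	limit=$2
	echo "$limit" > "$dir/memory.limit_in_bytes" || exit 1
	cat "$dir/memory.limit_in_bytes"
)

# Example, as root (the path is an assumption):
#	mkdir /cgroup/small
#	set_limit /cgroup/small 512K
```

	Reading the file back shows the limit the kernel actually accepted,
	which is worth checking before starting the racy workload.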

9.2 Shmem
---------

	Historically, memcg's shmem handling was poor and we saw a fair amount
	of trouble here. This is because shmem is page cache but can also be
	SwapCache. Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

	For NUMA, migration is another special case. For easy testing, cpuset
	is useful. The following is a sample script to set up migration::

		mount -t cgroup -o cpuset none /opt/cpuset

		mkdir /opt/cpuset/01
		echo 1 > /opt/cpuset/01/cpuset.cpus
		echo 0 > /opt/cpuset/01/cpuset.mems
		echo 1 > /opt/cpuset/01/cpuset.memory_migrate
		mkdir /opt/cpuset/02
		echo 1 > /opt/cpuset/02/cpuset.cpus
		echo 1 > /opt/cpuset/02/cpuset.mems
		echo 1 > /opt/cpuset/02/cpuset.memory_migrate

	In the above setup, when you move a task from 01 to 02, page migration
	from node 0 to node 1 will occur. The following is a script to migrate
	all tasks under a cpuset::

		--
		# G1 and G2 are assumed to be the two groups created above.
		G1=/opt/cpuset/01
		G2=/opt/cpuset/02

		# Write each pid in $1 to $2/tasks, one write per pid.
		move_task()
		{
		for pid in $1
		do
			/bin/echo $pid >$2/tasks 2>/dev/null
			echo -n $pid
			echo -n " "
		done
		echo END
		}

		G1_TASK=`cat ${G1}/tasks`
		G2_TASK=`cat ${G2}/tasks`
		move_task "${G1_TASK}" ${G2} &
		--

9.4 Memory hotplug
------------------

	A memory hotplug test is also a good test.

	To offline memory, do the following::

		# echo offline > /sys/devices/system/memory/memoryXXX/state

	(XXX is the number of the memory block)

	This is an easy way to test page migration, too.
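
	The offline step can be wrapped so that a failed attempt is visible.
	This is a sketch under assumptions: offline_block is a hypothetical
	name, and the sysfs root is passed as an argument so the helper can
	be tried against a scratch directory.

```shell
#!/bin/sh
# offline_block ROOT N: try to offline memory block N below ROOT and print
# the state the block reports afterwards.
# ROOT is normally /sys/devices/system/memory.
offline_block() (
	state=$1/memory$2/state
	[ -w "$state" ] || { echo "no such block: $2" >&2; exit 1; }
	# The write fails if the pages in the block cannot be migrated away.
	echo offline > "$state" 2>/dev/null
	cat "$state"
)

# Example, as root (the block number is an assumption):
#	offline_block /sys/devices/system/memory 32
```

	If the printed state is still "online", the block could not be
	emptied; running the same call while memcg jobs are active is the
	interesting case.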

9.5 mkdir/rmdir
---------------

	When using hierarchy, a mkdir/rmdir test should be done.
	Use tests like the following::

		echo 1 >/opt/cgroup/01/memory.use_hierarchy
		mkdir /opt/cgroup/01/child_a
		mkdir /opt/cgroup/01/child_b

		set limit to 01.
		add limit to 01/child_b
		run jobs under child_a and child_b

	Create/delete the following groups at random while jobs are running::

		/opt/cgroup/01/child_a/child_aa
		/opt/cgroup/01/child_b/child_bb
		/opt/cgroup/01/child_c

	Running new jobs in a new group is also good.
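
	The random create/delete step could be driven by a loop like the one
	below. It is a sketch, not a fixed test plan: churn_groups is a
	hypothetical name, and reading one byte from /dev/urandom with od is
	just one assumed way to pick a group.

```shell
#!/bin/sh
# churn_groups BASE COUNT: COUNT times, pick one of the three groups at
# random and remove it if it exists, create it otherwise.
# Failures are expected and ignored: rmdir fails while a group still has
# tasks or children.
churn_groups() (
	base=$1
	count=$2
	i=0
	while [ "$i" -lt "$count" ]; do
		r=$(od -An -N1 -tu1 /dev/urandom)
		case $((r % 3)) in
		0) g=$base/child_a/child_aa ;;
		1) g=$base/child_b/child_bb ;;
		2) g=$base/child_c ;;
		esac
		if [ -d "$g" ]; then
			rmdir "$g" 2>/dev/null
		else
			mkdir "$g" 2>/dev/null
		fi
		i=$((i + 1))
	done
)

# Example, while jobs run under child_a and child_b:
#	churn_groups /opt/cgroup/01 1000
```

	child_a and child_b themselves are never removed, so the running jobs
	stay where they are while their siblings and children come and go.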

9.6 Mount with other subsystems
-------------------------------

	Mounting with other subsystems is a good test because there are
	races and lock dependencies with other cgroup subsystems.

	Example::

		# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

	and do task move, mkdir, rmdir etc. under this.

9.7 swapoff
-----------

	Besides swap management being one of the complicated parts of memcg,
	the call path of swap-in at swapoff is not the same as the usual
	swap-in path. It's worth testing explicitly.

	For example, a test like the following is good:

	(Shell-A)::

		# mount -t cgroup none /cgroup -o memory
		# mkdir /cgroup/test
		# echo 40M > /cgroup/test/memory.limit_in_bytes
		# echo 0 > /cgroup/test/tasks

	Run a malloc(100M) program under this. You'll see 60M of swap.

	(Shell-B)::

		# move all tasks in /cgroup/test to /cgroup
		# /sbin/swapoff -a
		# rmdir /cgroup/test
		# kill malloc task.

	Of course, a tmpfs vs. swapoff test should be done, too.
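
	The "move all tasks" step in Shell-B can be done with a helper like
	this. It is a minimal sketch; move_all_tasks is a hypothetical name.

```shell
#!/bin/sh
# move_all_tasks SRC DST: write every pid listed in SRC/tasks to DST/tasks,
# one write per pid, and print how many writes succeeded.
move_all_tasks() (
	src=$1
	dst=$2
	moved=0
	# The pid list is read once, before any task is moved.
	for pid in $(cat "$src/tasks"); do
		# A task may exit between the read and the write; ignore that.
		if echo "$pid" > "$dst/tasks" 2>/dev/null; then
			moved=$((moved + 1))
		fi
	done
	echo "$moved"
)

# Example for Shell-B:
#	move_all_tasks /cgroup/test /cgroup
```

	rmdir of /cgroup/test only succeeds once its tasks file is empty, so
	the printed count is a quick sanity check before the swapoff step.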

9.8 OOM-Killer
--------------

	Out-of-memory caused by memcg's limit will kill tasks under
	the memcg. When hierarchy is used, a task under the hierarchy
	will be killed by the kernel.

	In this case, panic_on_oom shouldn't be invoked and tasks
	in other groups shouldn't be killed.

	It's not difficult to cause OOM under memcg, as follows.

	Case A) when you can swapoff::

		#swapoff -a
		#echo 50M > /memory.limit_in_bytes

	run 51M of malloc

	Case B) when you use mem+swap limitation::

		#echo 50M > memory.limit_in_bytes
		#echo 50M > memory.memsw.limit_in_bytes

	run 51M of malloc

9.9 Move charges at task migration
----------------------------------

	Charges associated with a task can be moved along with task migration.

	(Shell-A)::

		#mkdir /cgroup/A
		#echo $$ >/cgroup/A/tasks

	Run some programs which use some amount of memory in /cgroup/A.

	(Shell-B)::

		#mkdir /cgroup/B
		#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
		#echo "pid of the program running in group A" >/cgroup/B/tasks

	You can see that charges have been moved by reading ``*.usage_in_bytes`` or
	memory.stat of both A and B.

	See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what value should
	be written to move_charge_at_immigrate.
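
	To compare both groups, it helps to print their usage side by side
	before and after the move. This is a sketch; show_usage is a
	hypothetical name and only memory.usage_in_bytes is read.

```shell
#!/bin/sh
# show_usage DIR...: for each group directory, print its last path
# component followed by the content of its memory.usage_in_bytes.
show_usage() (
	for dir in "$@"; do
		printf '%s %s\n' "${dir##*/}" \
			"$(cat "$dir/memory.usage_in_bytes")"
	done
)

# Example, run once before and once after the task is moved:
#	show_usage /cgroup/A /cgroup/B
```

	After the move, A's value should have dropped and B's should have
	grown by roughly the same amount if the charges followed the task.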

9.10 Memory thresholds
----------------------

	The memory controller implements memory thresholds using the cgroups
	notification API. You can use tools/cgroup/cgroup_event_listener.c to
	test it.

	(Shell-A) Create cgroup and run event listener::

		# mkdir /cgroup/A
		# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

	(Shell-B) Add task to cgroup and try to allocate and free memory::

		# echo $$ >/cgroup/A/tasks
		# a="$(dd if=/dev/zero bs=1M count=10)"
		# a=

	You will see a message from cgroup_event_listener every time you cross
	the thresholds.

	Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

	It's a good idea to test the root cgroup as well.