.. _balance:

================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM
as well as for non-__GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller
cannot sleep, either because it holds a spinlock or because it is in
interrupt context. The second may be that the caller is willing to fail
the allocation without incurring the overhead of page reclaim. This may
happen for opportunistic high-order allocation requests that have
order-0 fallback options. In such cases, the caller may also wish to
avoid waking kswapd.

Non-__GFP_IO allocation requests are made to prevent file system
deadlocks.

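
The flag-based decisions above can be sketched as follows. This is a
minimal userspace illustration, not the kernel's gfp_t machinery: the
bit values and the two helper names are hypothetical stand-ins, shown
only to make the decision logic concrete.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the real kernel values differ. */
#define __GFP_ATOMIC         0x1u  /* caller cannot sleep            */
#define __GFP_KSWAPD_RECLAIM 0x2u  /* caller allows waking kswapd    */
#define __GFP_IO             0x4u  /* caller may start low-level I/O */

/* A request may enter reclaim (balancing) only if the caller can
 * sleep; opportunistic callers additionally clear
 * __GFP_KSWAPD_RECLAIM so that a cheap failure does not wake kswapd. */
static bool may_reclaim(unsigned int gfp_mask)
{
    return !(gfp_mask & __GFP_ATOMIC);
}

static bool may_wake_kswapd(unsigned int gfp_mask)
{
    return gfp_mask & __GFP_KSWAPD_RECLAIM;
}
```
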
In the absence of non-sleepable allocation requests, it seems
detrimental to be doing balancing. Page reclamation could be kicked
off lazily, that is, only when needed (i.e., when a zone's free memory
drops to 0), instead of being a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests
(atomic or not). A similar argument applies to highmem and direct
mapped pages. OTOH, if there are a lot of free dma pages, it is
preferable to satisfy regular memory requests by allocating one from
the dma pool, instead of incurring the overhead of regular zone
balancing.

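
The fallback preference described above can be sketched as a walk from
the requested zone class down toward dma. The zone names and the
simple free-page threshold here are illustrative, not the allocator's
actual code.

```c
#include <assert.h>

enum zone_idx { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

/* Serve a request from the highest zone its class permits, falling
 * back to lower class zones only when the preferred ones are short of
 * free pages, so that the dma pool is dipped into last. */
static int pick_zone(const unsigned long free[NR_ZONES],
                     enum zone_idx zclass, unsigned long nr_pages)
{
    for (int i = zclass; i >= ZONE_DMA; i--)
        if (free[i] >= nr_pages)
            return i;
    return -1; /* no zone can satisfy the request */
}
```
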
In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With
the right ratio of dma and regular memory, it was quite possible that
balancing would not be done even when the dma zone was completely
empty. 2.2 has been running on production machines of varying memory
sizes, and seems to be doing fine even in the presence of this
problem. In 2.3, due to HIGHMEM, this problem is aggravated.

In 2.3, zone balancing can be done in one of two ways: depending on
the zone size (and possibly on the sizes of the lower class zones), we
can decide at init time how many free pages we should aim for while
balancing any zone. The good part is that, while balancing, we do not
need to look at the sizes of the lower class zones; the bad part is
that we might balance too frequently because we ignore possibly lower
usage in the lower class zones. Also, with a slight change in the
allocation routine, it is possible to reduce the memclass() macro to a
simple equality check.

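
A minimal sketch of this first approach, assuming a per-zone 1/64
ratio like 2.2's global one (the ratio, and any weighting by lower
class zone sizes, are placeholders):

```c
#include <assert.h>

/* Compute a static free-page target per zone once at init, from the
 * zone's own size; allocation-time balancing then compares a zone's
 * free pages against target[i] without looking at other zones. */
static void init_zone_targets(const unsigned long size[],
                              unsigned long target[], int nr_zones)
{
    for (int i = 0; i < nr_zones; i++)
        target[i] = size[i] / 64;
}
```
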
Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible.
Also, the balancing algorithm works the same way on the various
architectures, which have different numbers and types of zones. If we
wanted to get fancy, we could assign different weights to free pages
in different zones in the future.

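
A sketch of this second scheme, with zone indices ordered dma <
regular < highmem (the arrays and the 1/64 ratio are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Balance zone 'idx' only when the free pages of that zone plus all
 * its lower class zones fall below 1/64 of their combined size. */
static bool zone_needs_balance(const unsigned long free[],
                               const unsigned long size[], int idx)
{
    unsigned long free_sum = 0, size_sum = 0;

    for (int i = 0; i <= idx; i++) {
        free_sum += free[i];
        size_sum += size[i];
    }
    return free_sum < size_sum / 64;
}
```

For example, with size = {6400, 64000} and free = {0, 2000}, the dma
zone (index 0) needs balancing, even though a 2.2-style check on the
totals alone (2000 free pages against 70400/64 = 1100) would never
have triggered.
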
Note that if the size of the regular zone is huge compared to the dma
zone, it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also
balanced, so as to give a fighting chance for replace_with_highmem()
to get a HIGHMEM page, as well as to ensure that HIGHMEM allocations
do not fall back into the regular zone. This also makes sure that
HIGHMEM pages are not leaked (for example, in situations where a
HIGHMEM page is in the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt
context and all process contexts are sleeping. For 2.3, kswapd does
not really need to balance the highmem zone, since interrupt context
does not request highmem pages. kswapd looks at the zone_wake_kswapd
field in the zone structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page
would alleviate memory pressure on any zone in the page's node that
has fallen below its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd:
These are per-zone fields, used to determine when a zone needs to be
balanced. When the number of free pages falls below
watermark[WMARK_MIN], the hysteresis field low_on_memory gets set. It
stays set until the number of free pages climbs back to
watermark[WMARK_HIGH]. While low_on_memory is set, page allocation
requests will try to free some pages in the zone (provided GFP_WAIT is
set in the request). Orthogonal to this is the decision to poke kswapd
to free some zone pages. That decision is not hysteresis based, and is
made whenever the number of free pages falls below
watermark[WMARK_LOW]; in that case zone_wake_kswapd is also set.

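
The state transitions described above can be sketched as follows,
using a simplified struct rather than the kernel's actual struct zone:

```c
#include <assert.h>
#include <stdbool.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone_state {
    unsigned long watermark[NR_WMARK];
    unsigned long free_pages;
    bool low_on_memory;     /* hysteresis: set below MIN, cleared at HIGH */
    bool zone_wake_kswapd;  /* plain threshold at LOW, no hysteresis      */
};

/* Recompute the balance flags after free_pages changes. */
static void update_balance_state(struct zone_state *z)
{
    if (z->free_pages < z->watermark[WMARK_MIN])
        z->low_on_memory = true;
    else if (z->free_pages >= z->watermark[WMARK_HIGH])
        z->low_on_memory = false;

    z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}
```

Between WMARK_MIN and WMARK_HIGH, low_on_memory keeps whatever value
it last had; that is the hysteresis: a zone that dipped below
WMARK_MIN keeps freeing pages on allocation until it has climbed all
the way back to WMARK_HIGH.
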

(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: the number of failed
   requests for a zone can be tracked and fed into the balancing
   scheme (jalvo@mbay.net).
2. Implement a replace_with_highmem()-like replace_with_regular() to
   preserve dma pages (lkd@tantalophile.demon.co.uk).
103