.. _unevictable_lru:

==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:


Introduction
============

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and its use to manage several types of "unevictable" pages.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
implementation.  The latter design rationale is discussed in the context of an
implementation description.  Admittedly, one can obtain the implementation
details - the "what does it do?" - by reading the code.  One hopes that the
descriptions below add value by providing the answer to "why does it do that?".



The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
pages and to hide these pages from vmscan.  This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with page
reclaim in Linux.  The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single zone.  When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable.  This can result in a situation where all CPUs are
spending 100% of their time in vmscan for hours or days on end, with the system
completely unresponsive.

The unevictable list addresses the following classes of unevictable pages:

 * Those owned by ramfs.

 * Those mapped into SHM_LOCK'd shared memory regions.

 * Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable Page List
-------------------------

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
PG_lru is set.

The Unevictable LRU infrastructure maintains unevictable pages on an additional
LRU list for a few reasons:

 (1) We get to "treat unevictable pages just like we treat other pages in the
     system - which means we get to use the same code to manipulate them, the
     same code to isolate them (for migrate, etc.), the same code to keep track
     of the statistics, etc..." [Rik van Riel]

 (2) We want to be able to migrate unevictable pages between nodes for memory
     defragmentation, workload management and memory hotplug.  The Linux kernel
     can only migrate pages that it can successfully isolate from the LRU
     lists.  If we were to maintain pages elsewhere than on an LRU-like list,
     where they can be found by isolate_lru_page(), we would prevent their
     migration, unless we reworked the migration code to find the unevictable
     pages itself.


The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages.  This differentiation is only important while the pages are,
in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-zone LRU
lists and statistics originally proposed and posted by Christoph Lameter.

The unevictable list does not use the LRU pagevec mechanism.  Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock.  This allows us to prevent the stranding of pages on
the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page.


Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by
extending the lru_list enum.

The memory controller data structure automatically gets a per-zone unevictable
list as a result of the "arrayification" of the per-zone LRU lists (one per
lru_list enum element).  The memory controller tracks the movement of pages to
and from the unevictable list.
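
For reference, the per-zone LRU lists are indexed by the lru_list enum from
include/linux/mmzone.h; the unevictable list simply adds one more entry.  The
definition looks approximately like this (details vary between kernel
versions)::

  enum lru_list {
          LRU_INACTIVE_ANON = LRU_BASE,
          LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
          LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
          LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
          LRU_UNEVICTABLE,        /* the additional, hidden-from-reclaim list */
          NR_LRU_LISTS
  };
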

When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list.  This has a couple of
effects:

 (1) Because the pages are "hidden" from reclaim on the unevictable list, the
     reclaim process can be more efficient, dealing only with pages that have a
     chance of being reclaimed.

 (2) On the other hand, if too many of the pages charged to the control group
     are unevictable, the evictable portion of the working set of the tasks in
     the control group may not fit into the available memory.  This can cause
     the control group to thrash or to OOM-kill tasks.


.. _mark_addr_space_unevict:

Marking Address Spaces Unevictable
----------------------------------

For facilities such as ramfs, none of the pages attached to the address space
may be evicted.  To prevent eviction of any such pages, the AS_UNEVICTABLE
address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions (a usage sketch follows the list):

 * ``void mapping_set_unevictable(struct address_space *mapping);``

        Mark the address space as being completely unevictable.

 * ``void mapping_clear_unevictable(struct address_space *mapping);``

        Mark the address space as being evictable.

 * ``int mapping_unevictable(struct address_space *mapping);``

        Query the address space, and return true if it is completely
        unevictable.

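
As a hedged illustration (modelled on what ramfs does, not copied from any
particular filesystem), an inode-creation path that wants every page of the
mapping to stay off the normal LRU lists could mark it like this::

  struct inode *inode = new_inode(sb);

  if (inode) {
          /* all pages in this mapping are now treated as unevictable */
          mapping_set_unevictable(inode->i_mapping);
  }
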
These are currently used in three places in the kernel:

 (1) By ramfs to mark the address spaces of its inodes when they are created,
     and this mark remains for the life of the inode.

 (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.

     Note that SHM_LOCK is not required to page in the locked pages if they're
     swapped out; the application must touch the pages manually if it wants to
     ensure they're in memory (see the userspace sketch after this list).

 (3) By the i915 driver to mark pinned address space until it's unpinned.  The
     amount of unevictable memory marked by the i915 driver is roughly the
     bounded object size in debugfs/dri/0/i915_gem_objects.

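
To make the SHM_LOCK case above concrete, here is a minimal userspace sketch
(error handling omitted) that locks a SysV shared memory segment and then
touches its pages, since SHM_LOCK alone does not fault them in::

  #include <string.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  int main(void)
  {
          size_t size = 16 * 4096;
          int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
          char *p = shmat(id, NULL, 0);

          shmctl(id, SHM_LOCK, NULL);     /* mapping becomes unevictable */
          memset(p, 0, size);             /* touch pages so they are resident */

          shmdt(p);
          shmctl(id, IPC_RMID, NULL);
          return 0;
  }
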

Detecting Unevictable Pages
---------------------------

The function page_evictable() in vmscan.c determines whether a page is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.

For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list.  Instead, vmscan will do this if and when it encounters the pages during
a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must
scan the pages in the region and "rescue" them from the unevictable list if no
other condition is keeping them unevictable.  If an unevictable region is
destroyed, the pages are also "rescued" from the unevictable list in the
process of freeing them.

page_evictable() also checks for mlocked pages by testing an additional page
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
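
Putting the two tests together, page_evictable() is approximately the
following; the exact code in mm/vmscan.c differs between kernel versions, so
treat this as a sketch::

  bool page_evictable(struct page *page)
  {
          bool ret;

          /* keep the mapping (inode or swap cache) from going away under us */
          rcu_read_lock();
          ret = !mapping_unevictable(page_mapping(page)) &&
                !PageMlocked(page);
          rcu_read_unlock();
          return ret;
  }
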


Vmscan's Handling of Unevictable Pages
--------------------------------------

If unevictable pages are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list.  However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
unevictable list for the zone being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked.  Such pages will make it all the way to
shrink_page_list() where they will be detected when vmscan walks the reverse
map in try_to_unmap().  If try_to_unmap() returns SWAP_MLOCK,
shrink_page_list() will cull the page at that point.

To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
dropping the page lock.  Because the condition which makes the page unevictable
may change once the page is unlocked, putback_lru_page() will recheck the
unevictable state of a page that it places on the unevictable list.  If the
page has become evictable again, putback_lru_page() removes it from the list
and retries, repeating the page_evictable() test.  Because such a race is a
rare event and movement of pages onto the unevictable list should be rare,
these extra evictability checks should not occur in the majority of calls to
putback_lru_page().
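
A simplified sketch of that recheck follows; the real putback_lru_page() in
mm/vmscan.c also handles the active flag, statistics and per-memcg lruvecs, so
treat this as illustrative only::

  void putback_lru_page(struct page *page)
  {
  redo:
          ClearPageUnevictable(page);

          if (page_evictable(page)) {
                  /* back onto the appropriate active/inactive list */
                  lru_cache_add(page);
          } else {
                  /* hide the page from reclaim */
                  add_page_to_unevictable_list(page);

                  /*
                   * The page may have become evictable between the check
                   * above and the list insertion; if so, pull it back off
                   * and classify it again.
                   */
                  if (page_evictable(page) && !isolate_lru_page(page)) {
                          put_page(page); /* drop isolate_lru_page() reference */
                          goto redo;
                  }
          }

          put_page(page);         /* drop the caller's isolation reference */
  }
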


MLOCKED Pages
=============

The unevictable page list is also useful for mlock(), in addition to ramfs and
SYSV SHM.  Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.


History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count
of VM_LOCKED VMAs that map the page.  This use of the link field for a count
prevented the management of the pages on an LRU list, and thus mlocked pages
were not migratable as isolate_lru_page() could not find them, and the LRU list
link field was not available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the lru list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs.  When
Nick's patch was integrated with the Unevictable LRU work, the count was
replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
mapped the page.  More on this below.


Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
pages.  When such a page has been "noticed" by the memory management subsystem,
the page is marked with the PG_mlocked flag.  This can be manipulated using the
PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU.  Such pages can be "noticed" by memory management in several places:

 (1) in the mlock()/mlockall() system call handlers;

 (2) in the mmap() system call handler when mmapping a region with the
     MAP_LOCKED flag;

 (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
     flag

 (4) in the fault path, if mlocked pages are "culled" in the fault path,
     and when a VM_LOCKED stack segment is expanded; or

 (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
     reclaim a page in a VM_LOCKED VMA via try_to_unmap()

all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
already have it set.

mlocked pages become unlocked and rescued from the unevictable list when:

 (1) mapped in a range unlocked via the munlock()/munlockall() system calls;

 (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
     unmapping at task exit;

 (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
     or

 (4) before a page is COW'd in a VM_LOCKED VMA.


mlock()/mlockall() System Call Handling
---------------------------------------

Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
for each VMA in the range specified by the call.  In the case of mlockall(),
this is the entire active address space of the task.  Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory.  A call to mlock()
an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
treated as a no-op, and mlock_fixup() simply returns.
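
From userspace, the calls that end up in these handlers look like this minimal
sketch (error handling omitted)::

  #include <stdlib.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1 << 20;
          void *buf = malloc(len);

          mlock(buf, len);                /* lock just this buffer ... */
          munlock(buf, len);              /* ... and unlock it again */

          /* lock everything mapped now and everything mapped in the future */
          mlockall(MCL_CURRENT | MCL_FUTURE);
          munlockall();

          free(buf);
          return 0;
  }
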

If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA.  Once the
VMA has been merged or split or neither, mlock_fixup() will call
populate_vma_page_range() to fault in the pages via get_user_pages() and to
mark the pages as mlocked via mlock_vma_page().

Note that the VMA being mlocked might be mapped with PROT_NONE.  In this case,
get_user_pages() will be unable to fault in the pages.  That's okay.  If pages
do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
fault path or in vmscan.

Also note that a page returned by get_user_pages() could be truncated or
migrated out from under us, while we're trying to mlock it.  To detect this,
populate_vma_page_range() checks page_mapping() after acquiring the page lock.
If the page is still associated with its mapping, we'll go ahead and call
mlock_vma_page().  If the mapping is gone, we just unlock the page and move on.
In the worst case, this will result in a page mapped in a VM_LOCKED VMA
remaining on a normal LRU list without being PageMlocked().  Again, vmscan will
detect and cull such pages.

mlock_vma_page() will call TestSetPageMlocked() for each page returned by
get_user_pages().  We use TestSetPageMlocked() because the page might already
be mlocked by another task/VMA and we don't want to do extra work.  We
especially do not want to count an mlocked page more than once in the
statistics.  If the page was already mlocked, mlock_vma_page() need do nothing
more.

If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time.  If isolate_lru_page() succeeds, mlock_vma_page() will put back
the page - by calling putback_lru_page() - which will notice that the page is
now mlocked and divert the page to the zone's unevictable list.  If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.
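
A simplified sketch of that logic follows; the real mm/mlock.c code also
accounts huge pages and counts vm events, so treat this as illustrative only::

  void mlock_vma_page(struct page *page)
  {
          /* the caller holds the page lock */
          BUG_ON(!PageLocked(page));

          if (!TestSetPageMlocked(page)) {
                  /* first locker: account the newly mlocked page ... */
                  inc_zone_page_state(page, NR_MLOCK);

                  /* ... and move it to the unevictable list if we can */
                  if (!isolate_lru_page(page))
                          putback_lru_page(page);
          }
  }
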


Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely.  The pages behind
   these mappings are inherently pinned, so we don't need to mark them as
   mlocked.  In any case, most of these pages have no struct page in which to
   mark them as mlocked.  Because of this, get_user_pages() will fail for these
   VMAs, so there is no sense in attempting to visit them.

2) VMAs mapping hugetlbfs pages are already effectively pinned into memory.  We
   neither need nor want to mlock() these pages.  However, to preserve the
   prior behavior of mlock() - before the unevictable/mlock changes -
   mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
   allocate the huge pages and populate the ptes.

3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
   such as the VDSO page, relay channel pages, etc.  These pages
   are inherently unevictable and are not managed on the LRU lists.
   mlock_fixup() treats these VMAs the same as hugetlbfs VMAs.  It calls
   make_pages_present() to populate the ptes.

Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag.  Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit.  Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".

.. _munlock_munlockall_handling:

munlock()/munlockall() System Call Handling
-------------------------------------------

The munlock() and munlockall() system calls are handled by the same functions -
do_mlock[all]() - as the mlock() and mlockall() system calls, with the unlock
vs lock operation indicated by an argument.  So, these system calls are also
handled by mlock_fixup().  Again, if called for an already munlocked VMA,
mlock_fixup() simply returns.  Because of the VMA filtering discussed above,
VM_LOCKED will not be set in any "special" VMAs.  So, these VMAs will be
ignored for munlock.

If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
specified range.  The range is then munlocked via the function
populate_vma_page_range() - the same function used to mlock a VMA range -
passing a flag to indicate that munlock() is being performed.

Because the VMA access protections could have been changed to PROT_NONE after
faulting in and mlocking pages, get_user_pages() was unreliable for visiting
these pages for munlocking.  Because we don't want to leave pages mlocked,
get_user_pages() was enhanced to accept a flag to ignore the permissions when
fetching the pages - all of which should be resident as a result of previous
mlocking.

For munlock(), populate_vma_page_range() unlocks individual pages by calling
munlock_vma_page().  munlock_vma_page() unconditionally clears the PG_mlocked
flag using TestClearPageMlocked().  As with mlock_vma_page(),
munlock_vma_page() uses the Test*PageMlocked() function to handle the case
where the page might have already been unlocked by another task.  If the page
was mlocked, munlock_vma_page() updates the zone statistics for the number of
mlocked pages.  Note, however, that at this point we haven't checked whether
the page is mapped by other VM_LOCKED VMAs.

We can't call try_to_munlock(), the function that walks the reverse map to
check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
not be on an LRU list [more on these below].  However, the call to
isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
we go ahead and clear PG_mlocked up front, as this might be the only chance we
have.  If we can successfully isolate the page, we go ahead and
try_to_munlock(), which will restore the PG_mlocked flag and update the zone
page statistics if it finds another VMA holding the page mlocked.  If we fail
to isolate the page, we'll have left a potentially mlocked page on the LRU.
This is fine, because we'll catch it later if and when vmscan tries to reclaim
the page.  This should be relatively rare.
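
A simplified sketch of munlock_vma_page(), leaving out the batching and
huge-page handling of the real mm/mlock.c code::

  void munlock_vma_page(struct page *page)
  {
          BUG_ON(!PageLocked(page));

          if (TestClearPageMlocked(page)) {
                  /* optimistically account the page as no longer mlocked */
                  dec_zone_page_state(page, NR_MLOCK);

                  if (!isolate_lru_page(page)) {
                          /*
                           * Walk the rmap; if another VM_LOCKED VMA still
                           * maps the page, try_to_munlock() re-sets
                           * PG_mlocked and fixes up the statistics.
                           */
                          try_to_munlock(page);
                          putback_lru_page(page);
                  }
                  /* if isolation failed, vmscan sorts the page out later */
          }
  }
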


Migrating MLOCKED Pages
-----------------------

A page that is being migrated has been isolated from the LRU lists and is held
locked across unmapping of the page, updating the page's address space entry
and copying the contents and state, until the page table entry has been
replaced with an entry that refers to the new page.  Linux supports migration
of mlocked pages and other unevictable pages.  This involves simply moving the
PG_mlocked and PG_unevictable states from the old page to the new page.

Note that page migration can race with mlocking or munlocking of the same page.
This has been discussed from the mlock/munlock perspective in the respective
sections above.  Both processes (migration and m[un]locking) hold the page
locked.  This provides the first level of synchronization.  Page migration
zeros out the page_mapping of the old page before unlocking it, so m[un]lock
can skip these pages by testing the page mapping under page lock.

To complete page migration, we place the new and old pages back onto the LRU
after dropping the page lock.  The "unneeded" page - old page on success, new
page on failure - will be freed when the reference count held by the migration
process is released.  To ensure that we don't strand pages on the unevictable
list because of a race between munlock and migration, page migration uses the
putback_lru_page() function to add migrated pages back to the LRU.


Compacting MLOCKED Pages
------------------------

The unevictable LRU can be scanned for compactable regions and the default
behavior is to do so.  /proc/sys/vm/compact_unevictable_allowed controls
this behavior (see Documentation/admin-guide/sysctl/vm.rst).  Once scanning of
the unevictable LRU is enabled, the work of compaction is mostly handled by
the page migration code and the same work flow as described in MIGRATING
MLOCKED PAGES will apply.

MLOCKING Transparent Huge Pages
-------------------------------

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We cannot just split the page on partial mlock() as split_huge_page() can
fail, and a new intermittent failure mode for the syscall is undesirable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

This way the huge page is accessible to vmscan.  Under memory pressure the
page will be split; subpages which belong to VM_LOCKED VMAs will be moved
to the unevictable LRU and the rest can be reclaimed.

See also the comment in follow_trans_huge_pmd().

mmap(MAP_LOCKED) System Call Handling
-------------------------------------

In addition to the mlock()/mlockall() system calls, an application can request
that a region of memory be mlocked by supplying the MAP_LOCKED flag to the
mmap() call.  There is one important and subtle difference here, though:
mmap() + mlock() will fail, returning ENOMEM, if the range cannot be faulted in
(e.g. because mm_populate() fails), while mmap(MAP_LOCKED) will not fail.  The
mmapped area will still have the properties of a locked area - i.e. pages will
not get swapped out - but major page faults to fault memory in might still
happen.
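
The difference in failure behaviour is visible from userspace; in the sketch
below only the second variant reports the locking failure explicitly::

  #define _DEFAULT_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1 << 20;

          /* best effort: the VMA is VM_LOCKED, but mmap() succeeds even if
           * the pages cannot all be faulted in up front */
          void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

          /* strict: mlock() fails with ENOMEM if the range cannot be fully
           * faulted in and locked */
          void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (b == MAP_FAILED || mlock(b, len) != 0)
                  perror("mmap+mlock");

          if (a != MAP_FAILED)
                  munmap(a, len);
          if (b != MAP_FAILED)
                  munmap(b, len);
          return 0;
  }
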

Furthermore, any mmap() call - or brk() call that expands the heap - by a task
that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked.  Before the unevictable/mlock
changes, the kernel simply called make_pages_present() to allocate pages and
populate the page table.

To mlock a range of memory under the unevictable/mlock infrastructure, the
mmap() handler and task address space expansion functions call
populate_vma_page_range() specifying the vma and the address range to mlock.

The callers of populate_vma_page_range() will have already added the memory range
to be mlocked to the task's "locked_vm".  To account for filtered VMAs,
populate_vma_page_range() returns the number of pages NOT mlocked.  All of the
callers then subtract a non-negative return value from the task's locked_vm.  A
negative return value represents an error - for example, from get_user_pages()
attempting to fault in a VMA with PROT_NONE access.  In this case, we leave the
memory range accounted as locked_vm, as the protections could be changed later
and pages allocated into that region.


munmap()/exit()/exec() System Call Handling
-------------------------------------------

When unmapping an mlocked region of memory, whether by an explicit call to
munmap() or via an internal unmap from exit() or exec() processing, we must
munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

To munlock a range of memory under the unevictable/mlock infrastructure, the
munmap() handler and the task address space tear-down code call
munlock_vma_pages_all().  The name reflects the observation that one always
specifies the entire VMA range when munlock()ing during unmap of a region.
Because of the VMA filtering when mlock()ing regions, only "normal" VMAs that
actually contain mlocked pages will be passed to munlock_vma_pages_all().

munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
for the munlock case, calls __munlock_vma_pages_range() to walk the page table
for the VMA's memory range and munlock_vma_page() each resident page mapped by
the VMA.  This effectively munlocks the page, but only if this is the last
VM_LOCKED VMA that maps the page.


try_to_unmap()
--------------

Pages can, of course, be mapped into multiple VMAs.  Some of these VMAs may
have the VM_LOCKED flag set.  It is possible for a page mapped into one or more
VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
of the active or inactive LRU lists.  This could happen if, for example, a task
in the process of munlocking the page could not isolate the page from the LRU.
As a result, vmscan/shrink_page_list() might encounter such a page as described
in section "vmscan's handling of unevictable pages".  To handle this situation,
try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
map.

try_to_unmap() is always called - by vmscan for reclaim or by the page
migration code - with the argument page locked and isolated from the LRU.
Separate functions handle anonymous, mapped file and KSM pages, as these types
of pages have different reverse map lookup mechanisms, with different locking.
In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
it will call try_to_unmap_one() for every VMA which might contain the page.

When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
and return SWAP_MLOCK to indicate that the page is unevictable; the scan
stops there.
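
Conceptually, the check inside try_to_unmap_one() is the fragment below; the
real mm/rmap.c code wraps it in additional locking and flag handling, so this
is only a sketch and not complete on its own::

  if (!(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
          /* fix up the page's state lazily instead of unmapping it */
          mlock_vma_page(page);
          ret = SWAP_MLOCK;
          goto out_unmap;
  }
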

mlock_vma_page() is called while holding the page table's lock (in addition
to the page lock, and the rmap lock): to serialize against concurrent mlock or
munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
holepunching, and truncation of file pages and their anonymous COWed pages.


try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
   page_referenced() reverse map walker.

When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
Handling <munlock_munlockall_handling>` above] tries to munlock a
page, it needs to determine whether or not the page is mapped by any
VM_LOCKED VMA without actually attempting to unmap all PTEs from the
page.  For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock().

try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
mapped file and KSM pages with a flag argument specifying unlock versus unmap
processing.  Again, these functions walk the respective reverse maps looking
for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page().

Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
However, the scan can terminate when it encounters a VM_LOCKED VMA.
Although try_to_munlock() might be called a great many times when munlocking a
large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event.

Page Reclaim in shrink_*_list()
-------------------------------

shrink_active_list() culls any obviously unevictable pages - i.e.
!page_evictable(page) - diverting these to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive lru lists.  Note that these pages do not have PageUnevictable
set - otherwise they would be on the unevictable list and shrink_active_list
would never see them.

Some examples of these unevictable pages on the LRU lists are:

 (1) ramfs pages that have been placed on the LRU lists when first allocated.

 (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
     allocate or fault in the pages in the shared memory region.  This happens
     when an application accesses the page the first time after SHM_LOCK'ing
     the segment.

 (3) mlocked pages that could not be isolated from the LRU and moved to the
     unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
inactive lists to the appropriate zone's unevictable list.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
but will pass them on to shrink_page_list().

shrink_page_list() again culls obviously unevictable pages that it could
encounter for similar reasons to shrink_inactive_list().  Pages mapped into
VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
try_to_unmap().  shrink_page_list() will divert them to the unevictable list
when try_to_unmap() returns SWAP_MLOCK, as discussed above.