.. _unevictable_lru:

==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:


Introduction
============

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
pages.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
implementation.  The latter design rationale is discussed in the context of an
implementation description.  Admittedly, one can obtain the implementation
details - the "what does it do?" - by reading the code.  One hopes that the
descriptions below add value by providing the answer to "why does it do that?".


The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
pages and to hide these pages from vmscan.  This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with page
reclaim in Linux.  The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single zone.  When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable.  This can result in a situation where all CPUs are
spending 100% of their time in vmscan for hours or days on end, with the system
completely unresponsive.

The unevictable list addresses the following classes of unevictable pages:

 * Those owned by ramfs.

 * Those mapped into SHM_LOCK'd shared memory regions.

 * Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable Page List
-------------------------

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU
list called the "unevictable" list and an associated page flag, PG_unevictable,
to indicate that the page is being managed on the unevictable list.
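
As a rough sketch of how these flags translate into list membership (loosely
based on the kernel's mm helpers; names and details vary by kernel version,
so this is illustrative rather than authoritative)::

  /* Sketch only: pick the LRU list for a page that has PG_lru set. */
  static enum lru_list page_lru_sketch(struct page *page)
  {
          enum lru_list lru;

          if (PageUnevictable(page))              /* PG_unevictable set */
                  return LRU_UNEVICTABLE;         /* the additional list */

          /* otherwise one of the regular anon/file lists */
          lru = page_lru_base_type(page);         /* inactive anon or file */
          if (PageActive(page))                   /* PG_active set */
                  lru += LRU_ACTIVE;
          return lru;
  }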

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
PG_lru is set.

The Unevictable LRU infrastructure maintains unevictable pages on an additional
LRU list for a few reasons:

 (1) We get to "treat unevictable pages just like we treat other pages in the
     system - which means we get to use the same code to manipulate them, the
     same code to isolate them (for migrate, etc.), the same code to keep track
     of the statistics, etc..." [Rik van Riel]

 (2) We want to be able to migrate unevictable pages between nodes for memory
     defragmentation, workload management and memory hotplug.  The Linux kernel
     can only migrate pages that it can successfully isolate from the LRU
     lists.  If we were to maintain pages elsewhere than on an LRU-like list,
     where they can be found by isolate_lru_page(), we would prevent their
     migration, unless we reworked migration code to find the unevictable pages
     itself.


The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages.  This differentiation is only important while the pages are,
in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-zone LRU
lists and statistics originally proposed and posted by Christoph Lameter.

The unevictable list does not use the LRU pagevec mechanism.  Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock.  This allows us to prevent the stranding of pages on
the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page.


Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by
extending the lru_list enum.

The memory controller data structure automatically gets a per-zone unevictable
list as a result of the "arrayification" of the per-zone LRU lists (one per
lru_list enum element).  The memory controller tracks the movement of pages to
and from the unevictable list.

When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list.
This has a couple of effects:

 (1) Because the pages are "hidden" from reclaim on the unevictable list, the
     reclaim process can be more efficient, dealing only with pages that have a
     chance of being reclaimed.

 (2) On the other hand, if too many of the pages charged to the control group
     are unevictable, the evictable portion of the working set of the tasks in
     the control group may not fit into the available memory.  This can cause
     the control group to thrash or to OOM-kill tasks.


.. _mark_addr_space_unevict:

Marking Address Spaces Unevictable
----------------------------------

For facilities such as ramfs, none of the pages attached to the address space
may be evicted.  To prevent eviction of any such pages, the AS_UNEVICTABLE
address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions:

 * ``void mapping_set_unevictable(struct address_space *mapping);``

        Mark the address space as being completely unevictable.

 * ``void mapping_clear_unevictable(struct address_space *mapping);``

        Mark the address space as being evictable.

 * ``int mapping_unevictable(struct address_space *mapping);``

        Query the address space, and return true if it is completely
        unevictable.

These are currently used in three places in the kernel:

 (1) By ramfs to mark the address spaces of its inodes when they are created,
     and this mark remains for the life of the inode.

 (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
     Note that SHM_LOCK is not required to page in the locked pages if they're
     swapped out; the application must touch the pages manually if it wants to
     ensure they're in memory.

 (3) By the i915 driver to mark pinned address space until it's unpinned.  The
     amount of unevictable memory marked by the i915 driver is roughly the
     bounded object size in debugfs/dri/0/i915_gem_objects.


Detecting Unevictable Pages
---------------------------

The function page_evictable() in vmscan.c determines whether a page is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.
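
Conceptually (a simplified sketch, not the exact kernel source), the test
combines the AS_UNEVICTABLE query above with the PG_mlocked test described in
the following paragraphs::

  /* Sketch of the page_evictable() logic described in this section. */
  static bool page_evictable_sketch(struct page *page)
  {
          struct address_space *mapping = page_mapping(page);

          if (mapping && mapping_unevictable(mapping))
                  return false;           /* e.g. ramfs, SHM_LOCK'd shm */
          if (PageMlocked(page))
                  return false;           /* mapped into a VM_LOCKED VMA */
          return true;
  }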

For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list.  Instead, vmscan will do this if and when it encounters the pages during
a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must
scan the pages in the region and "rescue" them from the unevictable list if no
other condition is keeping them unevictable.  If an unevictable region is
destroyed, the pages are also "rescued" from the unevictable list in the
process of freeing them.

page_evictable() also checks for mlocked pages by testing an additional page
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.


Vmscan's Handling of Unevictable Pages
--------------------------------------

If unevictable pages are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list.  However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
unevictable list for the zone being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked.  Such pages will make it all the way to
shrink_page_list() where they will be detected when vmscan walks the reverse
map in try_to_unmap().  If try_to_unmap() returns SWAP_MLOCK,
shrink_page_list() will cull the page at that point.

To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
dropping the page lock.  Because the condition which makes the page unevictable
may change once the page is unlocked, putback_lru_page() will recheck the
unevictable state of a page that it places on the unevictable list.  If the
page has become evictable again, putback_lru_page() removes it from the list
and retries, including the page_evictable() test.
Because such a race is a rare
event and movement of pages onto the unevictable list should be rare, these
extra evictability checks should not occur in the majority of calls to
putback_lru_page().


MLOCKED Pages
=============

The unevictable page list is also useful for mlock(), in addition to ramfs and
SYSV SHM.  Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.


History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count
of VM_LOCKED VMAs that map the page.  This use of the link field for a count
prevented the management of the pages on an LRU list, and thus mlocked pages
were not migratable as isolate_lru_page() could not find them, and the LRU list
link field was not available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the LRU list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs.  When
Nick's patch was integrated with the Unevictable LRU work, the count was
replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
mapped the page.  More on this below.


Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
pages.  When such a page has been "noticed" by the memory management subsystem,
the page is marked with the PG_mlocked flag.  This can be manipulated using the
PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU.
Such pages can be "noticed" by memory management in several places:

 (1) in the mlock()/mlockall() system call handlers;

 (2) in the mmap() system call handler when mmapping a region with the
     MAP_LOCKED flag;

 (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
     flag;

 (4) in the fault path, if mlocked pages are "culled" there, and when a
     VM_LOCKED stack segment is expanded; or

 (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
     reclaim a page in a VM_LOCKED VMA via try_to_unmap();

all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
already have it set.

mlocked pages become unlocked and rescued from the unevictable list when:

 (1) mapped in a range unlocked via the munlock()/munlockall() system calls;

 (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
     unmapping at task exit;

 (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
     or

 (4) before a page is COW'd in a VM_LOCKED VMA.


mlock()/mlockall() System Call Handling
---------------------------------------

Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
for each VMA in the range specified by the call.  In the case of mlockall(),
this is the entire active address space of the task.  Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory.  A call to mlock()
an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
treated as a no-op, and mlock_fixup() simply returns.

If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA.  Once the
VMA has been merged or split or neither, mlock_fixup() will call
populate_vma_page_range() to fault in the pages via get_user_pages() and to
mark the pages as mlocked via mlock_vma_page().

Note that the VMA being mlocked might be mapped with PROT_NONE.  In this case,
get_user_pages() will be unable to fault in the pages.  That's okay.  If pages
do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
fault path or in vmscan.

Also note that a page returned by get_user_pages() could be truncated or
migrated out from under us while we're trying to mlock it.
To detect this,
populate_vma_page_range() checks page_mapping() after acquiring the page lock.
If the page is still associated with its mapping, we'll go ahead and call
mlock_vma_page().  If the mapping is gone, we just unlock the page and move on.
In the worst case, this will result in a page mapped in a VM_LOCKED VMA
remaining on a normal LRU list without being PageMlocked().  Again, vmscan will
detect and cull such pages.

mlock_vma_page() will call TestSetPageMlocked() for each page returned by
get_user_pages().  We use TestSetPageMlocked() because the page might already
be mlocked by another task/VMA and we don't want to do extra work.  We
especially do not want to count an mlocked page more than once in the
statistics.  If the page was already mlocked, mlock_vma_page() need do nothing
more.

If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time.  If isolate_lru_page() succeeds, mlock_vma_page() will put back
the page - by calling putback_lru_page() - which will notice that the page is
now mlocked and divert the page to the zone's unevictable list.  If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.


Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely.  The pages behind
   these mappings are inherently pinned, so we don't need to mark them as
   mlocked.  In any case, most of these pages have no struct page in which to
   mark them as mlocked.  Because of this, get_user_pages() will fail for these
   VMAs, so there is no sense in attempting to visit them.

2) VMAs mapping hugetlbfs pages are already effectively pinned into memory.  We
   neither need nor want to mlock() these pages.  However, to preserve the
   prior behavior of mlock() - before the unevictable/mlock changes -
   mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
   allocate the huge pages and populate the ptes.

3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
   such as the VDSO page, relay channel pages, etc.  These pages
   are inherently unevictable and are not managed on the LRU lists.
   mlock_fixup() treats these VMAs the same as hugetlbfs VMAs.  It calls
   make_pages_present() to populate the ptes.

Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag.
Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit.  Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".

.. _munlock_munlockall_handling:

munlock()/munlockall() System Call Handling
-------------------------------------------

The munlock() and munlockall() system calls are handled by the same functions -
do_mlock[all]() - as the mlock() and mlockall() system calls, with the unlock
vs lock operation indicated by an argument.  So, these system calls are also
handled by mlock_fixup().  Again, if called for an already munlocked VMA,
mlock_fixup() simply returns.  Because of the VMA filtering discussed above,
VM_LOCKED will not be set in any "special" VMAs.  So, these VMAs will be
ignored for munlock.

If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
specified range.  The range is then munlocked via the function
populate_vma_page_range() - the same function used to mlock a VMA range -
passing a flag to indicate that munlock() is being performed.

Because the VMA access protections could have been changed to PROT_NONE after
faulting in and mlocking pages, get_user_pages() was unreliable for visiting
these pages for munlocking.  Because we don't want to leave pages mlocked,
get_user_pages() was enhanced to accept a flag to ignore the permissions when
fetching the pages - all of which should be resident as a result of previous
mlocking.

For munlock(), populate_vma_page_range() unlocks individual pages by calling
munlock_vma_page().  munlock_vma_page() unconditionally clears the PG_mlocked
flag using TestClearPageMlocked().  As with mlock_vma_page(),
munlock_vma_page() uses the Test*PageMlocked() function to handle the case
where the page might have already been unlocked by another task.  If the page
was mlocked, munlock_vma_page() updates the zone statistics for the number of
mlocked pages.  Note, however, that at this point we haven't checked whether
the page is mapped by other VM_LOCKED VMAs.

We can't call try_to_munlock(), the function that walks the reverse map to
check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
not be on an LRU list [more on these below].  However, the call to
isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
we go ahead and clear PG_mlocked up front, as this might be the only chance we
have.
If we can successfully isolate the page, we go ahead and call
try_to_munlock(), which will restore the PG_mlocked flag and update the zone
page statistics if it finds another VMA holding the page mlocked.  If we fail
to isolate the page, we'll have left a potentially mlocked page on the LRU.
This is fine, because we'll catch it later if and when vmscan tries to reclaim
the page.  This should be relatively rare.


Migrating MLOCKED Pages
-----------------------

A page that is being migrated has been isolated from the LRU lists and is held
locked across unmapping of the page, updating the page's address space entry
and copying the contents and state, until the page table entry has been
replaced with an entry that refers to the new page.  Linux supports migration
of mlocked pages and other unevictable pages.  This involves simply moving the
PG_mlocked and PG_unevictable states from the old page to the new page.

Note that page migration can race with mlocking or munlocking of the same page.
This has been discussed from the mlock/munlock perspective in the respective
sections above.  Both processes (migration and m[un]locking) hold the page
locked.  This provides the first level of synchronization.  Page migration
zeros out the page_mapping of the old page before unlocking it, so m[un]lock
can skip these pages by testing the page mapping under page lock.

To complete page migration, we place the new and old pages back onto the LRU
after dropping the page lock.  The "unneeded" page - old page on success, new
page on failure - will be freed when the reference count held by the migration
process is released.  To ensure that we don't strand pages on the unevictable
list because of a race between munlock and migration, page migration uses the
putback_lru_page() function to add migrated pages back to the LRU.


Compacting MLOCKED Pages
------------------------

The unevictable LRU can be scanned for compactable regions, and the default
behavior is to do so.  /proc/sys/vm/compact_unevictable_allowed controls
this behavior (see Documentation/admin-guide/sysctl/vm.rst).  Once scanning of
the unevictable LRU is enabled, the work of compaction is mostly handled by
the page migration code and the same work flow as described in Migrating
MLOCKED Pages above will apply.


MLOCKING Transparent Huge Pages
-------------------------------

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We cannot just split the page on partial mlock() as split_huge_page() can
fail, and a new intermittent failure mode for the syscall is undesirable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

This way the huge page is accessible for vmscan.  Under memory pressure the
page will be split, subpages which belong to VM_LOCKED VMAs will be moved
to the unevictable LRU and the rest can be reclaimed.

See also the comment in follow_trans_huge_pmd().


mmap(MAP_LOCKED) System Call Handling
-------------------------------------

In addition to the mlock()/mlockall() system calls, an application can request
that a region of memory be mlocked by supplying the MAP_LOCKED flag to the
mmap() call.  There is one important and subtle difference here, though.
mmap() + mlock() will fail if the range cannot be faulted in (e.g. because
mm_populate() fails), returning ENOMEM, while mmap(MAP_LOCKED) will not fail.
The mmapped area will still have properties of the locked area - i.e. pages
will not get swapped out - but major page faults to fault memory in might
still happen.

Furthermore, any mmap() call, or brk() call that expands the heap, by a task
that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked.  Before the unevictable/mlock
changes, the kernel simply called make_pages_present() to allocate pages and
populate the page table.

To mlock a range of memory under the unevictable/mlock infrastructure, the
mmap() handler and task address space expansion functions call
populate_vma_page_range(), specifying the VMA and the address range to mlock.

The callers of populate_vma_page_range() will have already added the memory
range to be mlocked to the task's "locked_vm".  To account for filtered VMAs,
populate_vma_page_range() returns the number of pages NOT mlocked.  All of the
callers then subtract a non-negative return value from the task's locked_vm.
A negative return value represents an error - for example, from
get_user_pages() attempting to fault in a VMA with PROT_NONE access.  In this
case, we leave the memory range accounted as locked_vm, as the protections
could be changed later and pages allocated into that region.
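
The userspace-visible difference between mmap() + mlock() and mmap(MAP_LOCKED)
described above can be demonstrated with a short program.  This is purely
illustrative (error handling trimmed) and is not taken from the kernel tree::

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 4UL << 20;         /* 4 MiB */

          /* mmap() + mlock(): mlock() reports ENOMEM if the range cannot
           * be fully faulted in (or exceeds RLIMIT_MEMLOCK). */
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p != MAP_FAILED && mlock(p, len) != 0)
                  fprintf(stderr, "mlock: %s\n", strerror(errno));

          /* mmap(MAP_LOCKED): the mapping is created VM_LOCKED, but a
           * failure to populate it is not reported by this call. */
          void *q = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
          if (q == MAP_FAILED)
                  fprintf(stderr, "mmap(MAP_LOCKED): %s\n", strerror(errno));

          return 0;
  }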


munmap()/exit()/exec() System Call Handling
-------------------------------------------

When unmapping an mlocked region of memory, whether by an explicit call to
munmap() or via an internal unmap from exit() or exec() processing, we must
munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

To munlock a range of memory under the unevictable/mlock infrastructure, the
munmap() handler and the task address space teardown function call
munlock_vma_pages_all().  The name reflects the observation that one always
specifies the entire VMA range when munlock()ing during unmap of a region.
Because of the VMA filtering when mlocking regions, only "normal" VMAs that
actually contain mlocked pages will be passed to munlock_vma_pages_all().

munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
for the munlock case, calls __munlock_vma_pages_range() to walk the page table
for the VMA's memory range and munlock_vma_page() each resident page mapped by
the VMA.  This effectively munlocks the page, but only if this is the last
VM_LOCKED VMA that maps the page.


try_to_unmap()
--------------

Pages can, of course, be mapped into multiple VMAs.  Some of these VMAs may
have the VM_LOCKED flag set.  It is possible for a page mapped into one or more
VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
of the active or inactive LRU lists.  This could happen if, for example, a task
in the process of munlocking the page could not isolate the page from the LRU.
As a result, vmscan/shrink_page_list() might encounter such a page as described
in section "Vmscan's Handling of Unevictable Pages".  To handle this situation,
try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
map.

try_to_unmap() is always called, by either vmscan for reclaim or for page
migration, with the argument page locked and isolated from the LRU.  Separate
functions handle anonymous, mapped file and KSM pages, as these types of
pages have different reverse map lookup mechanisms, with different locking.
In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
it will call try_to_unmap_one() for every VMA which might contain the page.
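
In each case the walker applies the check described in the next paragraph to
every such VMA; schematically (a sketch only, with locking and flag handling
omitted)::

  /* Sketch of the per-VMA check made during the reclaim rmap walk. */
  static int unmap_one_vma_sketch(struct page *page, struct vm_area_struct *vma)
  {
          if (vma->vm_flags & VM_LOCKED) {
                  mlock_vma_page(page);   /* mlock it instead of unmapping it */
                  return SWAP_MLOCK;      /* page is unevictable; stop the walk */
          }
          /* ... otherwise unmap the PTE(s) and continue the walk ... */
          return SWAP_AGAIN;
  }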

When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
and return SWAP_MLOCK to indicate that the page is unevictable; the scan
stops there.

mlock_vma_page() is called while holding the page table's lock (in addition
to the page lock, and the rmap lock): to serialize against concurrent mlock or
munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
holepunching, and truncation of file pages and their anonymous COWed pages.


try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
   page_referenced() reverse map walker.

When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
Handling <munlock_munlockall_handling>` above] tries to munlock a
page, it needs to determine whether or not the page is mapped by any
VM_LOCKED VMA without actually attempting to unmap all PTEs from the
page.  For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock().

try_to_munlock() calls the same functions as try_to_unmap() for anonymous,
mapped file and KSM pages, with a flag argument specifying unlock versus unmap
processing.  Again, these functions walk the respective reverse maps looking
for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page().

Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
However, the scan can terminate when it encounters a VM_LOCKED VMA.
Although try_to_munlock() might be called a great many times when munlocking a
large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event.


Page Reclaim in shrink_*_list()
-------------------------------

shrink_active_list() culls any obviously unevictable pages - i.e.
!page_evictable(page) - diverting these to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive LRU lists.
Note that these pages do not have PageUnevictable
set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.

Some examples of these unevictable pages on the LRU lists are:

 (1) ramfs pages that have been placed on the LRU lists when first allocated.

 (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
     allocate or fault in the pages in the shared memory region.  This happens
     when an application accesses the page the first time after SHM_LOCK'ing
     the segment.

 (3) mlocked pages that could not be isolated from the LRU and moved to the
     unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
inactive lists to the appropriate zone's unevictable list.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
but will pass them on to shrink_page_list().

shrink_page_list() again culls obviously unevictable pages that it could
encounter for similar reasons to shrink_inactive_list().  Pages mapped into
VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
try_to_unmap().  shrink_page_list() will divert them to the unevictable list
when try_to_unmap() returns SWAP_MLOCK, as discussed above.
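
Putting the pieces of this section together, the per-page handling amounts to
roughly the following sketch (not the verbatim kernel source; error paths and
the other try_to_unmap() return values are omitted)::

  /* Sketch: how shrink_page_list() disposes of a page w.r.t. evictability. */
  static void shrink_one_page_sketch(struct page *page, enum ttu_flags ttu)
  {
          if (!page_evictable(page))      /* PG_mlocked or AS_UNEVICTABLE */
                  goto cull_mlocked;

          if (page_mapped(page) &&
              try_to_unmap(page, ttu) == SWAP_MLOCK) /* VM_LOCKED VMA found */
                  goto cull_mlocked;

          /* ... the page continues towards pageout()/reclaim ... */
          return;

  cull_mlocked:
          putback_lru_page(page);         /* divert to the unevictable list */
  }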