.. _hugetlbfs_reserve:

=====================
Hugetlbfs Reservation
=====================

Overview
========

Huge pages as described at :ref:`hugetlbpage` are typically preallocated
for application use.  These huge pages are instantiated in a task's address
space at page fault time if the VMA indicates huge pages are to be used.
If no huge page exists at page fault time, the task is sent a SIGBUS and
often dies an unhappy death.  Shortly after huge page support was added,
it was determined that it would be better to detect a shortage of huge
pages at mmap() time.  The idea is that if there were not enough huge
pages to cover the mapping, the mmap() would fail.  This was first done
with a simple check in the code at mmap() time to determine if there were
enough free huge pages to cover the mapping.  Like most things in the
kernel, the code has evolved over time.  However, the basic idea was to
'reserve' huge pages at mmap() time to ensure that huge pages would be
available for page faults in that mapping.  The description below attempts
to describe how huge page reserve processing is done in the v4.10 kernel.


Audience
========
This description is primarily targeted at kernel developers who are modifying
hugetlbfs code.


The Data Structures
===================

resv_huge_pages
        This is a global (per-hstate) count of reserved huge pages.  Reserved
        huge pages are only available to the task which reserved them.
        Therefore, the number of huge pages generally available is computed
        as (``free_huge_pages - resv_huge_pages``).
Reserve Map
        A reserve map is described by the structure::

                struct resv_map {
                        struct kref refs;
                        spinlock_t lock;
                        struct list_head regions;
                        long adds_in_progress;
                        struct list_head region_cache;
                        long region_cache_count;
                };

        There is one reserve map for each huge page mapping in the system.
        The regions list within the resv_map describes the regions within
        the mapping.  A region is described as::

                struct file_region {
                        struct list_head link;
                        long from;
                        long to;
                };

        The 'from' and 'to' fields of the file region structure are huge page
        indices into the mapping.  Depending on the type of mapping, a
        region in the resv_map may indicate reservations exist for the
        range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations
        These are stored in the bottom bits of the reservation map pointer.

        ``#define HPAGE_RESV_OWNER    (1UL << 0)``
                Indicates this task is the owner of the reservations
                associated with the mapping.
        ``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
                Indicates the task that originally mapped this range (and
                created the reserves) has unmapped a page from this task
                (the child) due to a failed COW.
Page Flags
        The PagePrivate page flag is used to indicate that a huge page
        reservation must be restored when the huge page is freed.  More
        details will be discussed in the "Freeing Huge Pages" section.


Reservation Map Location (Private or Shared)
============================================

A huge page mapping or segment is either private or shared.  If private,
it is typically only available to a single address space (task).  If shared,
it can be mapped into multiple address spaces (tasks).  The location and
semantics of the reservation map are significantly different for the two
types of mappings.  Location differences are:

- For private mappings, the reservation map hangs off the VMA structure.
  Specifically, vma->vm_private_data.  This reserve map is created at the
  time the mapping (mmap(MAP_PRIVATE)) is created.
- For shared mappings, the reservation map hangs off the inode.  Specifically,
  inode->i_mapping->private_data.  Since shared mappings are always backed
  by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
  contains a reservation map.  As a result, the reservation map is allocated
  when the inode is created.


Creating Reservations
=====================
Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages()::

        int hugetlb_reserve_pages(struct inode *inode,
                                  long from, long to,
                                  struct vm_area_struct *vma,
                                  vm_flags_t vm_flags)

The first thing hugetlb_reserve_pages() does is check if the NORESERVE
flag was specified in either the shmget() or mmap() call.  If NORESERVE
was specified, then this routine returns immediately as no reservations
are desired.

The arguments 'from' and 'to' are huge page indices into the mapping or
underlying file.  For shmget(), 'from' is always 0 and 'to' corresponds to
the length of the segment/mapping.
For mmap(), the offset argument could be used to specify the offset into
the underlying file.  In such a case, the 'from' and 'to' arguments have
been adjusted by this offset.

One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map.

- For shared mappings, an entry in the reservation map indicates a reservation
  exists or did exist for the corresponding page.  As reservations are
  consumed, the reservation map is not modified.
- For private mappings, the lack of an entry in the reservation map indicates
  a reservation exists for the corresponding page.  As reservations are
  consumed, entries are added to the reservation map.  Therefore, the
  reservation map can also be used to determine which reservations have
  been consumed.

For private mappings, hugetlb_reserve_pages() creates the reservation map and
hangs it off the VMA structure.  In addition, the HPAGE_RESV_OWNER flag is set
to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment.  For private mappings, this is
always the value (to - from).  However, for shared mappings it is possible
that some reservations may already exist within the range (to - from).  See
the section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.

The mapping may be associated with a subpool.  If so, the subpool is consulted
to ensure there is sufficient space for the mapping.  It is possible that the
subpool has set aside reservations that can be used for the mapping.  See the
section :ref:`Subpool Reservations <sub_pool_resv>` for more details.

After consulting the reservation map and subpool, the number of needed new
reservations is known.  The routine hugetlb_acct_memory() is called to check
for and take the requested number of reservations.  hugetlb_acct_memory()
calls into routines that potentially allocate and adjust surplus page counts.
However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation.  If there are,
the global reservation count resv_huge_pages is adjusted something like the
following::

        if (resv_needed <= (free_huge_pages - resv_huge_pages))
                resv_huge_pages += resv_needed;

Note that the global lock hugetlb_lock is held when checking and adjusting
these counters.

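
To make the locking and accounting relationship concrete, the check can be
sketched as below.  This is only an illustrative simplification of what
hugetlb_acct_memory() and the routines it calls end up doing (the real code
also deals with allocating surplus pages); the function name is invented for
this document, while hugetlb_lock and the free_huge_pages/resv_huge_pages
fields of struct hstate are real::

        /* Illustrative sketch only -- not the actual kernel routine. */
        static int reserve_pages_sketch(struct hstate *h, long resv_needed)
        {
                int ret = -ENOMEM;

                spin_lock(&hugetlb_lock);
                /* Pages not already spoken for are (free - reserved). */
                if (resv_needed <= (h->free_huge_pages - h->resv_huge_pages)) {
                        h->resv_huge_pages += resv_needed;
                        ret = 0;
                }
                spin_unlock(&hugetlb_lock);

                return ret;
        }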

If there were enough free huge pages and the global count resv_huge_pages
was adjusted, then the reservation map associated with the mapping is
modified to reflect the reservations.  In the case of a shared mapping, a
file_region will exist that includes the range 'from' - 'to'.  For private
mappings, no modifications are made to the reservation map as lack of an
entry indicates a reservation exists.

If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'.

.. _consume_resv:

Consuming Reservations/Allocating a Huge Page
=============================================

Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping.  The allocation
is performed within the routine alloc_huge_page()::

        struct page *alloc_huge_page(struct vm_area_struct *vma,
                                     unsigned long addr, int avoid_reserve)

alloc_huge_page is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists.  In
addition, alloc_huge_page takes the argument avoid_reserve which indicates
reserves should not be used even if it appears they have been set aside for
the specified address.  The avoid_reserve argument is most often used in the
case of Copy on Write and Page Migration where additional copies of an
existing page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping (vma).  See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
information on what this routine does.
The value returned from vma_needs_reservation() is generally 0 or 1:
0 if a reservation exists for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the
mapping, the subpool is consulted to determine if it contains reservations.
If the subpool contains reservations, one can be used for this allocation.
However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation.  After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called.

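
The way these checks combine can be sketched as follows.  This is a
simplified view of the logic in alloc_huge_page(), with error handling and
the later commit step omitted; the local variable names are chosen for this
illustration only::

        long map_chg, chg;

        /* 0 if the reserve map shows a reservation, 1 if it does not */
        map_chg = chg = vma_needs_reservation(h, vma, addr);

        /*
         * No reservation in the map, or reserves are to be avoided:
         * charge the subpool.  A return of 0 from the subpool means a
         * subpool reserve covers this page, but avoid_reserve forces
         * the page to come from the global free pool anyway.
         */
        if (map_chg || avoid_reserve) {
                chg = hugepage_subpool_get_pages(spool, 1);
                if (avoid_reserve)
                        chg = 1;
        }

        page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);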

dequeue_huge_page_vma() takes two arguments related to reservations:

- avoid_reserve, this is the same value/argument passed to alloc_huge_page().
- chg, even though this argument is of type long, only the values 0 or 1 are
  passed to dequeue_huge_page_vma.  If the value is 0, it indicates a
  reservation exists (see the section "Reservations and Memory Policy" for
  possible issues).  If the value is 1, it indicates a reservation does not
  exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for
a free page.  If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list.  If there was a reservation
associated with the page, the following adjustments are made::

        SetPagePrivate(page);   /* Indicates allocating this page consumed
                                 * a reservation, and if an error is
                                 * encountered such that the page must be
                                 * freed, the reservation will be restored. */
        resv_huge_pages--;      /* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator.  This
brings up the issue of surplus huge pages and overcommit which is beyond
the scope of reservations.  Even if a surplus page is allocated, the same
reservation based adjustments as above will be made: SetPagePrivate(page) and
resv_huge_pages--.

After obtaining a new huge page, (page)->private is set to the value of
the subpool associated with the page if it exists.  This will be used for
subpool accounting when the page is freed.

The routine vma_commit_reservation() is then called to adjust the reserve
map based on the consumption of the reservation.  In general, this involves
ensuring the page is represented within a file_region structure of the region
map.  For shared mappings where the reservation was present, an entry
in the reserve map already existed so no change is made.  However, if there
was no reservation in a shared mapping or this was a private mapping a new
entry must be created.

It is possible that the reserve map could have been changed between the call
to vma_needs_reservation() at the beginning of alloc_huge_page() and the
call to vma_commit_reservation() after the page was allocated.  This would
be possible if hugetlb_reserve_pages was called for the same page in a shared
mapping.  In such cases, the reservation count and subpool free page count
will be off by one.
This rare condition can be identified by comparing the return value from
vma_needs_reservation and vma_commit_reservation.  If such a race is
detected, the subpool and global reserve counts are adjusted to compensate.
See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines.


Instantiate Huge Pages
======================

After huge page allocation, the page is typically added to the page tables
of the allocating task.  Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous
reverse mapping.  In both cases, the PagePrivate flag is cleared.  Therefore,
when a huge page that has been instantiated is freed no adjustment is made
to the global reservation count (resv_huge_pages).


Freeing Huge Pages
==================

Huge page freeing is performed by the routine free_huge_page().  This routine
is the destructor for hugetlbfs compound pages.  As a result, it is only
passed a pointer to the page struct.  When a huge page is freed, reservation
accounting may need to be performed.  This would be the case if the page was
associated with a subpool that contained reserves, or the page is being freed
on an error path where a global reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page.  If this
routine returns a value of 0 (which does not equal the value of 1 that was
passed), it indicates reserves are associated with the subpool, and this
newly freed page must be used to keep the number of subpool reserves above
the minimum size.  Therefore, the global resv_huge_pages counter is
incremented in this case.

If the PagePrivate flag was set in the page, the global resv_huge_pages
counter will always be incremented.

.. _sub_pool_resv:

Subpool Reservations
====================

There is a struct hstate associated with each huge page size.  The hstate
tracks all huge pages of the specified size.  A subpool represents a subset
of pages within a hstate that is associated with a mounted hugetlbfs
filesystem.

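
The bookkeeping carried by a subpool can be pictured with the following
simplified version of the structure (abridged from the kernel sources of
this era; the field comments are paraphrased here)::

        struct hugepage_subpool {
                spinlock_t lock;
                long count;
                long max_hpages;        /* Maximum huge pages, or -1 if none */
                long used_hpages;       /* Pages used against the maximum,
                                         * both allocated and reserved */
                struct hstate *hstate;  /* Huge page size of this subpool */
                long min_hpages;        /* Minimum huge pages, or -1 if none */
                long rsv_hpages;        /* Pages reserved against the global
                                         * pool to satisfy min_hpages */
        };

The min_hpages and rsv_hpages fields are the ones involved in the
reservation handling described below; used_hpages is what the get/put
routines adjust.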

When a hugetlbfs filesystem is mounted a min_size option can be specified
which indicates the minimum number of huge pages required by the filesystem.
If this option is specified, the number of huge pages corresponding to
min_size are reserved for use by the filesystem.  This number is tracked in
the min_hpages field of a struct hugepage_subpool.  At mount time,
hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
huge pages.  If they can not be reserved, the mount fails.

The routines hugepage_subpool_get/put_pages() are called when pages are
obtained from or released back to a subpool.  They perform all subpool
accounting, and track any reservations associated with the subpool.
hugepage_subpool_get/put_pages are passed the number of huge pages by which
to adjust the subpool 'used page' count (up for get, down for put).  Normally,
they return the same value that was passed or an error if not enough pages
exist in the subpool.

However, if reserves are associated with the subpool a return value less
than the passed value may be returned.  This return value indicates the
number of additional global pool adjustments which must be made.  For example,
suppose a subpool contains 3 reserved huge pages and someone asks for 5.
The 3 reserved pages associated with the subpool can be used to satisfy part
of the request.  But, 2 pages must be obtained from the global pools.  To
relay this information to the caller, the value 2 is returned.  The caller
is then responsible for attempting to obtain the additional two pages from
the global pools.


COW and Reservations
====================

Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings.  In this case,
two tasks can be pointing at the same previously allocated page.  One task
attempts to write to the page, so a new page must be allocated so that each
task points to its own page.

When the page was originally allocated, the reservation for that page was
consumed.  When an attempt to allocate a new page is made as a result of
COW, it is possible that no free huge pages are available and the allocation
will fail.

When the private mapping was originally created, the owner of the mapping
was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the
reservation map of the owner.  Since the owner created the mapping, the
owner owns all the reservations associated with the mapping.
Therefore, when a write fault occurs and there is no page available,
different action is taken for the owner and non-owner of the reservation.

In the case where the faulting task is not the owner, the fault will fail and
the task will typically receive a SIGBUS.

If the owner is the faulting task, we want it to succeed since it owned the
original reservation.  To accomplish this, the page is unmapped from the
non-owning task.  In this way, the only reference is from the owning task.
In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
of the non-owning task.  The non-owning task may receive a SIGBUS if it later
faults on a non-present page.  But, the original owner of the
mapping/reservation will behave as expected.


.. _resv_map_modifications:

Reservation Map Modifications
=============================

The following low level routines are used to make modifications to a
reservation map.  Typically, these routines are not called directly.  Rather,
a reservation map helper routine is called which calls one of these low level
routines.  These low level routines are fairly well documented in the source
code (mm/hugetlb.c).  These routines are::

        long region_chg(struct resv_map *resv, long f, long t);
        long region_add(struct resv_map *resv, long f, long t);
        void region_abort(struct resv_map *resv, long f, long t);
        long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two operations:

1) region_chg() is called to examine the reserve map and determine how
   many pages in the specified range [f, t) are NOT currently represented.

   The calling code performs global checks and allocations to determine if
   there are enough huge pages for the operation to succeed.

2)
   a) If the operation can succeed, region_add() is called to actually modify
      the reservation map for the same range [f, t) previously passed to
      region_chg().
   b) If the operation can not succeed, region_abort() is called for the same
      range [f, t) to abort the operation.

Note that this is a two-step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same
range.  region_chg() is responsible for pre-allocating any data structures
necessary to ensure the subsequent operations (specifically region_add())
will succeed.

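
From a caller's point of view, the two-step protocol looks roughly like the
sketch below.  The helper global_checks_succeed() is hypothetical and stands
in for the hugetlb_acct_memory()/subpool work described earlier; the
region_*() calls are the real low level routines::

        /* Illustrative only: reserve the huge page range [f, t) in 'resv'. */
        static long reserve_range_sketch(struct resv_map *resv, long f, long t)
        {
                long chg;

                /* Step 1: how many pages in [f, t) are not yet in the map? */
                chg = region_chg(resv, f, t);
                if (chg < 0)
                        return chg;

                if (!global_checks_succeed(chg)) {      /* hypothetical */
                        /* Release anything region_chg() pre-allocated. */
                        region_abort(resv, f, t);
                        return -ENOMEM;
                }

                /* Step 2: guaranteed to succeed after region_chg() above. */
                return region_add(resv, f, t);
        }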

As mentioned above, region_chg() determines the number of pages in the range
which are NOT currently represented in the map.  This number is returned to
the caller.  region_add() returns the number of pages in the range added to
the map.  In most cases, the return value of region_add() is the same as the
return value of region_chg().  However, in the case of shared mappings it is
possible for changes to the reservation map to be made between the calls to
region_chg() and region_add().  In this case, the return value of region_add()
will not match the return value of region_chg().  It is likely that in such
cases global counts and subpool accounting will be incorrect and in need of
adjustment.  It is the responsibility of the caller to check for this
condition and make the appropriate adjustments.

The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations:

- When a file in the hugetlbfs filesystem is being removed, the inode will
  be released and the reservation map freed.  Before freeing the reservation
  map, all the individual file_region structures must be freed.  In this case
  region_del is passed the range [0, LONG_MAX).
- When a hugetlbfs file is being truncated.  In this case, all allocated pages
  after the new file size must be freed.  In addition, any file_region entries
  in the reservation map past the new end of file must be deleted.  In this
  case, region_del is passed the range [new_end_of_file, LONG_MAX).
- When a hole is being punched in a hugetlbfs file.  In this case, huge pages
  are removed from the middle of the file one at a time.  As the pages are
  removed, region_del() is called to remove the corresponding entry from the
  reservation map.  In this case, region_del is passed the range
  [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the
reservation map.  In VERY rare cases, region_del() can fail.  This can only
happen in the hole punch case where it has to split an existing file_region
entry and can not allocate a new structure.  In this error case, region_del()
will return -ENOMEM.  The problem here is that the reservation map will
indicate that there is a reservation for the page.  However, the subpool and
global reservation counts will not reflect the reservation.  To handle this
situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
counters so that they correspond with the reservation map entry that could
not be deleted.

region_count() is called when unmapping a private huge page mapping.
In private mappings, the lack of an entry in the reservation map indicates
that a reservation exists.  Therefore, by counting the number of entries in
the reservation map we know how many reservations were consumed and how many
are outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations.

.. _resv_map_helpers:

Reservation Map Helper Routines
===============================

Several helper routines exist to query and modify the reservation maps.
These routines are only concerned with reservations for a specific huge
page, so they just pass in an address instead of a range.  In addition,
they pass in the associated VMA.  From the VMA, the type of mapping (private
or shared) and the location of the reservation map (inode or VMA) can be
determined.  These routines simply call the underlying routines described
in the section "Reservation Map Modifications".  However, they do take into
account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller::

        long vma_needs_reservation(struct hstate *h,
                                   struct vm_area_struct *vma,
                                   unsigned long addr)

This routine calls region_chg() for the specified page.  If no reservation
exists, 1 is returned.  If a reservation exists, 0 is returned::

        long vma_commit_reservation(struct hstate *h,
                                    struct vm_area_struct *vma,
                                    unsigned long addr)

This calls region_add() for the specified page.  As in the case of region_chg
and region_add, this routine is to be called after a previous call to
vma_needs_reservation.  It will add a reservation entry for the page.  It
returns 1 if the reservation was added and 0 if not.  The return value should
be compared with the return value of the previous call to
vma_needs_reservation.  An unexpected difference indicates the reservation
map was modified between calls::

        void vma_end_reservation(struct hstate *h,
                                 struct vm_area_struct *vma,
                                 unsigned long addr)

This calls region_abort() for the specified page.  As in the case of region_chg
and region_abort, this routine is to be called after a previous call to
vma_needs_reservation.
It will abort/end the in-progress reservation add operation::

        long vma_add_reservation(struct hstate *h,
                                 struct vm_area_struct *vma,
                                 unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup
on error paths.  It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation in an attempt
to add a reservation to the reservation map.  It takes into account the
different reservation map semantics for private and shared mappings.  Hence,
region_add is called for shared mappings (as an entry present in the map
indicates a reservation), and region_del is called for private mappings (as
the absence of an entry in the map indicates a reservation).  See the section
"Reservation Cleanup in Error Paths" for more information on what needs to
be done on error paths.


Reservation Cleanup in Error Paths
==================================

As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps.  First vma_needs_reservation
is called before a page is allocated.  If the allocation is successful,
then vma_commit_reservation is called.  If not, vma_end_reservation is called.
Global and subpool reservation counts are adjusted based on success or failure
of the operation and all is well.

Additionally, after a huge page is instantiated the PagePrivate flag is
cleared so that accounting when the page is ultimately freed is correct.

However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated.  In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments.  If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_page
will increment the global reservation count.  However, the reservation map
indicates the reservation was consumed.  This resulting inconsistent state
will cause the 'leak' of a reserved huge page.  The global reserve count will
be higher than it should be and prevent allocation of a pre-allocated page.

The routine restore_reserve_on_error() attempts to handle this situation.  It
is fairly well documented.  The intention of this routine is to restore
the reservation map to the way it was before the page allocation.  In this
way, the state of the reservation map will correspond to the global
reservation count after the page is freed.

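
The shape of the error handling just described looks roughly like the sketch
below.  The instantiation step is represented by a hypothetical helper;
alloc_huge_page() and restore_reserve_on_error() are the real routines, with
arguments shown as in the discussion above::

        page = alloc_huge_page(vma, addr, 0);
        if (IS_ERR(page))
                return VM_FAULT_OOM;

        if (instantiate_page_sketch(page, vma, addr)) { /* hypothetical */
                /*
                 * Error after allocation but before instantiation: the
                 * reservation was consumed and PagePrivate is still set.
                 * Put the reserve map back so it agrees with the global
                 * count adjustment free_huge_page() will make.
                 */
                restore_reserve_on_error(h, vma, addr, page);
                put_page(page);
                return VM_FAULT_SIGBUS;
        }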

The routine restore_reserve_on_error itself may encounter errors while
attempting to restore the reservation map entry.  In this case, it will
simply clear the PagePrivate flag of the page.  In this way, the global
reserve count will not be incremented when the page is freed.  However, the
reservation map will continue to look as though the reservation was consumed.
A page can still be allocated for the address, but it will not use a reserved
page as originally intended.

There is some code (most notably userfaultfd) which can not call
restore_reserve_on_error.  In this case, it simply modifies the PagePrivate
flag so that a reservation will not be leaked when the huge page is freed.


Reservations and Memory Policy
==============================
Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code.  The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy
into account.  While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory sums up the interaction between reservations
and cpusets/memory policy::

        /*
         * When cpuset is configured, it breaks the strict hugetlb page
         * reservation as the accounting is done on a global variable. Such
         * reservation is completely rubbish in the presence of cpuset because
         * the reservation is not checked against page availability for the
         * current cpuset. Application can still potentially OOM'ed by kernel
         * with lack of free htlb page in cpuset that the task is in.
         * Attempt to enforce strict accounting with cpuset is almost
         * impossible (or too ugly) because cpuset is too fluid that
         * task or memory node can be dynamically moved between cpusets.
         *
         * The change of semantics for shared hugetlb mapping with cpuset is
         * undesirable. However, in order to preserve some of the semantics,
         * we fall back to check against current free page availability as
         * a best attempt and hopefully to minimize the impact of changing
         * semantics that cpuset has.
         */

Huge page reservations were added to prevent unexpected page allocation
failures (OOM) at page fault time.  However, if an application makes use
of cpusets or memory policy there is no guarantee that huge pages will be
available on the required nodes.  This is true even if there are a sufficient
number of global reservations.

Hugetlbfs Regression Testing
============================

The most complete set of hugetlb tests is in the libhugetlbfs repository.
If you modify any hugetlb related code, use the libhugetlbfs test suite
to check for regressions.  In addition, if you add any new hugetlb
functionality, please add appropriate tests to libhugetlbfs.

--
Mike Kravetz, 7 April 2017