.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means
of using huge pages for the backing of virtual memory that supports
the automatic promotion and demotion of page sizes, without the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and
tmpfs/shmem, but in the future it can expand to other filesystems.

.. note::
   in the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists in taking a single page fault for each 2M
virtual region touched by userland (so reducing the enter/exit kernel
frequency by a factor of 512). This only matters the first time the
memory is accessed for the lifetime of a memory mapping. The second,
long lasting and much more important factor affects all subsequent
accesses to the memory for the whole runtime of the application. The
second factor consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables, TLB entries of the larger
   size can be used only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, just because the TLB miss is going to
   run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.
The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows
paging and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, just as they have been optimized in the past to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it, in which case a 2M page
might be allocated instead of a 4k page for no good reason. This is
why it's possible to disable hugepages system-wide and to only have
them inside MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions,
to eliminate any risk of wasting any precious byte of memory, so that
hugepages can only make them run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
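Before relying on madvise(MADV_HUGEPAGE), an application or
administrator can check that the running kernel was built with THP and
see which policy is active: the sysfs file described in the next
section reports the active value in brackets (the output below is
illustrative)::

	$ cat /sys/kernel/mm/transparent_hugepage/enabled
	always [madvise] never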
.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely
disabled (mostly for debugging purposes) or only enabled inside
MADV_HUGEPAGE regions (to avoid the risk of consuming more memory
resources) or enabled system wide. This can be achieved with one of::

	echo always >/sys/kernel/mm/transparent_hugepage/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the VM's defrag efforts to generate
anonymous hugepages, in case they're not immediately free, to madvise
regions only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
by the fact we use hugepages later instead of regular pages. This
isn't always guaranteed, but it may be more likely in case the
allocation is for a MADV_HUGEPAGE region.

::

	echo always >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
	means that an application requesting THP will stall on
	allocation failure and directly reclaim pages and compact
	memory in an effort to allocate a THP immediately. This may be
	desirable for virtual machines that benefit heavily from THP
	use and are willing to delay the VM start to utilise them.

defer
	means that an application will wake kswapd in the background
	to reclaim pages and wake kcompactd to compact memory so that
	THP is available in the near future. It's the responsibility
	of khugepaged to then install the THP pages later.

defer+madvise
	will enter direct reclaim and compaction like ``always``, but
	only for regions that have used madvise(MADV_HUGEPAGE); all
	other regions will wake kswapd in the background to reclaim
	pages and wake kcompactd to compact memory so that THP is
	available in the near future.

madvise
	will enter direct reclaim like ``always`` but only for regions
	that have used madvise(MADV_HUGEPAGE). This is the default
	behaviour.

never
	should be self-explanatory.
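As with ``enabled``, the current defrag policy can be read back from
the same file, with the active value shown in brackets (output is
illustrative)::

	$ cat /sys/kernel/mm/transparent_hugepage/defrag
	always defer defer+madvise [madvise] never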
By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
transparent hugepage::

	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it
will be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt::

	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of completed scan passes::

	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means programs may use more memory, since pages that
were never touched can be allocated as part of a collapse; a lower
value means fewer regions qualify for collapse, so less of the THP
performance gain is realised. The effect of this setting on CPU time
is negligible and can be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste memory. A lower
value can prevent THPs from being collapsed, resulting in fewer pages
being collapsed into THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across
multiple processes. Exceeding this number blocks the collapse::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase the memory footprint for some workloads.
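Tying a few of these knobs together: on a machine with memory to
spare, one might make khugepaged scan more memory per pass and sleep
less between passes. The values below are purely illustrative, not
recommendations::

	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
	echo 8192 >/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
	echo 1000 >/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs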
Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
on the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
	Attempt to allocate huge pages every time we need a new page;

never
	Do not allocate huge pages;

within_size
	Only allocate huge pages if they will be fully within i_size.
	Also respect fadvise()/madvise() hints;

advise
	Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount:
remounting with ``huge=never`` will not attempt to break up huge pages
at all, it just stops more from being allocated.
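For example, to create a tmpfs instance that allocates huge pages for
files large enough to fill them, and later stop further huge page
allocations on it (``/mnt/mytmpfs`` is an illustrative mount point)::

	mount -t tmpfs -o huge=within_size tmpfs /mnt/mytmpfs
	mount -o remount,huge=never /mnt/mytmpfs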
There's also a sysfs knob to control hugepage allocation policy for
the internal shmem mount:
/sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount is used
for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

deny
	For use in emergencies, to force the huge option off from
	all mounts;
force
	Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only
affect future behavior. So to make them effective you need to restart
any application that could have been using hugepages. This also
applies to the regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in
``/proc/meminfo``. To identify what applications are using anonymous
transparent huge pages, it is necessary to read ``/proc/PID/smaps``
and count the AnonHugePages fields for each mapping, as sketched
below.

The number of file transparent huge pages mapped to userspace is
available by reading the ShmemPmdMapped and ShmemHugePages fields in
``/proc/meminfo``. To identify what applications are mapping file
transparent huge pages, it is necessary to read ``/proc/PID/smaps``
and count the FilePmdMapped fields for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.
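A rough way to do this counting from the shell, for a process whose
PID is 1234 (an illustrative value), is::

	grep AnonHugePages /proc/meminfo
	awk '/AnonHugePages/ {sum += $2} END {print sum " kB"}' /proc/1234/smaps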
There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
	is incremented every time a huge page is successfully
	allocated to handle a page fault.

thp_collapse_alloc
	is incremented by khugepaged when it has found
	a range of pages to collapse into one huge page and has
	successfully allocated a new huge page to store the data.

thp_fault_fallback
	is incremented if a page fault fails to allocate
	a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
	is incremented if a page fault fails to charge a huge page and
	instead falls back to using small pages even though the
	allocation was successful.

thp_collapse_alloc_failed
	is incremented if khugepaged found a range
	of pages that should be collapsed into one huge page but failed
	the allocation.

thp_file_alloc
	is incremented every time a file huge page is successfully
	allocated.

thp_file_fallback
	is incremented if a file huge page is attempted to be allocated
	but fails and instead falls back to using small pages.

thp_file_fallback_charge
	is incremented if a file huge page cannot be charged and instead
	falls back to using small pages even though the allocation was
	successful.

thp_file_mapped
	is incremented every time a file huge page is mapped into
	user address space.

thp_split_page
	is incremented every time a huge page is split into base
	pages. This can happen for a variety of reasons but a common
	reason is that a huge page is old and is being reclaimed.
	This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
	is incremented if the kernel fails to split a huge
	page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
	is incremented when a huge page is put onto the split
	queue. This happens when a huge page is partially unmapped and
	splitting it would free up some memory. Pages on the split
	queue are going to be split under memory pressure.

thp_split_pmd
	is incremented every time a PMD is split into a table of PTEs.
	This can happen, for instance, when an application calls
	mprotect() or munmap() on part of a huge page. It doesn't split
	the huge page, only the page table entry.

thp_zero_page_alloc
	is incremented every time a huge zero page is
	successfully allocated. It includes allocations which were
	dropped due to a race with another allocation. Note, it doesn't
	count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
	is incremented if the kernel fails to allocate a
	huge zero page and falls back to using small pages.

thp_swpout
	is incremented every time a huge page is swapped out in one
	piece without splitting.

thp_swpout_fallback
	is incremented if a huge page has to be split before swapout,
	usually because the kernel failed to allocate some contiguous
	swap space for the huge page.
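All of these counters can be snapshotted at once, for example with::

	grep thp_ /proc/vmstat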
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
	is incremented every time a process stalls to run
	memory compaction so that a huge page is free for use.

compact_success
	is incremented if the system compacted memory and
	freed a huge page for use.

compact_fail
	is incremented if the system tries to compact memory
	but failed.

compact_pages_moved
	is incremented each time a page is moved. If
	this value is increasing rapidly, it implies that the system
	is copying a lot of data to satisfy the huge page allocation.
	It is possible that the cost of copying exceeds any savings
	from reduced TLB misses.

compact_pagemigrate_failed
	is incremented when the underlying mechanism
	for moving a page failed.

compact_blocks_moved
	is incremented each time memory compaction examines
	a huge page aligned range of pages.

It is possible to establish how long the stalls were using the
function tracer to record how long was spent in
__alloc_pages_nodemask and using the mm_page_alloc tracepoint to
identify which allocations were for huge pages; a minimal sketch is
given at the end of this document.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmapped region has to be naturally hugepage
aligned. posix_memalign() can provide that guarantee.

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.
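As referenced above, a minimal function tracer sketch for measuring
huge page allocation stalls might look like this. It assumes tracefs
is mounted at /sys/kernel/tracing and that the kernel still exposes
__alloc_pages_nodemask (newer kernels have renamed this function, so
adjust the name to match the running kernel)::

	cd /sys/kernel/tracing
	echo __alloc_pages_nodemask > set_graph_function
	echo function_graph > current_tracer
	echo 1 > events/kmem/mm_page_alloc/enable
	cat trace_pipe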