.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means
of using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes, without the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and
tmpfs/shmem, but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

The reason applications run faster is because of two factors. The
first factor is almost completely irrelevant and not of significant
interest, because it also has the downside of requiring larger
clear-page and copy-page operations in page faults, which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland
(thus reducing the enter/exit kernel frequency by a factor of
512). This only matters the first time the memory is accessed for the
lifetime of a memory mapping. The second, long lasting and much more
important factor will affect all subsequent accesses to the memory
for the whole runtime of the application. The second factor consists
of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB entries can cover the
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, just because the TLB miss is going to
   run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.
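
As an illustration of the per-task control, a process can opt out of
THP for its whole address space with prctl(2) and the
PR_SET_THP_DISABLE flag. A minimal sketch, with error handling
abbreviated::

	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		/* Disable THP for all current and future mappings of this task. */
		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
			perror("prctl(PR_SET_THP_DISABLE)");

		/* PR_GET_THP_DISABLE returns 1 once THP is disabled. */
		printf("THP disabled: %d\n",
		       prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
		return 0;
	}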

Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or for other movable (or even
unmovable) entities. It doesn't require reservation to prevent
hugepage allocation failures from being noticeable from userland. It
allows paging and all other advanced VM features to be available on
the hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, an
application may end up allocating more memory resources. An
application may mmap a large region but only touch 1 byte of it; in
that case a 2M page might be allocated instead of a 4k page for no
good reason. This is why it's possible to disable hugepages
system-wide and to only have them inside MADV_HUGEPAGE madvise
regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting precious bytes of memory and to only
run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
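
A minimal sketch of such a hint on an anonymous mapping follows; the
64M region size is arbitrary and assumed to be a multiple of the 2M
hugepage size used in this document's examples::

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64UL << 20;	/* 64M, a multiple of 2M */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Ask the kernel to back this range with hugepages. */
		if (madvise(buf, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");
		return 0;
	}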

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::

	echo always >/sys/kernel/mm/transparent_hugepage/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the VM's defrag efforts to generating
anonymous hugepages, in case they're not immediately free, to madvise
regions only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
by using hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

::

	echo always >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
	means that an application requesting THP will stall on
	allocation failure and directly reclaim pages and compact
	memory in an effort to allocate a THP immediately. This may be
	desirable for virtual machines that benefit heavily from THP
	use and are willing to delay the VM start to utilise them.

defer
	means that an application will wake kswapd in the background
	to reclaim pages and wake kcompactd to compact memory so that
	THP is available in the near future. It's the responsibility
	of khugepaged to then install the THP pages later.

defer+madvise
	will enter direct reclaim and compaction like ``always``, but
	only for regions that have used madvise(MADV_HUGEPAGE); all
	other regions will wake kswapd in the background to reclaim
	pages and wake kcompactd to compact memory so that THP is
	available in the near future.

madvise
	will enter direct reclaim like ``always`` but only for regions
	that have used madvise(MADV_HUGEPAGE). This is the default
	behaviour.

never
	should be self-explanatory.

By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

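A minimal C sketch of reading this value at runtime, using the sysfs
file shown above::

	#include <stdio.h>

	int main(void)
	{
		unsigned long hpage_size;
		FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fscanf(f, "%lu", &hpage_size) != 1) {
			fprintf(stderr, "unexpected file contents\n");
			return 1;
		}
		fclose(f);
		printf("THP size: %lu bytes\n", hpage_size);
		return 0;
	}
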
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
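
These tunables can also be set programmatically (with root
privileges). A sketch of a small helper follows; the helper name
set_khugepaged_tunable and the example values are illustrative, but
the tunable names and paths are the ones listed above::

	#include <stdio.h>

	/* Write an integer value to a khugepaged sysfs tunable. */
	static int set_khugepaged_tunable(const char *name, unsigned long val)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/khugepaged/%s",
			 name);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%lu\n", val);
		return fclose(f);
	}

	int main(void)
	{
		/* Example values only; tune for your workload. */
		if (set_khugepaged_tunable("pages_to_scan", 4096) ||
		    set_khugepaged_tunable("scan_sleep_millisecs", 10000))
			perror("khugepaged tunable");
		return 0;
	}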

The khugepaged progress can be seen in the number of pages collapsed::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full passes made over memory::

	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means programs may end up using additional memory. A
lower value reduces the THP performance gain. The CPU overhead of
max_ptes_none itself is negligible and can be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding this number blocks the collapse::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate huge pages if they will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
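
A minimal sketch of passing ``huge=`` programmatically via mount(2);
the ``/mnt/thp`` mountpoint is hypothetical and must already exist,
and the call requires root privileges::

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* Mount a tmpfs that always attempts huge page allocation. */
		if (mount("tmpfs", "/mnt/thp", "tmpfs", 0, "huge=always")) {
			perror("mount");
			return 1;
		}
		return 0;
	}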

There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;
force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and the tmpfs mount option only
affect future behavior. So to make them effective you need to restart
any application that could have been using hugepages. This also applies
to the regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping.
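
A minimal sketch of such per-process accounting, summing the
AnonHugePages fields; ``/proc/self/smaps`` is used for brevity,
substitute the PID of interest::

	#include <stdio.h>

	int main(void)
	{
		char line[256];
		unsigned long kb, total_kb = 0;
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Sum the AnonHugePages value of every mapping. */
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
				total_kb += kb;
		fclose(f);
		printf("AnonHugePages total: %lu kB\n", total_kb);
		return 0;
	}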

The number of file transparent huge pages mapped to userspace is available
by reading the ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.
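
As an illustration, a minimal sketch that dumps just the THP-related
counters; every ``/proc/vmstat`` line is a name/value pair, so simple
prefix matching on ``thp_`` is enough::

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char name[64];
		unsigned long value;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Print only the counters whose names start with "thp_". */
		while (fscanf(f, "%63s %lu", name, &value) == 2)
			if (!strncmp(name, "thp_", 4))
				printf("%s = %lu\n", name, value);
		fclose(f);
		return 0;
	}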

thp_fault_alloc
	is incremented every time a huge page is successfully
	allocated to handle a page fault.

thp_collapse_alloc
	is incremented by khugepaged when it has found
	a range of pages to collapse into one huge page and has
	successfully allocated a new huge page to store the data.

thp_fault_fallback
	is incremented if a page fault fails to allocate
	a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
	is incremented if a page fault fails to charge a huge page and
	instead falls back to using small pages even though the
	allocation was successful.

thp_collapse_alloc_failed
	is incremented if khugepaged found a range
	of pages that should be collapsed into one huge page but failed
	the allocation.

thp_file_alloc
	is incremented every time a file huge page is successfully
	allocated.

thp_file_fallback
	is incremented if a file huge page is attempted to be allocated
	but fails and instead falls back to using small pages.

thp_file_fallback_charge
	is incremented if a file huge page cannot be charged and instead
	falls back to using small pages even though the allocation was
	successful.

thp_file_mapped
	is incremented every time a file huge page is mapped into
	user address space.

thp_split_page
	is incremented every time a huge page is split into base
	pages. This can happen for a variety of reasons but a common
	reason is that a huge page is old and is being reclaimed.
	This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
	is incremented if the kernel fails to split a huge
	page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
	is incremented when a huge page is put onto the split
	queue. This happens when a huge page is partially unmapped and
	splitting it would free up some memory. Pages on the split queue
	are going to be split under memory pressure.

thp_split_pmd
	is incremented every time a PMD is split into a table of PTEs.
	This can happen, for instance, when an application calls
	mprotect() or munmap() on part of a huge page. It doesn't split
	the huge page, only the page table entry.

thp_zero_page_alloc
	is incremented every time a huge zero page is
	successfully allocated. It includes allocations which were
	dropped due to a race with other allocations. Note, it doesn't
	count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
	is incremented if the kernel fails to allocate a
	huge zero page and falls back to using small pages.

thp_swpout
	is incremented every time a huge page is swapped out in one
	piece without splitting.

thp_swpout_fallback
	is incremented if a huge page has to be split before swapout,
	usually because the kernel failed to allocate contiguous swap
	space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
	is incremented every time a process stalls to run
	memory compaction so that a huge page is free for use.

compact_success
	is incremented if the system compacted memory and
	freed a huge page for use.

compact_fail
	is incremented if the system tries to compact memory
	but failed.

compact_pages_moved
	is incremented each time a page is moved. If
	this value is increasing rapidly, it implies that the system
	is copying a lot of data to satisfy the huge page allocation.
	It is possible that the cost of copying exceeds any savings
	from reduced TLB misses.

compact_pagemigrate_failed
	is incremented when the underlying mechanism
	for moving a page failed.

compact_blocks_moved
	is incremented each time memory compaction examines
	a huge page aligned range of pages.

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be naturally hugepage
aligned. posix_memalign() can provide that guarantee.
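
A minimal sketch combining an aligned allocation with the
MADV_HUGEPAGE hint; the 2M alignment is this document's example
hugepage size (portable code should read hpage_pmd_size as shown
earlier instead of hard-coding it)::

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t align = 2UL << 20;	/* example 2M hugepage size */
		size_t len = 32 * align;
		void *buf;
		int err;

		/* Hugepage-aligned region, eligible for an immediate 2M mapping. */
		err = posix_memalign(&buf, align, len);
		if (err) {
			fprintf(stderr, "posix_memalign: error %d\n", err);
			return 1;
		}
		if (madvise(buf, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");
		free(buf);
		return 0;
	}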

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.