.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means
of using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes, without the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and
tmpfs/shmem, but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

The reason applications run faster is because of two factors. The
first factor is almost completely irrelevant and not of significant
interest, because it also has the downside of requiring larger
clear-page and copy-page operations in page faults, which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland
(thus reducing the enter/exit kernel frequency by a factor of
512). This only matters the first time the memory is accessed for the
lifetime of a memory mapping. The second, long lasting and much more
important factor will affect all subsequent accesses to the memory
for the whole runtime of the application. The second factor consists
of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB entries can cover the
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, just because the TLB miss is going to
   run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.
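
As an illustration of the per-task control, a process can opt out of
THP for its whole address space with prctl(2) and the
PR_SET_THP_DISABLE flag. A minimal sketch, with error handling
abbreviated::

	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		/* Disable THP for all current and future mappings of this task. */
		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
			perror("prctl(PR_SET_THP_DISABLE)");

		/* PR_GET_THP_DISABLE returns 1 once THP is disabled. */
		printf("THP disabled: %d\n",
		       prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
		return 0;
	}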

Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or for other movable (or even
unmovable) entities. It doesn't require reservation to prevent
hugepage allocation failures from being noticeable from userland. It
allows paging and all other advanced VM features to be available on
the hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, an
application may end up allocating more memory resources. An
application may mmap a large region but only touch 1 byte of it; in
that case a 2M page might be allocated instead of a 4k page for no
good reason. This is why it's possible to disable hugepages
system-wide and to only have them inside MADV_HUGEPAGE madvise
regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting precious bytes of memory and to only
run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
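
A minimal sketch of such a hint on an anonymous mapping follows; the
64M region size is arbitrary and assumed to be a multiple of the 2M
hugepage size used in this document's examples::

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64UL << 20;	/* 64M, a multiple of 2M */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Ask the kernel to back this range with hugepages. */
		if (madvise(buf, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");
		return 0;
	}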

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::

	echo always >/sys/kernel/mm/transparent_hugepage/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the VM's defrag efforts to generating
anonymous hugepages, in case they're not immediately free, to madvise
regions only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
by using hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

::

	echo always >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
	means that an application requesting THP will stall on
	allocation failure and directly reclaim pages and compact
	memory in an effort to allocate a THP immediately. This may be
	desirable for virtual machines that benefit heavily from THP
	use and are willing to delay the VM start to utilise them.

defer
	means that an application will wake kswapd in the background
	to reclaim pages and wake kcompactd to compact memory so that
	THP is available in the near future. It's the responsibility
	of khugepaged to then install the THP pages later.

defer+madvise
	will enter direct reclaim and compaction like ``always``, but
	only for regions that have used madvise(MADV_HUGEPAGE); all
	other regions will wake kswapd in the background to reclaim
	pages and wake kcompactd to compact memory so that THP is
	available in the near future.

madvise
	will enter direct reclaim like ``always`` but only for regions
	that have used madvise(MADV_HUGEPAGE). This is the default
	behaviour.

never
	should be self-explanatory.

By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

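A minimal C sketch of reading this value at runtime, using the sysfs
file shown above::

	#include <stdio.h>

	int main(void)
	{
		unsigned long hpage_size;
		FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fscanf(f, "%lu", &hpage_size) != 1) {
			fprintf(stderr, "unexpected file contents\n");
			return 1;
		}
		fclose(f);
		printf("THP size: %lu bytes\n", hpage_size);
		return 0;
	}
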
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
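
These tunables can also be set programmatically (with root
privileges). A sketch of a small helper follows; the helper name
set_khugepaged_tunable and the example values are illustrative, but
the tunable names and paths are the ones listed above::

	#include <stdio.h>

	/* Write an integer value to a khugepaged sysfs tunable. */
	static int set_khugepaged_tunable(const char *name, unsigned long val)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/khugepaged/%s",
			 name);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%lu\n", val);
		return fclose(f);
	}

	int main(void)
	{
		/* Example values only; tune for your workload. */
		if (set_khugepaged_tunable("pages_to_scan", 4096) ||
		    set_khugepaged_tunable("scan_sleep_millisecs", 10000))
			perror("khugepaged tunable");
		return 0;
	}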

The khugepaged progress can be seen in the number of pages collapsed::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full passes made over memory::

	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means programs may end up using additional memory. A
lower value reduces the THP performance gain. The CPU overhead of
max_ptes_none itself is negligible and can be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding this number blocks the collapse::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate huge pages if they will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
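
A minimal sketch of passing ``huge=`` programmatically via mount(2);
the ``/mnt/thp`` mountpoint is hypothetical and must already exist,
and the call requires root privileges::

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* Mount a tmpfs that always attempts huge page allocation. */
		if (mount("tmpfs", "/mnt/thp", "tmpfs", 0, "huge=always")) {
			perror("mount");
			return 1;
		}
		return 0;
	}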

There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.

In addition to the policies listed above, shmem_enabled allows two
further values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;
force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and the tmpfs mount option only
affect future behavior. So to make them effective you need to restart
any application that could have been using hugepages. This also applies
to the regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping.
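
A minimal sketch of such per-process accounting, summing the
AnonHugePages fields; ``/proc/self/smaps`` is used for brevity,
substitute the PID of interest::

	#include <stdio.h>

	int main(void)
	{
		char line[256];
		unsigned long kb, total_kb = 0;
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Sum the AnonHugePages value of every mapping. */
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
				total_kb += kb;
		fclose(f);
		printf("AnonHugePages total: %lu kB\n", total_kb);
		return 0;
	}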

The number of file transparent huge pages mapped to userspace is available
by reading the ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.
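
As an illustration, a minimal sketch that dumps just the THP-related
counters; every ``/proc/vmstat`` line is a name/value pair, so simple
prefix matching on ``thp_`` is enough::

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char name[64];
		unsigned long value;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Print only the counters whose names start with "thp_". */
		while (fscanf(f, "%63s %lu", name, &value) == 2)
			if (!strncmp(name, "thp_", 4))
				printf("%s = %lu\n", name, value);
		fclose(f);
		return 0;
	}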

thp_fault_alloc
	is incremented every time a huge page is successfully
	allocated to handle a page fault.

thp_collapse_alloc
	is incremented by khugepaged when it has found
	a range of pages to collapse into one huge page and has
	successfully allocated a new huge page to store the data.

thp_fault_fallback
	is incremented if a page fault fails to allocate
	a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
	is incremented if a page fault fails to charge a huge page and
	instead falls back to using small pages even though the
	allocation was successful.

thp_collapse_alloc_failed
	is incremented if khugepaged found a range
	of pages that should be collapsed into one huge page but failed
	the allocation.

thp_file_alloc
	is incremented every time a file huge page is successfully
	allocated.

thp_file_fallback
	is incremented if a file huge page is attempted to be allocated
	but fails and instead falls back to using small pages.

thp_file_fallback_charge
	is incremented if a file huge page cannot be charged and instead
	falls back to using small pages even though the allocation was
	successful.

thp_file_mapped
	is incremented every time a file huge page is mapped into
	user address space.

thp_split_page
	is incremented every time a huge page is split into base
	pages. This can happen for a variety of reasons but a common
	reason is that a huge page is old and is being reclaimed.
	This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
	is incremented if the kernel fails to split a huge
	page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
	is incremented when a huge page is put onto the split
	queue. This happens when a huge page is partially unmapped and
	splitting it would free up some memory. Pages on the split queue
	are going to be split under memory pressure.

thp_split_pmd
	is incremented every time a PMD is split into a table of PTEs.
	This can happen, for instance, when an application calls
	mprotect() or munmap() on part of a huge page. It doesn't split
	the huge page, only the page table entry.

thp_zero_page_alloc
	is incremented every time a huge zero page is
	successfully allocated. It includes allocations which were
	dropped due to a race with other allocations. Note, it doesn't
	count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
	is incremented if the kernel fails to allocate a
	huge zero page and falls back to using small pages.

thp_swpout
	is incremented every time a huge page is swapped out in one
	piece without splitting.

thp_swpout_fallback
	is incremented if a huge page has to be split before swapout,
	usually because the kernel failed to allocate contiguous swap
	space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
	is incremented every time a process stalls to run
	memory compaction so that a huge page is free for use.

compact_success
	is incremented if the system compacted memory and
	freed a huge page for use.

compact_fail
	is incremented if the system tries to compact memory
	but failed.

compact_pages_moved
	is incremented each time a page is moved. If
	this value is increasing rapidly, it implies that the system
	is copying a lot of data to satisfy the huge page allocation.
	It is possible that the cost of copying exceeds any savings
	from reduced TLB misses.

compact_pagemigrate_failed
	is incremented when the underlying mechanism
	for moving a page failed.

compact_blocks_moved
	is incremented each time memory compaction examines
	a huge page aligned range of pages.

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be naturally hugepage
aligned. posix_memalign() can provide that guarantee.
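
A minimal sketch combining an aligned allocation with the
MADV_HUGEPAGE hint; the 2M alignment is this document's example
hugepage size (portable code should read hpage_pmd_size as shown
earlier instead of hard-coding it)::

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t align = 2UL << 20;	/* example 2M hugepage size */
		size_t len = 32 * align;
		void *buf;
		int err;

		/* Hugepage-aligned region, eligible for an immediate 2M mapping. */
		err = posix_memalign(&buf, align, len);
		if (err) {
			fprintf(stderr, "posix_memalign: error %d\n", err);
			return 1;
		}
		if (madvise(buf, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");
		free(buf);
		return 0;
	}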

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.