.. _idle_page_tracking:

==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature makes it possible to track which memory pages
are being accessed by a workload and which are idle. This information can be
useful for estimating the workload's working set size, which, in turn, can be
taken into account when configuring the workload parameters, setting memory
cgroup limits, or deciding where to place the workload within a compute
cluster.

The feature is enabled by ``CONFIG_IDLE_PAGE_TRACKING=y``.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64; the byte order is native. When a
bit is set, the corresponding page is idle.

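The mapping from a PFN to its bit can be sketched as follows. This is a minimal
illustration, not kernel or tool code; the helper names ``bitmap_position`` and
``is_idle`` are hypothetical.

```python
import struct

WORD_BYTES = 8      # each array element is an 8-byte integer
BITS_PER_WORD = 64  # so each element covers 64 consecutive PFNs


def bitmap_position(pfn):
    """Return (file_offset, bit) locating the idle bit of page frame `pfn`."""
    word_index = pfn // BITS_PER_WORD
    return word_index * WORD_BYTES, pfn % BITS_PER_WORD


def is_idle(word_bytes, bit):
    """Test one bit of an 8-byte element read from the bitmap.

    "=Q" decodes an unsigned 64-bit integer in native byte order.
    """
    (word,) = struct.unpack("=Q", word_bytes)
    return bool((word >> bit) & 1)
```

For example, PFN 130 lives in array element 2, which starts at byte offset 16,
and is represented by bit 2 of that element.
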
A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle, one has to set the bit corresponding to the page by
writing to the file. A value written to the file is OR-ed with the current
bitmap value, so writing a zero bit leaves the state of the corresponding page
unchanged.

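A minimal sketch of marking pages idle, assuming the caller already knows the
PFNs of interest and has permission to write the file (typically root). The
helper names ``idle_words`` and ``mark_idle`` are hypothetical. Because written
values are OR-ed into the bitmap, each 8-byte word only needs the bits of the
pages to be marked.

```python
import os
import struct

BITMAP_PATH = "/sys/kernel/mm/page_idle/bitmap"


def idle_words(pfns):
    """Group PFNs into {file_offset: 8-byte word} with the matching bits set."""
    words = {}
    for pfn in pfns:
        offset = (pfn // 64) * 8
        words[offset] = words.get(offset, 0) | (1 << (pfn % 64))
    # "=Q" packs an unsigned 64-bit integer in native byte order.
    return {offset: struct.pack("=Q", word) for offset, word in words.items()}


def mark_idle(pfns, path=BITMAP_PATH):
    """Set the idle bit of every PFN in `pfns` (needs a kernel with the feature)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset, word in sorted(idle_words(pfns).items()):
            # Every write is exactly 8 bytes long and starts on an 8-byte boundary.
            os.pwrite(fd, word, offset)
    finally:
        os.close(fd)
```
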
Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

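One possible way to account for huge pages is sketched below. It assumes the
caller has already read the ``/proc/kpageflags`` words and the idle bits for a
run of consecutive PFNs that starts at or before the head page of any compound
page it contains. The function name ``count_idle_pages`` is hypothetical; the
flag bit numbers are the ones documented for ``/proc/kpageflags``.

```python
KPF_COMPOUND_HEAD = 1 << 15  # bit 15 of a /proc/kpageflags entry
KPF_COMPOUND_TAIL = 1 << 16  # bit 16 of a /proc/kpageflags entry


def count_idle_pages(kpageflags, idle_bits):
    """Count idle pages over consecutive PFNs.

    `kpageflags` and `idle_bits` are parallel sequences, one item per PFN.
    Tail pages inherit the idle state of the most recent head page, because
    the idle flag is only ever set on the head.
    """
    idle_count = 0
    head_idle = False
    for flags, idle in zip(kpageflags, idle_bits):
        if flags & KPF_COMPOUND_TAIL:
            idle = head_idle
        elif flags & KPF_COMPOUND_HEAD:
            head_idle = idle
        idle_count += bool(idle)
    return idle_count
```
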
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if the read/write does not start on an 8-byte boundary, or if the size
of the read/write is not a multiple of 8 bytes. Writing to this file beyond
max PFN will return -ENXIO.

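To satisfy the alignment rules, a request for an arbitrary PFN range can be
widened to whole 8-byte words. The helper name ``aligned_span`` below is
hypothetical, and the sketch only computes the offset and length; it does not
touch the file.

```python
def aligned_span(first_pfn, last_pfn):
    """Return (offset, length) in bytes covering PFNs first_pfn..last_pfn.

    Both values are multiples of 8, so a read or write using them starts on
    an 8-byte boundary and has a size that is a multiple of 8 bytes.
    """
    first_word = first_pfn // 64
    last_word = last_pfn // 64
    return first_word * 8, (last_word - first_word + 1) * 8
```

For instance, PFNs 70 through 130 fall into array elements 1 and 2, so the
access starts at offset 8 and spans 16 bytes.
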
To estimate the number of pages that are not used by a workload, one should:

 1. Mark all the workload's pages as idle by setting corresponding bits in
    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
    is placed in a memory cgroup.

 2. Wait until the workload accesses its working set.

 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
    To ignore certain types of pages, e.g. mlocked pages since they are not
    reclaimable, one can filter them out using ``/proc/kpageflags``.

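The steps above can be sketched for a single process as follows. This is an
illustrative outline under several assumptions, not a reference tool: it covers
one virtual address range chosen by the caller, it needs enough privilege to
see PFNs in ``/proc/pid/pagemap`` and to access the bitmap, and it ignores huge
pages and page-type filtering. All function names are hypothetical. The pagemap
layout used here (bit 63 for "present", bits 0-54 for the PFN) is the one
described in the pagemap documentation.

```python
import os
import struct
import time

BITMAP = "/sys/kernel/mm/page_idle/bitmap"
PM_PRESENT = 1 << 63         # pagemap entry: page is present in RAM
PM_PFN_MASK = (1 << 55) - 1  # pagemap entry: bits 0-54 hold the PFN


def pfns_from_pagemap(raw):
    """Extract PFNs of present pages from raw /proc/pid/pagemap bytes."""
    count = len(raw) // 8
    entries = struct.unpack(f"={count}Q", raw[: count * 8])
    # A zero PFN is what an unprivileged reader gets, so skip it.
    return [e & PM_PFN_MASK for e in entries
            if e & PM_PRESENT and e & PM_PFN_MASK]


def words_for(pfns):
    """Map bitmap word index -> 64-bit mask with one bit per tracked PFN."""
    words = {}
    for pfn in pfns:
        words[pfn // 64] = words.get(pfn // 64, 0) | (1 << (pfn % 64))
    return words


def count_idle(masks, read_word):
    """Count tracked PFNs still idle; read_word(index) returns a 64-bit int."""
    return sum(bin(read_word(i) & mask).count("1") for i, mask in masks.items())


def estimate_idle(pid, start, end, wait_seconds):
    """Return how many pages mapped in [start, end) stayed idle."""
    page = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:  # step 1: find the pages
        f.seek(start // page * 8)
        raw = f.read((end - start) // page * 8)
    masks = words_for(pfns_from_pagemap(raw))
    fd = os.open(BITMAP, os.O_RDWR)
    try:
        for i, mask in masks.items():              # step 1: mark them idle
            os.pwrite(fd, struct.pack("=Q", mask), i * 8)
        time.sleep(wait_seconds)                   # step 2: let the workload run

        def read_word(i):
            return struct.unpack("=Q", os.pread(fd, 8, i * 8))[0]

        return count_idle(masks, read_word)        # step 3: count idle bits
    finally:
        os.close(fd)
```

Masking each word with the bits that were written keeps pages belonging to
other workloads out of the count.
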
The page-types tool in the tools/vm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages
have their idle flag cleared in the interim.

See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
``/proc/kpagecgroup``.

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space, in which case one or more PTEs it is mapped to will have the Accessed bit
set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
latter happens when:

 - a userspace process reads or writes a page using a system call (e.g. read(2)
   or write(2))

 - a page that is used for storing filesystem buffers is read or written,
   because a process needs filesystem metadata stored in it (e.g. when listing
   a directory tree)

 - a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>` section), and cleared automatically whenever a page
is referenced as defined above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.

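The interplay of the Accessed bit, the Idle flag, and the Young flag can be
illustrated with a toy model. It is a simplification written for this text and
does not mirror the kernel's data structures: a single boolean stands in for
the Accessed bits of all PTEs, and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    pte_accessed: bool = False  # stands in for the Accessed bit of all PTEs
    idle: bool = False          # the Idle page flag
    young: bool = False         # the Young page flag


def model_mark_idle(page):
    """Model of setting the Idle flag: clear Accessed, remember it in Young."""
    if page.pte_accessed:
        page.pte_accessed = False
        page.young = True
    page.idle = True


def touch(page):
    """Model of a userspace access through a mapping."""
    page.pte_accessed = True


def read_idle(page):
    """Model of updating the Idle flag when the bitmap is read."""
    if page.idle and page.pte_accessed:
        # The access is detected, so the page is no longer idle. The cleared
        # Accessed bit is preserved for the reclaimer in the Young flag.
        page.pte_accessed = False
        page.idle = False
        page.young = True
    return page.idle


def reclaimer_sees_referenced(page):
    """The reclaimer treats Young as an extra Accessed bit."""
    return page.pte_accessed or page.young
```

In this model a page that was recently accessed when it is marked idle still
looks referenced to the reclaimer, which is the purpose of the Young flag.
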
Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.