.. _admin_guide_ksm:

=======================
Kernel Samepage Merging
=======================

Overview
========

KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32.  See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and https://lwn.net/Articles/330589/.

KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
by sharing the data common between them.  But it can be useful to any
application which generates many instances of the same data.

The KSM daemon ksmd periodically scans those areas of user memory
which have been registered with it, looking for pages of identical
content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The number of pages that the KSM daemon scans in a single pass
and the time between the passes are configured using the
:ref:`sysfs interface <ksm_sysfs>`.

KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
be swapped out just like other user pages (but sharing is broken when they
are swapped back in: ksmd must rediscover their identity and merge again).

Controlling KSM with madvise
============================

KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
system call::

	int madvise(addr, length, MADV_MERGEABLE)

The application may call

::

	int madvise(addr, length, MADV_UNMERGEABLE)

to cancel that advice and restore unshared pages: whereupon KSM
unmerges whatever it merged in that range.  Note: this unmerging call
may suddenly require more memory than is available - possibly failing
with EAGAIN, but more probably invoking the Out-Of-Memory killer.

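The two calls above can be sketched from user space as follows. This is
only an illustration, assuming Linux and Python 3.8 or later, where the
standard ``mmap`` module exposes ``madvise()`` and the ``MADV_MERGEABLE``
and ``MADV_UNMERGEABLE`` constants; the helper names ``new_private_region``
and ``advise_ksm`` are made up for this example.

```python
import mmap


def new_private_region(pages):
    """Create an anonymous private mapping, the only kind of memory KSM merges."""
    return mmap.mmap(-1, pages * mmap.PAGESIZE,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def advise_ksm(region, mergeable=True):
    """Apply MADV_MERGEABLE or MADV_UNMERGEABLE to the whole mapping.

    Returns True on success and False if the kernel rejected the advice,
    for example with EINVAL when KSM is not configured.
    """
    advice = mmap.MADV_MERGEABLE if mergeable else mmap.MADV_UNMERGEABLE
    try:
        region.madvise(advice)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    region = new_private_region(16)
    print("registered:", advise_ksm(region, mergeable=True))
    # Cancel the advice again; any merged pages in the range are unmerged.
    print("unregistered:", advise_ksm(region, mergeable=False))
    region.close()
```

Registering a range does not by itself merge anything: merging only
happens once ksmd is running and finds identical pages in the range.
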
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL.  If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.

If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
will exceed ``vm.max_map_count`` (see Documentation/admin-guide/sysctl/vm.rst).

Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
and might fail with EAGAIN if there is not enough memory for internal
structures.

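The error cases described above can be told apart by inspecting
``errno``. The sketch below assumes Linux and Python 3.8 or later; the
table ``KSM_MADVISE_ERRORS`` and the function ``try_mergeable`` are
hypothetical names, and the explanations only restate this document
rather than listing every possible madvise(2) failure.

```python
import errno
import mmap

# Explanations taken from this document; not an exhaustive list.
KSM_MADVISE_ERRORS = {
    errno.EINVAL: "KSM is not configured into the running kernel",
    errno.ENOMEM: ("range includes unmapped gaps, or splitting the region "
                   "would exceed vm.max_map_count"),
    errno.EAGAIN: "not enough memory for internal structures or unmerging",
}


def try_mergeable(region):
    """Return None on success, otherwise a short explanation of the failure."""
    try:
        region.madvise(mmap.MADV_MERGEABLE)
    except OSError as exc:
        return KSM_MADVISE_ERRORS.get(exc.errno, "unexpected error: %s" % exc)
    return None


if __name__ == "__main__":
    region = mmap.mmap(-1, 8 * mmap.PAGESIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    problem = try_mergeable(region)
    print("ok" if problem is None else "madvise failed: " + problem)
    region.close()
```
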
Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit.  KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason.

.. _ksm_sysfs:

KSM daemon sysfs interface
==========================

The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
readable by all but writable only by root:

pages_to_scan
        how many pages to scan before ksmd goes to sleep
        e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.

        Default: 100 (chosen for demonstration purposes)

sleep_millisecs
        how many milliseconds ksmd should sleep before next scan
        e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``.

        Default: 20 (chosen for demonstration purposes)

merge_across_nodes
        specifies if pages from different NUMA nodes can be merged.
        When set to 0, KSM merges only pages which physically reside
        in the memory area of the same NUMA node. That brings lower
        latency to access of shared pages. Systems with more nodes, at
        significant NUMA distances, are likely to benefit from the
        lower latency of setting 0. Smaller systems, which need to
        minimize memory usage, are likely to benefit from the greater
        sharing of setting 1 (default). You may wish to compare how
        your system performs under each setting, before deciding on
        which to use. The ``merge_across_nodes`` setting can be changed
        only when there are no KSM shared pages in the system: set
        ``run`` to 2 to unmerge pages first, then to 1 after changing
        ``merge_across_nodes``, to remerge according to the new setting.

        Default: 1 (merging across nodes as in earlier releases)

run
        * set to 0 to stop ksmd from running but keep merged pages,
        * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
        * set to 2 to stop ksmd and unmerge all pages currently merged, but
          leave mergeable areas registered for next run.

        Default: 0 (must be changed to 1 to activate KSM, except if
        CONFIG_SYSFS is disabled)

use_zero_pages
        specifies whether empty pages (i.e. allocated pages that only
        contain zeroes) should be treated specially.  When set to 1,
        empty pages are merged with the kernel zero page(s) instead of
        with each other as would happen normally. This can improve
        the performance on architectures with coloured zero pages,
        depending on the workload. Care should be taken when enabling
        this setting, as it can potentially degrade the performance of
        KSM for some workloads, for example if the checksums of
        candidate pages for merging match the checksum of an empty
        page. This setting can be changed at any time; it is only
        effective for pages merged after the change.

        Default: 0 (normal KSM behaviour as in earlier releases)

max_page_sharing
        Maximum sharing allowed for each KSM page. This enforces a
        deduplication limit to avoid high latency for virtual memory
        operations that involve traversal of the virtual mappings that
        share the KSM page. The minimum value is 2 as a newly created
        KSM page will have at least two sharers. The higher this value,
        the faster KSM will merge the memory and the higher the
        deduplication factor will be, but the slower the worst case
        virtual mappings traversal could be for any given KSM
        page. Slowing down this traversal means there will be higher
        latency for certain virtual memory operations happening during
        swapping, compaction, NUMA balancing and page migration, in
        turn decreasing responsiveness for the caller of those virtual
        memory operations. The scheduler latency of other tasks not
        involved with the VM operations doing the virtual mappings
        traversal is not affected by this parameter as these
        traversals are always schedule friendly themselves.

stable_node_chains_prune_millisecs
        specifies how frequently KSM checks the metadata of the pages
        that hit the deduplication limit for stale information.
        Smaller millisecs values will free up the KSM metadata with
        lower latency, but they will make ksmd use more CPU during the
        scan. It is a no-op if no KSM page has hit the
        ``max_page_sharing`` limit yet.

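The tunables above are plain integer files, so they can be scripted. The
following is a minimal sketch, assuming Python 3 and root privileges for
writes. The helper names (``read_tunable``, ``write_tunable``,
``max_scan_rate``, ``set_merge_across_nodes``) and the ``ksm_dir``
parameter are illustrative and not part of any kernel interface.
``max_scan_rate`` is only an upper bound, because it ignores the time
ksmd spends on the scan itself.

```python
import os

KSM_DIR = "/sys/kernel/mm/ksm"


def read_tunable(name, ksm_dir=KSM_DIR):
    """Read one integer tunable, e.g. read_tunable("pages_to_scan")."""
    with open(os.path.join(ksm_dir, name)) as f:
        return int(f.read())


def write_tunable(name, value, ksm_dir=KSM_DIR):
    """Write one integer tunable; needs root on a real system."""
    with open(os.path.join(ksm_dir, name), "w") as f:
        f.write("%d\n" % value)


def max_scan_rate(ksm_dir=KSM_DIR):
    """Upper bound on pages scanned per second, from the two scan tunables."""
    pages = read_tunable("pages_to_scan", ksm_dir)
    sleep_ms = read_tunable("sleep_millisecs", ksm_dir)
    if sleep_ms == 0:
        return float("inf")  # ksmd does not sleep between batches
    return pages * 1000.0 / sleep_ms


def set_merge_across_nodes(value, ksm_dir=KSM_DIR):
    """Change merge_across_nodes using the sequence described above."""
    write_tunable("run", 2, ksm_dir)  # stop ksmd and unmerge all pages
    write_tunable("merge_across_nodes", value, ksm_dir)
    write_tunable("run", 1, ksm_dir)  # restart ksmd with the new setting
```

With the defaults of 100 pages and 20 milliseconds, ``max_scan_rate()``
evaluates to 5000 pages per second. Note that the unmerge step in
``set_merge_across_nodes`` may need a lot of memory, as with
MADV_UNMERGEABLE.
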
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

pages_shared
        how many shared pages are being used
pages_sharing
        how many more sites are sharing them i.e. how much saved
pages_unshared
        how many pages unique but repeatedly checked for merging
pages_volatile
        how many pages changing too fast to be placed in a tree
full_scans
        how many times all mergeable areas have been scanned
stable_node_chains
        the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
        number of duplicated KSM pages

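These counters can be collected in one pass. The sketch below assumes
Python 3; ``read_ksm_stats`` and ``estimated_saved_bytes`` are
hypothetical helper names. The saving estimate simply multiplies
``pages_sharing`` by the page size, following the description of that
counter above, and does not account for KSM's own metadata.

```python
import os

KSM_DIR = "/sys/kernel/mm/ksm"
STAT_FILES = (
    "pages_shared", "pages_sharing", "pages_unshared", "pages_volatile",
    "full_scans", "stable_node_chains", "stable_node_dups",
)


def read_ksm_stats(ksm_dir=KSM_DIR):
    """Return a dict of the KSM counters that exist under ksm_dir."""
    stats = {}
    for name in STAT_FILES:
        try:
            with open(os.path.join(ksm_dir, name)) as f:
                stats[name] = int(f.read())
        except FileNotFoundError:
            continue  # counter not provided by this kernel
    return stats


def estimated_saved_bytes(stats, page_size=None):
    """Rough memory saving: one page for every additional sharing site."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return stats.get("pages_sharing", 0) * page_size


if __name__ == "__main__":
    stats = read_ksm_stats()
    print(stats)
    print("approx. bytes saved:", estimated_saved_bytes(stats))
```
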
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
indicates wasted effort.  ``pages_volatile`` embraces several
different kinds of activity, but a high proportion there would also
indicate poor use of madvise MADV_MERGEABLE.

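As a worked illustration of the two ratios just described, the sketch
below computes them from a dictionary of counter values. The function
name ``sharing_ratios`` and the sample numbers are invented for this
example; what counts as "high" is left to the administrator.

```python
def sharing_ratios(stats):
    """Compute the two KSM effectiveness ratios from counter values.

    A ratio is None when its denominator is zero, for instance before
    ksmd has merged anything.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        # Higher is better: many sites share each KSM page.
        "sharing_per_shared": ratio(stats["pages_sharing"],
                                    stats["pages_shared"]),
        # Higher is worse: many pages are scanned without being merged.
        "unshared_per_sharing": ratio(stats["pages_unshared"],
                                      stats["pages_sharing"]),
    }


if __name__ == "__main__":
    sample = {"pages_shared": 100, "pages_sharing": 900,
              "pages_unshared": 450}
    print(sharing_ratios(sample))
```

For the sample values, each shared page is used by 9 additional sites
on average, and there is one unshared page for every two sharing sites.
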
The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio, ``max_page_sharing`` must
be increased accordingly.

--
Izik Eidus,
Hugh Dickins, 17 Nov 2009