.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over
the years, gaining more and more functionality to support a variety of
systems, from MMU-less microcontrollers to supercomputers. The memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will
eventually be written. Yet, although some of the concepts are the same,
here we assume that an MMU is available and a CPU can translate a
virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex and
to avoid this complexity the concept of virtual memory was developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an appropriate
kernel configuration option.

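The base page size of the running kernel can be queried from user
space through the standard sysconf(3) interface; a minimal sketch::

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* _SC_PAGESIZE reports the base page size of the running kernel */
          long page_size = sysconf(_SC_PAGESIZE);

          printf("base page size: %ld bytes\n", page_size);
          return 0;
  }
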
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index into
that level's page table. The lowest bits of the virtual address define
the offset inside the actual page.

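To make the translation scheme concrete, the sketch below splits a
virtual address into the per-level table indexes and the in-page
offset. It assumes a common layout with four translation levels,
4KiB pages and 9 index bits per level (as on x86-64); other
architectures and page sizes split the address differently::

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT      12      /* 4KiB pages */
  #define INDEX_BITS      9       /* 512 entries per page table */
  #define INDEX_MASK      ((1ULL << INDEX_BITS) - 1)

  int main(void)
  {
          uint64_t vaddr = 0x00007f1234567abcULL;  /* an example user address */

          uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);
          uint64_t pte    = (vaddr >> PAGE_SHIFT) & INDEX_MASK;
          uint64_t pmd    = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
          uint64_t pud    = (vaddr >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK;
          uint64_t pgd    = (vaddr >> (PAGE_SHIFT + 3 * INDEX_BITS)) & INDEX_MASK;

          /* Each index selects an entry at one level; the offset stays as-is */
          printf("pgd %llu pud %llu pmd %llu pte %llu offset 0x%llx\n",
                 (unsigned long long)pgd, (unsigned long long)pud,
                 (unsigned long long)pmd, (unsigned long long)pte,
                 (unsigned long long)offset);
          return 0;
  }
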
Huge Pages
==========

The address translation requires several memory accesses, and memory
accesses are slow relative to the CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). The TLB is usually a scarce resource, and applications with a
large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of memory pages directly
by the higher levels in the page table. For instance, on x86, it is
possible to map 2M and even 1G pages using entries in the second and
the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit rate and thus improves overall system performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For files created in this filesystem the data resides in
memory and is mapped using huge pages. The hugetlbfs is described at
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

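Besides mapping files from a mounted hugetlbfs, anonymous memory can be
taken from the hugetlb pool with the ``MAP_HUGETLB`` flag of mmap(2).
A minimal sketch, assuming the administrator has reserved 2MB huge
pages beforehand (for example via ``/proc/sys/vm/nr_hugepages``);
otherwise the mapping fails::

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #define HUGE_SIZE (2UL * 1024 * 1024)   /* one 2MB huge page */

  int main(void)
  {
          /* Backed by the hugetlb pool; fails if no huge pages are reserved */
          void *buf = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
          if (buf == MAP_FAILED) {
                  perror("mmap(MAP_HUGETLB)");
                  return 1;
          }

          memset(buf, 0, HUGE_SIZE);      /* touch the memory so it is faulted in */
          munmap(buf, HUGE_SIZE);
          return 0;
  }
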
Another, more recent, mechanism that enables use of huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by huge pages, THP manages
such mappings transparently to the user and hence the name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

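When THP is configured in ``madvise`` mode, an application has to opt
a mapping in explicitly with madvise(2). A minimal sketch; the 32MB
length is just an example chosen as a multiple of the 2MB huge page
size::

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  #define LEN (32UL * 1024 * 1024)

  int main(void)
  {
          void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* Hint that this range is a good candidate for transparent huge pages */
          if (madvise(buf, LEN, MADV_HUGEPAGE))
                  perror("madvise(MADV_HUGEPAGE)");

          munmap(buf, LEN);
          return 0;
  }
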
Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

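The zones present on a particular machine can be inspected through
``/proc/zoneinfo``, where every zone description starts with a
``Node <n>, zone <name>`` line. A small sketch that prints just those
header lines::

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[256];
          FILE *f = fopen("/proc/zoneinfo", "r");

          if (!f) {
                  perror("/proc/zoneinfo");
                  return 1;
          }

          /* Every zone description starts with a "Node <n>, zone <name>" line */
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "Node", 4))
                          fputs(line, stdout);

          fclose(f);
          return 0;
  }
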
Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

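The nodes known to the running kernel are exported through sysfs. A
small sketch that prints the online node mask from
``/sys/devices/system/node/online`` (a single-node machine simply
reports ``0``)::

  #include <stdio.h>

  int main(void)
  {
          char nodes[64];
          FILE *f = fopen("/sys/devices/system/node/online", "r");

          if (!f) {
                  perror("node/online");
                  return 1;
          }

          /* The file holds a range list such as "0" or "0-3" */
          if (fgets(nodes, sizeof(nodes), f))
                  printf("online NUMA nodes: %s", nodes);

          fclose(f);
          return 0;
  }
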
Page cache
==========

The physical memory is volatile and the common case for getting data
into memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
subsequent reads. Similarly, when one writes to a file, the data is
placed in the page cache and eventually gets into the backing storage
device. The written pages are marked as `dirty` and when Linux decides
to reuse them for other purposes, it makes sure to synchronize the
file contents on the device with the updated data.

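Data that has only reached the page cache is not yet durable; fsync(2)
forces the dirty pages of a file out to the backing device. A minimal
sketch (the file name is only an example)::

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          const char msg[] = "hello, page cache\n";
          int fd = open("/tmp/pagecache-example", O_CREAT | O_WRONLY | O_TRUNC, 0644);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          /* The data lands in the page cache first and the pages become dirty */
          if (write(fd, msg, strlen(msg)) < 0)
                  perror("write");

          /* Force writeback of the dirty pages to the backing device */
          if (fsync(fd))
                  perror("fsync");

          close(fd);
          return 0;
  }
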
Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, anonymous mappings only define the virtual
memory areas that the program is allowed to access. Read accesses
result in the creation of a page table entry that references a special
physical page filled with zeroes. When the program performs a write, a
regular physical page will be allocated to hold the written data. The
page will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.

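A minimal sketch of an anonymous mapping: reads return zeroes (served
from the shared zero page) and the first write to a page faults in a
real physical page::

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  #define LEN (4UL * 1024 * 1024)

  int main(void)
  {
          /* Anonymous and private: no file backs this memory */
          char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          printf("first byte before any write: %d\n", p[0]);  /* prints 0 */
          p[0] = 42;      /* the first write allocates a real physical page */
          printf("first byte after the write:  %d\n", p[0]);

          munmap(p, LEN);
          return 0;
  }
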
Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data: kernel-internal data structures, DMA'able
buffers for device driver use, data read from a filesystem, memory
allocated by user space processes, etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of
reclaimable pages are the page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of free pages goes
down and when it reaches a certain threshold (high watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
will trigger `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.

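Reclaim activity can be observed through the counters exported in
``/proc/vmstat``: on recent kernels the ``pgscan_kswapd`` and
``pgsteal_kswapd`` counters account for background reclaim while
``pgscan_direct`` and ``pgsteal_direct`` account for direct reclaim.
A small sketch that prints them::

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[128];
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f) {
                  perror("/proc/vmstat");
                  return 1;
          }

          /* pgscan_ and pgsteal_ counters distinguish kswapd from direct reclaim */
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "pgscan_", 7) || !strncmp(line, "pgsteal_", 8))
                          fputs(line, stdout);

          fclose(f);
          return 0;
  }
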
Compaction
==========

As the system runs, tasks allocate and free memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it
is necessary to allocate large physically contiguous memory areas. Such
a need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

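On kernels built with ``CONFIG_COMPACTION``, an administrator can also
request compaction of all zones explicitly by writing ``1`` to
``/proc/sys/vm/compact_memory``. A minimal sketch (requires root)::

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

          if (!f) {
                  perror("compact_memory");
                  return 1;
          }

          /* Writing 1 asks the kernel to compact all memory zones */
          fputs("1", f);
          fclose(f);
          return 0;
  }
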
OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and the
kernel will be unable to reclaim enough memory to continue to operate. In
order to save the rest of the system, it invokes the `OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it exits
enough memory will be freed to continue normal operation.
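
The victim selection can be biased per task through
``/proc/<pid>/oom_score_adj``, which ranges from -1000 (never kill) to
1000 (kill first). A minimal sketch that makes the calling process a
preferred victim (lowering the value again would require privileges)::

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/self/oom_score_adj", "w");

          if (!f) {
                  perror("oom_score_adj");
                  return 1;
          }

          /* 1000 makes this task the most likely OOM-killer victim */
          fputs("1000", f);
          fclose(f);
          return 0;
  }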