.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. There is also NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At the time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. However, the `mem_map` array
is not usable until the call to :c:func:`memblock_free_all` that hands all
the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
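The FLATMEM conversion can be modelled in a few lines of user-space C.
This is only a sketch, not the kernel's implementation: the cut-down
`struct page`, the RAM base of 0x80000000, the 4 KiB page size and the
1024-entry `mem_map` are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed example platform: RAM starts at 0x80000000, 4 KiB pages. */
#define PAGE_SHIFT	12
#define ARCH_PFN_OFFSET	(0x80000000UL >> PAGE_SHIFT)
#define NR_PAGES	1024UL

/* One global array covers the whole physical range, holes included. */
static struct page mem_map[NR_PAGES];

/* PFN - ARCH_PFN_OFFSET is the index into mem_map. */
static inline struct page *pfn_to_page(unsigned long pfn)
{
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* The offset of the page within mem_map plus ARCH_PFN_OFFSET is its PFN. */
static inline unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both helpers are plain pointer arithmetic, which is why FLATMEM is the
cheapest model when memory is contiguous or nearly so.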

DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly like FLATMEM's `mem_map`:
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model is slightly more complex, as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert a PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic as it uses the node number encoded in
page->flags.

Once the node number is known, the PFN can be used to index the
appropriate `node_mem_map` array to access the `struct page`, and
the offset of the `struct page` from the `node_mem_map` plus
`node_start_pfn` is the PFN of that page.
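The two-step lookup described above can be sketched in user-space C.
Everything specific here is an assumption made for the example: the
two-node layout, the linear search in `pfn_to_nid` (real architectures
provide their own implementation), the reduced `pg_data_t`, and the
choice to keep the node number in the topmost bit of `flags`.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;	/* top bit holds the node id in this sketch */
};

/* Cut-down pg_data_t with only the fields discussed in the text. */
typedef struct pglist_data {
	struct page *node_mem_map;
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
} pg_data_t;

#define MAX_NUMNODES	2
#define NODES_SHIFT	1
#define NODES_PGSHIFT	(sizeof(unsigned long) * 8 - NODES_SHIFT)

/* Hypothetical layout: two banks of 256 pages at distinct addresses. */
static struct page node0_map[256];
static struct page node1_map[256];

static pg_data_t node_data[MAX_NUMNODES] = {
	{ node0_map, 0x00000UL, 256 },
	{ node1_map, 0x10000UL, 256 },
};

/* Returns the node hosting pfn, or -1 if pfn falls into a hole. */
static inline int pfn_to_nid(unsigned long pfn)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++) {
		const pg_data_t *pgdat = &node_data[nid];

		if (pfn >= pgdat->node_start_pfn &&
		    pfn - pgdat->node_start_pfn < pgdat->node_spanned_pages)
			return nid;
	}
	return -1;
}

/* Generic: the node number is read back from page->flags. */
static inline int page_to_nid(const struct page *page)
{
	return (int)(page->flags >> NODES_PGSHIFT);
}

/* The caller must pass a PFN that belongs to some node. */
static inline struct page *pfn_to_page(unsigned long pfn)
{
	pg_data_t *pgdat = &node_data[pfn_to_nid(pfn)];

	return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
}

static inline unsigned long page_to_pfn(const struct page *page)
{
	const pg_data_t *pgdat = &node_data[page_to_nid(page)];

	return (unsigned long)(page - pgdat->node_mem_map) +
	       pgdat->node_start_pfn;
}

/* Record the owning node in every page, as memory map init would. */
static inline void init_node_maps(void)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		for (unsigned long i = 0; i < node_data[nid].node_spanned_pages; i++)
			node_data[nid].node_mem_map[i].flags =
				(unsigned long)nid << NODES_PGSHIFT;
}
```

Compared with FLATMEM, each conversion pays for one extra step: finding
the node first, then doing the same array arithmetic within that node.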

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the management of sections. The section size and the maximal
number of sections are specified using the `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
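As a worked example, assume `MAX_PHYSMEM_BITS` is 46 and
`SECTION_SIZE_BITS` is 27. These values are chosen for illustration
only and are not taken from this document. Each section then covers
128 MiB of physical address space, and

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(46 - 27)} = 2 ^ {19} = 524288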

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains `PAGE_SIZE` worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
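The `CONFIG_SPARSEMEM_EXTREME` layout can be sketched as a two-level
table in user-space C. The macro names follow those used in the kernel
headers, but the values (4 KiB pages, 46 physical address bits, 27
section size bits), the two-word `struct mem_section` and the
`section_lookup` helper are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE		4096UL
#define SECTION_SIZE_BITS	27
#define MAX_PHYSMEM_BITS	46
#define NR_MEM_SECTIONS		(1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* Reduced to two words so that a row divides PAGE_SIZE evenly. */
struct mem_section {
	unsigned long section_mem_map;
	unsigned long pad;
};

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT	(PAGE_SIZE / sizeof(struct mem_section))
#define NR_SECTION_ROOTS \
	((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)
#define SECTION_NR_TO_ROOT(nr)	((nr) / SECTIONS_PER_ROOT)
#define SECTION_ROOT_MASK	(SECTIONS_PER_ROOT - 1)

/* Array of row pointers; both levels are allocated on demand. */
static struct mem_section **mem_sections;

/*
 * Return the mem_section for section number nr. With create == 0 the
 * lookup fails (NULL) if the row was never allocated; with create != 0
 * missing levels are allocated first.
 */
static inline struct mem_section *section_lookup(unsigned long nr, int create)
{
	unsigned long root;

	if (nr >= NR_MEM_SECTIONS)
		return NULL;
	root = SECTION_NR_TO_ROOT(nr);

	if (!mem_sections) {
		if (!create)
			return NULL;
		mem_sections = calloc(NR_SECTION_ROOTS, sizeof(*mem_sections));
		if (!mem_sections)
			return NULL;
	}
	if (!mem_sections[root]) {
		if (!create)
			return NULL;
		mem_sections[root] = calloc(SECTIONS_PER_ROOT,
					    sizeof(struct mem_section));
		if (!mem_sections[root])
			return NULL;
	}
	return &mem_sections[root][nr & SECTION_ROOT_MASK];
}
```

Only rows that actually contain present sections consume memory, which
is the point of the dynamic layout on machines with a very sparse
physical address space. In the static layout, by contrast, every row
degenerates to a single object.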

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index into the array of pages.
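A user-space sketch of the classic sparse lookup follows. It simplifies
in two ways that should be kept in mind: `section_mem_map` is stored as
a plain pointer and indexed with the low bits of the PFN, whereas the
kernel stores an encoded value so that the whole PFN can serve as the
index, and the sections are deliberately tiny (64 KiB). The constants,
the flat `mem_sections` array and the `section_online` helper are
assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	16	/* tiny sections, for the demo only */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS		8UL
#define SECTIONS_WIDTH		3	/* log2(NR_MEM_SECTIONS) */
#define SECTIONS_PGSHIFT	(sizeof(unsigned long) * 8 - SECTIONS_WIDTH)

struct page {
	unsigned long flags;	/* top bits hold the section number */
};

struct mem_section {
	struct page *section_mem_map;	/* NULL: section not present */
};

static struct mem_section mem_sections[NR_MEM_SECTIONS];

/* Memory maps for two hypothetical present sections. */
static struct page section2_map[PAGES_PER_SECTION];
static struct page section5_map[PAGES_PER_SECTION];

/* The high bits of the PFN select the section. */
static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

static inline unsigned long page_to_section(const struct page *page)
{
	return page->flags >> SECTIONS_PGSHIFT;
}

/* The caller must pass a PFN that lies in a present section. */
static inline struct page *pfn_to_page(unsigned long pfn)
{
	struct mem_section *ms = &mem_sections[pfn_to_section_nr(pfn)];

	return ms->section_mem_map + (pfn & (PAGES_PER_SECTION - 1));
}

static inline unsigned long page_to_pfn(const struct page *page)
{
	unsigned long nr = page_to_section(page);

	return (nr << PFN_SECTION_SHIFT) +
	       (unsigned long)(page - mem_sections[nr].section_mem_map);
}

/* Attach a memory map to section nr and tag its pages with nr. */
static inline void section_online(unsigned long nr, struct page *map)
{
	mem_sections[nr].section_mem_map = map;
	for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
		map[i].flags = nr << SECTIONS_PGSHIFT;
}
```

The section number stored in `flags` is what makes `page_to_pfn`
possible: given only a `struct page` pointer, there is otherwise no way
to tell which section's array it belongs to.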

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index into that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
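With vmemmap both conversions reduce to a single addition or
subtraction, as the following user-space sketch shows. The statically
allocated `vmemmap_backing` array and the `MAX_PFN` limit are
assumptions that stand in for the reserved virtual range; in the kernel
only the parts of that range that describe present memory are backed by
physical pages.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;
};

/* Hypothetical stand-in for the reserved virtual address range. */
#define MAX_PFN	4096UL
static struct page vmemmap_backing[MAX_PFN];

/* Global base of the virtually contiguous memory map. */
static struct page *const vmemmap = vmemmap_backing;

/* A PFN is an index into the array that vmemmap points to. */
static inline struct page *pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

/* The offset of the page from vmemmap is its PFN. */
static inline unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

No section lookup and no read of `flags` is needed, which is the
optimization the text refers to.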

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device-driver-identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` services for the given range of PFNs. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.
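The 2MB granularity amounts to a simple alignment rule on the start and
the size of a range. The check below is a user-space sketch of that
rule only: `range_is_subsection_aligned` is a hypothetical helper, not
a kernel function, and the `SUBSECTION_SHIFT` value of 21 is assumed
here because it corresponds to the 2MB figure above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2^21 bytes = 2MB, the sub-section granularity assumed in this sketch. */
#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(UINT64_C(1) << SUBSECTION_SHIFT)

/*
 * Return true if the physical range [start, start + size) is non-empty
 * and both its start and its size are multiples of SUBSECTION_SIZE.
 */
static inline bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & (SUBSECTION_SIZE - 1)) == 0 &&
	       (size & (SUBSECTION_SIZE - 1)) == 0;
}
```

For instance, a 2MB range starting at 4 GiB passes the check, while the
same range shifted by one 4 KiB page does not.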

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device driver to coordinate memory management
  events related to device memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/PCIe topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.