.. _hmm:

=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on-board memory) into the regular kernel path, with the
cornerstone of this being a specialized struct page for such memory (see
sections 5 to 7 of this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program addresses coherently with
the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
====================================================

Devices with a large amount of on-board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because devices can only access memory allocated
through a device specific API. This implies that all memory objects in a program
are not equal from the device point of view, which complicates large programs
that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private,
mmap shared) and memory allocated through the device driver API (this still
ends up with an mmap, but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
for complex data sets (list, tree, ...) it's hard to get right. Duplicating a
complex data set requires re-mapping all the pointer relations between each of
its elements. This is error prone and programs get harder to debug because of
the duplicate data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library and thus each library
might have to duplicate its input data set using the device specific memory
allocator. Large projects suffer from this and waste resources because of the
various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only feasible with a shared address space. It is also more reasonable to
use a shared address space for all other patterns.



I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from a CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is worse
in the other direction: the CPU can only access a limited range of the device
memory and cannot perform atomic operations on it. Thus device memory cannot
be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency. Access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations
the architecture supports. Sadly, not all platforms are following this trend and
some major architectures are left without hardware solutions to these problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you must
allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations,
flush, ...). This cannot be done through common code for all devices. Hence,
HMM provides helpers to factor out everything that can be factored out, while
leaving the hardware specific details to the device driver.
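
A purely illustrative, hedged sketch of the kind of driver-side sequence this
implies; every function and type below is a hypothetical placeholder for
hardware specific driver code, not an HMM or kernel API::

    /* Hypothetical: update the device page table for [start, end) by
     * writing commands into a buffer taken from a pre-allocated pool. */
    static int my_device_update_range(struct my_device *mdev,
                                      unsigned long start, unsigned long end)
    {
        struct my_cmd_buffer *cmds = my_cmd_buffer_get(mdev);

        if (!cmds)
            return -ENOMEM;

        my_cmd_unmap_range(cmds, start, end);     /* unmap old entries */
        my_cmd_invalidate_tlb(cmds, start, end);  /* device TLB/cache flush */
        my_cmd_buffer_submit(mdev, cmds);         /* queue to the hardware */
        return my_cmd_buffer_wait(mdev, cmds);    /* wait for completion */
    }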

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms, and from the
CPU point of view everything looks like a page that has been swapped out to
disk. Using a struct page gives the easiest and cleanest integration with
existing mm mechanisms. Here again, HMM only provides helpers, first to hotplug
new ZONE_DEVICE memory for the device memory and second to perform migration.
Policy decisions of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror process address
space and keeps both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively
being used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
the CPU page table into a device page table; HMM helps keep both synchronized.
A device driver that wants to mirror a process address space must start with
the registration of a mmu_interval_notifier::

 int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
                                  struct mm_struct *mm, unsigned long start,
                                  unsigned long length,
                                  const struct mmu_interval_notifier_ops *ops);

During the ops->invalidate() callback the device driver must perform the
update action to the range (mark range read only, or fully unmap, etc.). The
device must complete the update before the driver callback returns.
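
A hedged sketch of what such an invalidate() callback might look like, loosely
modeled on the lib/test_hmm.c self-test; ``struct my_device``, its
``notifier`` and ``mutex`` members, and ``my_device_unmap_range()`` are
hypothetical driver internals, not part of the HMM API::

    static bool my_invalidate(struct mmu_interval_notifier *interval_sub,
                              const struct mmu_notifier_range *range,
                              unsigned long cur_seq)
    {
        struct my_device *mdev = container_of(interval_sub,
                                              struct my_device, notifier);

        /*
         * Take the same lock that is held while programming the device
         * page table. If the invalidation may not block, only try-lock.
         */
        if (mmu_notifier_range_blockable(range))
            mutex_lock(&mdev->mutex);
        else if (!mutex_trylock(&mdev->mutex))
            return false;

        /* Record the new sequence number before touching the device. */
        mmu_interval_set_seq(interval_sub, cur_seq);

        /* Unmap/flush the affected range in the device page table. */
        my_device_unmap_range(mdev, range->start, range->end);

        mutex_unlock(&mdev->mutex);
        return true;
    }

    static const struct mmu_interval_notifier_ops my_notifier_ops = {
        .invalidate = my_invalidate,
    };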

When the device driver wants to populate a range of virtual addresses, it can
use::

  int hmm_range_fault(struct hmm_range *range);

It will trigger a page fault on missing or read-only entries if write access is
requested (see below). Page faults use the generic mm page fault code path just
like a CPU page fault.

hmm_range_fault() copies CPU page table entries into its hmm_pfns array
argument. Each entry in that array corresponds to an address in the virtual
range. HMM provides a set of flags to help the driver identify special CPU
page table entries.

Locking within the ops->invalidate() callback is the most important aspect the
driver must respect in order to keep things properly synchronized. The usage
pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      struct mm_struct *mm;
      ...

      range.notifier = &interval_sub;
      range.start = ...;
      range.end = ...;
      range.hmm_pfns = ...;

      mm = interval_sub.mm;
      if (!mmget_not_zero(mm))
          return -EFAULT;

 again:
      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
      mmap_read_lock(mm);
      ret = hmm_range_fault(&range);
      if (ret) {
          mmap_read_unlock(mm);
          if (ret == -EBUSY)
                 goto again;
          mmput(mm);
          return ret;
      }
      mmap_read_unlock(mm);

      take_lock(driver->update);
      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
          release_lock(driver->update);
          goto again;
      }

      /* Use the hmm_pfns array contents to update the device page table,
       * while holding the driver->update lock. */

      release_lock(driver->update);
      mmput(mm);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.

Leverage default_flags and pfn_flags_mask
=========================================

The hmm_range struct has two fields, default_flags and pfn_flags_mask, that
specify fault or snapshot policy for the whole range instead of having to set
them for each entry in the hmm_pfns array.

For instance, if the device driver wants pages for a range with at least read
permission, it sets::

    range->default_flags = HMM_PFN_REQ_FAULT;
    range->pfn_flags_mask = 0;

and calls hmm_range_fault() as described above. This will fault in all pages
in the range with at least read permission.

Now let's say the driver wants to do the same except for one page in the range
for which it wants to have write permission. The driver then sets::

    range->default_flags = HMM_PFN_REQ_FAULT;
    range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
    range->hmm_pfns[index_of_write] = HMM_PFN_REQ_WRITE;

With this, HMM will fault in all pages with at least read permission (i.e.,
valid), and for the address == range->start + (index_of_write << PAGE_SHIFT)
it will fault with write permission, i.e., if the CPU pte does not have write
permission set, then HMM will call handle_mm_fault().

After hmm_range_fault() completes, the flag bits are set to the current state
of the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the
page is writable.
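
As a hedged sketch, a driver walking the returned array under its update lock
might decode the entries roughly like this; ``npages`` is the number of pages
covered by the range, and ``device_pte`` and ``pte_from_hmm_pfn()`` are
hypothetical stand-ins for the device specific page table encoding::

    unsigned long i;

    for (i = 0; i < npages; i++) {
        unsigned long entry = range.hmm_pfns[i];
        struct page *page;

        if (!(entry & HMM_PFN_VALID)) {
            /* Nothing mapped at this address; leave the device PTE empty. */
            continue;
        }

        page = hmm_pfn_to_page(entry);
        /* Hypothetical helper: encode a device PTE for this page,
         * writable only if the CPU page table allows writes. */
        device_pte[i] = pte_from_hmm_pfn(page_to_pfn(page),
                                         entry & HMM_PFN_WRITE);
    }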


Represent and manage device memory from core kernel point of view
=================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

Migration to and from device memory
===================================

Because the CPU cannot access device memory directly, the device driver must
use hardware DMA or device specific load/store instructions to migrate data.
The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
functions are designed to make drivers easier to write and to centralize common
code across drivers.

Before migrating pages to device private memory, special device private
``struct page`` instances need to be created. These will be used as special
"swap" page table entries so that a CPU process will fault if it tries to
access a page that has been migrated to device private memory.

These can be allocated and freed with::

    struct resource *res;
    struct dev_pagemap pagemap;

    res = request_free_mem_region(&iomem_resource, /* number of bytes */,
                                  "name of driver resource");
    pagemap.type = MEMORY_DEVICE_PRIVATE;
    pagemap.range.start = res->start;
    pagemap.range.end = res->end;
    pagemap.nr_range = 1;
    pagemap.ops = &device_devmem_ops;
    memremap_pages(&pagemap, numa_node_id());

    memunmap_pages(&pagemap);
    release_mem_region(pagemap.range.start, range_len(&pagemap.range));

There are also devm_request_free_mem_region(), devm_memremap_pages(),
devm_memunmap_pages(), and devm_release_mem_region() when the resources can
be tied to a ``struct device``.
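
The ``device_devmem_ops`` referenced above is a driver provided
``struct dev_pagemap_ops``. A minimal, hedged sketch of its shape (the body of
the ``migrate_to_ram()`` callback would use the migration steps described
below to move the data back to system memory)::

    static void my_devmem_page_free(struct page *page)
    {
        /* Return the device private page to the driver's allocator. */
    }

    static vm_fault_t my_devmem_migrate_to_ram(struct vm_fault *vmf)
    {
        /*
         * Called when the CPU faults on an address backed by device
         * private memory: migrate the data back to a system memory page,
         * typically with migrate_vma_setup()/migrate_vma_pages()/
         * migrate_vma_finalize() selecting MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
         */
        return 0;       /* or VM_FAULT_SIGBUS on failure */
    }

    static const struct dev_pagemap_ops device_devmem_ops = {
        .page_free = my_devmem_page_free,
        .migrate_to_ram = my_devmem_migrate_to_ram,
    };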

The overall migration steps are similar to migrating NUMA pages within system
memory (see :ref:`Page migration <page_migration>`) but the steps are split
between device driver specific code and shared common code (a consolidated,
hedged sketch of a driver migration function follows this list):

1. ``mmap_read_lock()``

   The device driver has to pass a ``struct vm_area_struct`` to
   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
   be held for the duration of the migration.

2. ``migrate_vma_setup(struct migrate_vma *args)``

   The device driver initializes the ``struct migrate_vma`` fields and passes
   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
   filter which source pages should be migrated. For example, setting
   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
   device private memory. If the latter flag is set, the ``args->pgmap_owner``
   field is used to identify device private pages owned by the driver. This
   avoids trying to migrate device private pages residing in other devices.
   Currently only anonymous private VMA ranges can be migrated to or from
   system memory and device private memory.

   One of the first steps migrate_vma_setup() does is to invalidate other
   devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
   ``mmu_notifier_invalidate_range_end()`` calls around the page table
   walks to fill in the ``args->src`` array with PFNs to be migrated.
   The ``invalidate_range_start()`` callback is passed a
   ``struct mmu_notifier_range`` with the ``event`` field set to
   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
   allows the device driver to skip the invalidation callback and only
   invalidate device private MMU mappings that are actually migrating.
   This is explained more in the next section.

   While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
   entry results in a valid "zero" PFN stored in the ``args->src`` array.
   This lets the driver allocate device private memory and clear it instead
   of copying a page of zeros. Valid PTE entries to system memory or
   device private struct pages will be locked with ``lock_page()``, isolated
   from the LRU (if system memory since device private pages are not on
   the LRU), unmapped from the process, and a special migration PTE is
   inserted in place of the original PTE.
   migrate_vma_setup() also clears the ``args->dst`` array.

3. The device driver allocates destination pages and copies source pages to
   destination pages.

   The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE``
   bit is set and skips entries that are not migrating. The device driver
   can also choose to skip migrating a page by not filling in the ``dst``
   array for that page.

   The driver then allocates either a device private struct page or a
   system memory page, locks the page with ``lock_page()``, and fills in the
   ``dst`` array entry with::

     dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;

   Now that the driver knows that this page is being migrated, it can
   invalidate device private MMU mappings and copy device private memory
   to system memory or another device private page. The core Linux kernel
   handles CPU page table invalidations so the device driver only has to
   invalidate its own MMU mappings.

   The driver can use ``migrate_pfn_to_page(src[i])`` to get the
   ``struct page`` of the source and either copy the source page to the
   destination or clear the destination device private memory if the pointer
   is ``NULL``, meaning the source page was not populated in system memory.

4. ``migrate_vma_pages()``

   This step is where the migration is actually "committed".

   If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this
   is where the newly allocated page is inserted into the CPU's page table.
   This can fail if a CPU thread faults on the same page. However, the page
   table is locked and only one of the new pages will be inserted.
   The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared
   if it loses the race.

   If the source page was locked, isolated, etc. the source ``struct page``
   information is now copied to the destination ``struct page``, finalizing
   the migration on the CPU side.

5. Device driver updates device MMU page tables for pages still migrating,
   rolling back pages not migrating.

   If the ``src`` entry still has the ``MIGRATE_PFN_MIGRATE`` bit set, the
   device driver can update the device MMU and set the write enable bit if
   the ``MIGRATE_PFN_WRITE`` bit is set.

6. ``migrate_vma_finalize()``

   This step replaces the special migration page table entry with the new
   page's page table entry and releases the reference to the source and
   destination ``struct page``.

7. ``mmap_read_unlock()``

   The lock can now be released.
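
Below is a rough, hedged sketch of a driver function walking through these
steps for a small anonymous range, migrating it into device private memory.
``struct my_device``, ``my_device_alloc_private_page()``,
``my_device_copy_to_device()``, ``my_device_clear_page()``, and the device
page table update are hypothetical driver internals; only the
``migrate_vma_*()`` calls and PFN flags are the common API described in the
list above::

    static int my_migrate_range_to_device(struct my_device *mdev,
                                          struct vm_area_struct *vma,
                                          unsigned long start,
                                          unsigned long end)
    {
        unsigned long src[16], dst[16];   /* sketch only: at most 16 pages */
        struct migrate_vma args = {
            .vma = vma,
            .src = src,
            .dst = dst,
            .start = start,
            .end = end,
            .pgmap_owner = mdev,
            .flags = MIGRATE_VMA_SELECT_SYSTEM,
        };
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;
        int ret;

        /* Step 2: collect and unmap the source pages (mmap lock held by caller). */
        ret = migrate_vma_setup(&args);
        if (ret)
            return ret;

        /* Step 3: allocate device private pages and copy the data. */
        for (i = 0; i < npages; i++) {
            struct page *spage = migrate_pfn_to_page(src[i]);
            struct page *dpage;

            if (!(src[i] & MIGRATE_PFN_MIGRATE))
                continue;       /* this page is not migrating */

            dpage = my_device_alloc_private_page(mdev);     /* hypothetical */
            if (!dpage)
                continue;       /* leaving dst[i] zero skips this page */

            lock_page(dpage);
            if (spage)
                my_device_copy_to_device(mdev, spage, dpage);  /* hypothetical DMA */
            else
                my_device_clear_page(mdev, dpage);  /* zero/empty source page */

            dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
            if (src[i] & MIGRATE_PFN_WRITE)
                dst[i] |= MIGRATE_PFN_WRITE;
        }

        /* Steps 4 and 5: commit the migration and update the device MMU. */
        migrate_vma_pages(&args);
        /* ... update device page table entries for the migrated pages ... */

        /* Step 6: remove migration PTEs and release page references. */
        migrate_vma_finalize(&args);
        return 0;
    }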

Memory cgroup (memcg) and rss accounting
========================================

For now, device memory is accounted as any regular page in rss counters (either
anonymous if the device page is used for anonymous memory, file if the device
page backs a file page, or shmem if the device page is used for shared memory).
This is a deliberate choice to keep existing applications, that might start
using device memory without knowing about it, running unimpacted.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory and thus not freeing much
system memory. We want to gather more real world experience on how applications
and the system react under memory pressure in the presence of device memory
before deciding to account device memory differently.


The same decision was made for memory cgroups. Device memory pages are
accounted against the same memory cgroup that a regular page would be accounted
to. This does simplify migration to and from device memory. It also means that
migration back from device memory to regular memory cannot fail because it
would go above the memory cgroup limit. We might revisit this choice later on
once we get more experience with how device memory is used and its impact on
memory resource control.


Note that device memory can never be pinned by a device driver nor through GUP
and thus such memory is always freed upon process exit, or when the last
reference is dropped in the case of shared memory or file backed memory.