.. _mmu_notifier:

When do you need to notify inside the page table lock?
======================================================

When clearing a pte/pmd, we can choose to notify the event under the page
table lock (the notify versions of \*_clear_flush call
mmu_notifier_invalidate_range()). That notification is not necessary in all
cases, however.

For a secondary TLB (a non-CPU TLB) such as an IOMMU TLB or a device TLB (when
a device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only two cases in
which you need to notify the secondary TLB while holding the page table lock
when clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to risk the device writing to a page that
might now be used by some completely different task.

Case B is more subtle. For correctness it requires the following sequence to
happen:

  - take the page table lock
  - clear the page table entry and notify
    ([pmd/pte]p_huge_clear_flush_notify())
  - set the page table entry to point to the new page

If clearing the page table entry is not followed by a notification before the
new pte/pmd value is set, you can break a memory model such as C11 or C++11
for the device.
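
The sequence above can be illustrated with a small userspace toy model. This
is only a sketch under stated assumptions, not kernel code: the AddressSpace
class, its dictionaries and the notify_under_lock flag are hypothetical names
introduced for illustration, a Python lock stands in for the page table lock,
and dropping a cached entry stands in for the secondary TLB invalidation.

```python
import threading


class AddressSpace:
    """Toy model: one CPU page table plus one secondary (device) TLB."""

    def __init__(self):
        self.ptl = threading.Lock()   # stands in for the page table lock
        self.page_table = {}          # virtual address -> page name
        self.device_tlb = {}          # virtual address -> cached page name

    def device_translate(self, vaddr):
        # On a miss the device caches whatever the page table says.
        if vaddr not in self.device_tlb:
            self.device_tlb[vaddr] = self.page_table[vaddr]
        return self.device_tlb[vaddr]

    def replace_page(self, vaddr, new_page, notify_under_lock):
        with self.ptl:                            # take the page table lock
            self.page_table.pop(vaddr)            # clear the page table entry
            if notify_under_lock:
                self.device_tlb.pop(vaddr, None)  # ... and notify
            self.page_table[vaddr] = new_page     # point entry to the new page


def device_sees_old_page(notify_under_lock):
    """Return True if the device still translates to the old page."""
    mm = AddressSpace()
    mm.page_table["addrA"] = "old_page"
    mm.device_translate("addrA")                  # device TLB caches old_page
    mm.replace_page("addrA", "new_page", notify_under_lock)
    return mm.device_translate("addrA") == "old_page"
```

In this model, device_sees_old_page(False) reports a stale translation that
outlives the page table update, while device_sees_old_page(True) does not.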

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Take two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE.
We assume they are write protected for COW (the other variants of case B apply
too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

Here, because the page table update at time N+2 was not paired with a
notification to invalidate the secondary TLB, the device sees the new value
for addrB before it sees the new value for addrA. This breaks total memory
ordering for the device.
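
The timeline can be replayed step by step in a userspace toy model. This is a
sketch only: replay_timeline(), the page names and the dictionaries are
hypothetical, the value 1 stands for the newly written data, and the model
assumes that mmu_notifier_invalidate_range_start() leaves the device TLB
untouched, as the timeline implies.

```python
def replay_timeline(notify_under_lock):
    """Replay the scenario; return what the device reads at time N+5."""
    memory = {"oldA": 0, "oldB": 0, "newA": 0, "newB": 0}  # page -> value
    page_table = {"addrA": "oldA", "addrB": "oldB"}
    device_tlb = {}

    def dev_read(addr):
        # The device walks the page table only on a TLB miss.
        if addr not in device_tlb:
            device_tlb[addr] = page_table[addr]
        return memory[device_tlb[addr]]

    def cpu_write(addr, value):
        memory[page_table[addr]] = value

    # Time N: the device reads both addresses and populates its TLB.
    dev_read("addrA")
    dev_read("addrB")
    # Time N+1: range_start for both addresses (assumed: no TLB flush).
    # Time N+2: COW points both entries to new pages.
    for addr, new_page in (("addrA", "newA"), ("addrB", "newB")):
        page_table[addr] = new_page
        if notify_under_lock:
            device_tlb.pop(addr, None)  # invalidate under the lock
    # Time N+3: other CPU threads write addrA, then addrB (new pages).
    cpu_write("addrA", 1)
    cpu_write("addrB", 1)
    # Time N+4: only CPU-thread-1 reaches range_end, for addrB.
    device_tlb.pop("addrB", None)
    # Time N+5: the device reads both addresses again.
    return dev_read("addrA"), dev_read("addrB")
```

Without the notification the model returns (0, 1): the device observes the
later write to addrB but not the earlier write to addrA. With the notification
under the lock it returns (1, 1).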

When changing a pte to write protect it, or to point it to a new
write-protected page with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range() call until mmu_notifier_invalidate_range_end(),
outside the page table lock. This holds even if the thread doing the page
table update is preempted right after releasing the page table lock but before
calling mmu_notifier_invalidate_range_end().
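
A toy model of the KSM case suggests why the delay is harmless for device
reads. It is a sketch under assumptions: ksm_merge_delayed_notify() and the
page names are hypothetical, and the model covers reads only.

```python
def ksm_merge_delayed_notify():
    """Return (read through stale TLB entry, read after invalidation)."""
    memory = {"private": 42, "shared": 42}   # KSM merges identical pages
    page_table = {"addr": "private"}
    device_tlb = {"addr": "private"}         # device cached the old page

    # The pte now points to the shared page; no notify under the lock.
    page_table["addr"] = "shared"

    # The device reads before range_end, through the stale translation.
    stale_read = memory[device_tlb["addr"]]

    # Delayed invalidation at range_end, then the device refills its TLB.
    device_tlb.pop("addr")
    device_tlb["addr"] = page_table["addr"]
    fresh_read = memory[device_tlb["addr"]]
    return stale_read, fresh_read
```

Both reads return the same value because the old and new pages hold the same
content, so the stale translation cannot expose an ordering violation in this
model.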