.. _mmu_notifier:

When do you need to notify inside the page table lock?
======================================================

When clearing a pte/pmd we have the choice of notifying the event (the notify
versions of the \*_clear_flush helpers call mmu_notifier_invalidate_range())
while holding the page table lock. That notification is not necessary in all
cases, however.

For a secondary TLB (a non-CPU TLB) such as an IOMMU TLB or a device TLB (when
the device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only two cases in
which you need to notify the secondary TLB while holding the page table lock
when clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by some completely different task.

Case B is more subtle. For correctness it requires the following sequence to
happen:

  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model such as C11 or C++11
for the device.
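
The case B sequence can be illustrated with a small user-space model. This is
only a sketch under stated assumptions: every ``model_*`` name below is
hypothetical and is not a kernel API, the page table lock is reduced to a
flag, and the secondary TLB is reduced to a single cached pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_pte {
    const int *page; /* page the CPU page table points to, NULL if cleared */
    bool locked;     /* stands in for the page table lock */
};

struct model_dev_tlb {
    const int *cached_page; /* translation cached by the device, NULL if none */
};

/* Device read: use the cached translation, refill it on a miss. */
static int model_dev_read(struct model_dev_tlb *tlb, const struct model_pte *pte)
{
    if (tlb->cached_page == NULL) {
        if (pte->page == NULL)
            return -1; /* entry is cleared: the device would fault here */
        tlb->cached_page = pte->page;
    }
    return *tlb->cached_page;
}

/* Models the "clear page table entry and notify" step. */
static void model_clear_flush_notify(struct model_pte *pte,
                                     struct model_dev_tlb *tlb)
{
    assert(pte->locked);     /* must run under the page table lock */
    pte->page = NULL;        /* clear the entry */
    tlb->cached_page = NULL; /* notify: invalidate the secondary TLB */
}

/* Case B sequence: take lock, clear and notify, then set the new page. */
static void model_cow_update(struct model_pte *pte, struct model_dev_tlb *tlb,
                             const int *new_page)
{
    pte->locked = true;
    model_clear_flush_notify(pte, tlb);
    pte->page = new_page;
    pte->locked = false;
}
```

In this model, a device read issued after ``model_cow_update()`` misses in the
cached translation and is refilled from the new page, so it can never return
data from the old page once the new entry is visible.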

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE; we
assume they are write protected for COW (the other cases of B apply too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees
the new value for addrB before seeing the new value for addrA. This breaks
total memory ordering for the device.

When changing a pte to write protect it, or to point to a new write-protected
page with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range() call to mmu_notifier_invalidate_range_end(),
outside the page table lock.
This is true even if the thread doing the page table update is preempted right
after releasing the page table lock but before calling
mmu_notifier_invalidate_range_end().
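
The addrA/addrB scenario described earlier can be replayed deterministically
in user space. The sketch below is an illustration only: every ``model_*``
name is hypothetical and is not a kernel API, each address is reduced to one
page table pointer plus one cached device translation, and the threads are
replayed as sequential steps in the order of the timeline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { OLD_VALUE = 0, NEW_VALUE = 1 };

/* Hypothetical model of one virtual address; not a kernel structure. */
struct model_mapping {
    const int *pte;     /* page the CPU page table currently points to */
    const int *dev_tlb; /* translation cached by the device, NULL if none */
};

/* Device read: use the cached translation, refill it on a miss. */
static int model_dev_read(struct model_mapping *m)
{
    assert(m->pte != NULL);
    if (m->dev_tlb == NULL)
        m->dev_tlb = m->pte;
    return *m->dev_tlb;
}

/* COW_step1: point the entry at the new page, optionally notifying. */
static void model_cow_step1(struct model_mapping *m, const int *new_page,
                            bool notify_under_lock)
{
    m->pte = new_page;
    if (notify_under_lock)
        m->dev_tlb = NULL; /* secondary TLB invalidated under the lock */
}

/* Stand-in for the invalidation performed at range_end time. */
static void model_range_end(struct model_mapping *m)
{
    m->dev_tlb = NULL;
}

/* Replay the timeline; report what the device reads at time N+5. */
static void model_replay(bool notify_under_lock, int *seen_a, int *seen_b)
{
    int old_a = OLD_VALUE, old_b = OLD_VALUE;
    int new_a = OLD_VALUE, new_b = OLD_VALUE; /* COW copies of the old pages */
    struct model_mapping a = { .pte = &old_a, .dev_tlb = NULL };
    struct model_mapping b = { .pte = &old_b, .dev_tlb = NULL };

    (void)model_dev_read(&a);                       /* N: populate device TLB */
    (void)model_dev_read(&b);
    model_cow_step1(&a, &new_a, notify_under_lock); /* N+2 */
    model_cow_step1(&b, &new_b, notify_under_lock);
    new_a = NEW_VALUE;                              /* N+3: CPU writes */
    new_b = NEW_VALUE;
    model_range_end(&b);                            /* N+4: only addrB */
    *seen_a = model_dev_read(&a);                   /* N+5 */
    *seen_b = model_dev_read(&b);
}
```

Calling ``model_replay(false, ...)`` reports the old value for addrA and the
new value for addrB, which is the ordering violation shown in the timeline.
Calling ``model_replay(true, ...)`` reports the new value for both addresses,
because the stale device translation is dropped before the new entry is set.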