.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.

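For illustration, the ordering rules above translate into nesting like the
following sketch. The two helpers are hypothetical and written only to show
the documented lock order; they are not actual KVM code::

    /* Hypothetical: a VM-wide operation touching memslots and irq routing. */
    static void example_vm_wide_op(struct kvm *kvm)
    {
            mutex_lock(&kvm->lock);         /* outermost */
            mutex_lock(&kvm->slots_lock);   /* inside kvm->lock */
            mutex_lock(&kvm->irq_lock);     /* inside kvm->slots_lock */

            /* ... do the work ... */

            mutex_unlock(&kvm->irq_lock);
            mutex_unlock(&kvm->slots_lock);
            mutex_unlock(&kvm->lock);
    }

    /* Hypothetical: a VM-wide operation that also needs one vCPU. */
    static void example_per_vcpu_op(struct kvm *kvm, struct kvm_vcpu *vcpu)
    {
            mutex_lock(&kvm->lock);         /* taken outside vcpu->mutex */
            mutex_lock(&vcpu->mutex);

            /* ... do the work ... */

            mutex_unlock(&vcpu->mutex);
            mutex_unlock(&kvm->lock);
    }
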
2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault outside of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protection. That means we just need to change the W bit of the
   spte.

What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and the
SPTE_MMU_WRITEABLE bit on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on the host.
- SPTE_MMU_WRITEABLE means the gfn is writable in the MMU. The bit is set
  when the gfn is writable in the guest MMU and it is not write-protected
  by shadow page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1, or to
restore the saved R/X bits if the SPTE is marked for access tracking, or
both. This is safe because any change to these bits can be detected by the
cmpxchg.

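For the write-protection case, a minimal sketch of that lockless update is
shown below. It is only an illustration: the helper name is made up, and the
real logic, with many more checks, lives in fast_page_fault() in the x86 MMU
code::

    /*
     * Illustrative sketch: try to make a host- and mmu-writable spte
     * actually writable without holding mmu-lock.
     */
    static bool example_fast_set_writable(u64 *sptep, u64 old_spte)
    {
            u64 new_spte = old_spte;

            if (!(old_spte & SPTE_HOST_WRITEABLE) ||
                !(old_spte & SPTE_MMU_WRITEABLE))
                    return false;

            new_spte |= PT_WRITABLE_MASK;

            /*
             * cmpxchg fails if anything else changed the spte in the
             * meantime, so any concurrent modification of these bits is
             * detected and the fast path bails out.
             */
            return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
    }
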
But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may change, since we can only ensure that the
pfn is not changed during the cmpxchg. This is an ABA problem; for example,
the following can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|        gpte = gfn1                                                      |
|        gfn1 is mapped to pfn1 on host                                   |
|        spte is the shadow page table entry corresponding with gpte and  |
|        spte = pfn1                                                      |
+------------------------------------------------------------------------+
| On fast page fault path:                                                |
+--------------------------------------+---------------------------------+
| CPU 0:                               | CPU 1:                          |
+--------------------------------------+---------------------------------+
| ::                                   |                                 |
|                                      |                                 |
|   old_spte = *spte;                  |                                 |
+--------------------------------------+---------------------------------+
|                                      | pfn1 is swapped out::           |
|                                      |                                 |
|                                      |    spte = 0;                    |
|                                      |                                 |
|                                      | pfn1 is re-allocated for gfn2.  |
|                                      |                                 |
|                                      | gpte is changed to point to     |
|                                      | gfn2 by the guest::             |
|                                      |                                 |
|                                      |    spte = pfn1;                 |
+--------------------------------------+---------------------------------+
| ::                                                                      |
|                                                                         |
|   if (cmpxchg(spte, old_spte, old_spte+W)                               |
|      mark_page_dirty(vcpu->kvm, gfn1)                                   |
|           OOPS!!!                                                       |
+------------------------------------------------------------------------+

We dirty-log for gfn1, which means gfn2 is lost from the dirty bitmap.

For a direct sp, we can easily avoid it since the spte of a direct sp is
fixed to the gfn. For an indirect sp, we disable fast page fault for
simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:

- We hold a refcount on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different
  gfns by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

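A sketch of what that pinning could look like is shown below.
kvm_vcpu_gfn_to_pfn_atomic(), mark_page_dirty() and kvm_release_pfn_clean()
are existing KVM helpers; the wrapper itself is hypothetical and only
illustrates the ordering (pin, cmpxchg, dirty-log, unpin)::

    /* Hypothetical wrapper: pin the pfn across the lockless spte update. */
    static bool example_fast_fix_indirect(struct kvm_vcpu *vcpu, u64 *sptep,
                                          u64 old_spte, u64 new_spte, gfn_t gfn)
    {
            kvm_pfn_t pfn;
            bool ret = false;

            /*
             * Holding the reference keeps the pfn from being freed and
             * reused for another gfn; a writable pfn is also never merged
             * by KSM.
             */
            pfn = kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn);
            if (is_error_pfn(pfn))
                    return false;

            if (cmpxchg64(sptep, old_spte, new_spte) == old_spte) {
                    /* gfn still maps to the pinned pfn, so this is safe. */
                    mark_page_dirty(vcpu->kvm, gfn);
                    ret = true;
            }

            kvm_release_pfn_clean(pfn);
            return ret;
    }
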
2) Dirty bit tracking

In the original code, the spte can be updated quickly (non-atomically) if it
is read-only and the Accessed bit has already been set, since then neither
the Accessed bit nor the Dirty bit can be lost.

But this no longer holds with fast page fault, since the spte can be marked
writable between reading and updating it, as in the following case:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|        spte.W = 0                                                       |
|        spte.Accessed = 1                                                |
+--------------------------------------+---------------------------------+
| CPU 0:                               | CPU 1:                          |
+--------------------------------------+---------------------------------+
| In mmu_spte_clear_track_bits()::     |                                 |
|                                      |                                 |
|   old_spte = *spte;                  |                                 |
|                                      |                                 |
|                                      |                                 |
|   /* 'if' condition is satisfied. */ |                                 |
|   if (old_spte.Accessed == 1 &&      |                                 |
|       old_spte.W == 0)               |                                 |
|      spte = 0ull;                    |                                 |
+--------------------------------------+---------------------------------+
|                                      | on fast page fault path::       |
|                                      |                                 |
|                                      |    spte.W = 1                   |
|                                      |                                 |
|                                      | memory write on the spte::      |
|                                      |                                 |
|                                      |    spte.Dirty = 1               |
+--------------------------------------+---------------------------------+
| ::                                   |                                 |
|                                      |                                 |
|   else                               |                                 |
|      old_spte = xchg(spte, 0ull)     |                                 |
|   if (old_spte.Accessed == 1)        |                                 |
|      kvm_set_pfn_accessed(spte.pfn); |                                 |
|   if (old_spte.Dirty == 1)           |                                 |
|      kvm_set_pfn_dirty(spte.pfn);    |                                 |
|      OOPS!!!                         |                                 |
+--------------------------------------+---------------------------------+

The Dirty bit is lost in this case.

To avoid this kind of issue, we always treat the spte as "volatile" if it
can be updated outside of mmu-lock (see spte_has_volatile_bits()), which
means the spte is always updated atomically in this case.

3) Flushing TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might still be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable outside of mmu-lock
on the fast page fault path. In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since it
is a common function for updating the spte (present -> present).

Since the spte is "volatile" if it can be updated outside of mmu-lock, we
always update it atomically, so the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().

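Taken together, 2) and 3) shape how a present -> present update is written.
The sketch below is only an illustration of an mmu_spte_update()-style
helper: the function name is made up, several cases handled by the real code
are omitted, and the masks and helpers (shadow_accessed_mask,
shadow_dirty_mask, spte_to_pfn()) are assumed to behave as in the x86 MMU
code::

    /* Simplified illustration of a present -> present spte update. */
    static bool example_spte_update(u64 *sptep, u64 new_spte)
    {
            /*
             * The spte may be changed concurrently on the fast page fault
             * path, so read the old value back atomically; otherwise an
             * Accessed/Dirty bit set after we read it could be lost.
             */
            u64 old_spte = xchg(sptep, new_spte);
            bool flush = false;

            /* Propagate A/D bits that were set outside of mmu-lock. */
            if (old_spte & shadow_accessed_mask)
                    kvm_set_pfn_accessed(spte_to_pfn(old_spte));
            if (old_spte & shadow_dirty_mask)
                    kvm_set_pfn_dirty(spte_to_pfn(old_spte));

            /*
             * Going from writable to read-only: a stale writable
             * translation may still be cached in some CPU's TLB, so tell
             * the caller to flush.
             */
            if ((old_spte & PT_WRITABLE_MASK) && !(new_spte & PT_WRITABLE_MASK))
                    flush = true;

            return flush;
    }
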
Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on
the PTE (using the ignored bit 62). When the VM tries to access the page later
on, a fault is generated and the fast page fault mechanism described above is
used to atomically restore the PTE to a Present state. The W bit is not saved
when the PTE is marked for access tracking; during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.

3. Reference
------------

:Name:     kvm_lock
:Type:     mutex
:Arch:     any
:Protects: - vm_list

:Name:     kvm_count_lock
:Type:     raw_spinlock_t
:Arch:     any
:Protects: - hardware virtualization enable/disable
:Comment:  'raw' because hardware enabling/disabling must be atomic wrt
           migration.

:Name:     kvm_arch::tsc_write_lock
:Type:     raw_spinlock
:Arch:     x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment:  'raw' because updating the tsc offsets must not be preempted.

:Name:     kvm->mmu_lock
:Type:     spinlock_t
:Arch:     any
:Protects: - shadow page/shadow tlb entry
:Comment:  It is a spinlock since it is used in the MMU notifier.

:Name:     kvm->srcu
:Type:     srcu lock
:Arch:     any
:Protects: - kvm->memslots
           - kvm->buses
:Comment:  The srcu read lock must be held while accessing memslots (e.g.
           when using gfn_to_* functions) and while accessing in-kernel
           MMIO/PIO address->device structure mapping (kvm->buses).
           The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
           if it is needed by multiple functions.

:Name:     blocked_vcpu_on_cpu_lock
:Type:     spinlock_t
:Arch:     x86
:Protects: - blocked_vcpu_on_cpu
:Comment:  This is a per-CPU lock and it is used for VT-d posted-interrupts.
           When VT-d posted-interrupts are supported and the VM has assigned
           devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
           protected by blocked_vcpu_on_cpu_lock.  When VT-d hardware issues
           a wakeup notification event because external interrupts from the
           assigned devices arrive, we find the vCPU on the list and wake it
           up.

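The pattern described in that last comment is roughly the following (the
helpers below are illustrative only; the real code lives in the VMX
posted-interrupt support and differs in detail)::

    /* Illustrative only: per-CPU list of blocked vCPUs and its lock,
     * both initialized at hardware-setup time. */
    static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
    static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);

    static void example_add_blocked_vcpu(struct kvm_vcpu *vcpu, int cpu)
    {
            spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
            list_add_tail(&vcpu->blocked_vcpu_list,
                          &per_cpu(blocked_vcpu_on_cpu, cpu));
            spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
    }

    /* Runs on the CPU that receives the VT-d wakeup notification event. */
    static void example_wakeup_handler(void)
    {
            int cpu = smp_processor_id();
            struct kvm_vcpu *vcpu;

            spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
            list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
                                blocked_vcpu_list)
                    /* the real code wakes only the vCPU(s) that were hit */
                    kvm_vcpu_kick(vcpu);
            spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
    }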