.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.

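For illustration, a hypothetical ioctl handler that needs both the VM-wide
lock and a vCPU's mutex must take them in the order above (a minimal
sketch; the handler itself is made up, the locks are real)::

	static int kvm_example_op(struct kvm *kvm, struct kvm_vcpu *vcpu)
	{
		mutex_lock(&kvm->lock);    /* the outer lock comes first */
		mutex_lock(&vcpu->mutex);  /* vcpu->mutex nests inside kvm->lock */

		/* ... work on VM-wide and per-vCPU state ... */

		mutex_unlock(&vcpu->mutex);
		mutex_unlock(&kvm->lock);
		return 0;
	}
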
2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protection. That means we just need to change the W
   bit of the spte.

What we use to avoid all these races is the SPTE_HOST_WRITEABLE bit and
the SPTE_MMU_WRITEABLE bit on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on the host.
- SPTE_MMU_WRITEABLE means the gfn is writable in the mmu. The bit is set
  when the gfn is writable in the guest mmu and it is not write-protected
  by shadow page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1, or
restore the saved R/X bits if the VMX_EPT_TRACK_ACCESS mask is set, or
both. This is safe because any change to these bits is detected by the
cmpxchg.

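For illustration, the lockless update is essentially a read/modify/cmpxchg
sequence (a condensed sketch of the write-protection case; the retry label
and the surrounding checks of the real fast_page_fault() are omitted)::

	u64 old_spte = READ_ONCE(*sptep);
	u64 new_spte = old_spte | PT_WRITABLE_MASK;	/* set the W bit */

	/*
	 * If any bit changed after we read old_spte, cmpxchg fails and
	 * we must retry rather than install a value based on stale state.
	 */
	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
		goto retry;

	mark_page_dirty(vcpu->kvm, gfn);
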
But we need to check these cases carefully:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may change, since we can only ensure that the
pfn is not changed during the cmpxchg. This is an ABA problem; for
example, the following can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|	gpte = gfn1                                                      |
|	gfn1 is mapped to pfn1 on host                                   |
|	spte is the shadow page table entry corresponding with gpte and  |
|	spte = pfn1                                                      |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |    spte = 0;                      |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |    spte = pfn1;                   |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W))                             |
|	mark_page_dirty(vcpu->kvm, gfn1)                                 |
|            OOPS!!!                                                     |
+------------------------------------------------------------------------+

We dirty-log for gfn1; that means gfn2 is lost from the dirty bitmap.

For a direct sp, we can easily avoid this since the spte of a direct sp
is fixed to the gfn.  For an indirect sp, we disable fast page fault for
simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg.  After the pinning:

- We hold the refcount of the pfn, which means the pfn cannot be freed
  and reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different
  gfns by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

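A minimal sketch of that approach, assuming the pin happens right before
the atomic update (this is not what the code does today, since fast page
fault is simply disabled for indirect sps)::

	kvm_pfn_t pfn = kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn);

	if (!is_error_pfn(pfn)) {
		/* The refcount is held: pfn cannot be freed or reused. */
		if (cmpxchg64(sptep, old_spte, old_spte | PT_WRITABLE_MASK)
							== old_spte)
			mark_page_dirty(vcpu->kvm, gfn);
		kvm_release_pfn_clean(pfn);
	}
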
2) Dirty bit tracking

In the original code, the spte can be updated quickly (non-atomically) if
the spte is read-only and the Accessed bit has already been set, since
then neither the Accessed bit nor the Dirty bit can be lost.

But that is no longer true after fast page fault, since the spte can be
marked writable between reading and updating the spte, as in the
following case:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|	spte.W = 0                                                       |
|	spte.Accessed = 1                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|       old_spte.W == 0)             |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
|  ::                                |                                   |
|                                    |                                   |
|   else                             |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|   if (old_spte.Accessed == 1)      |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|   if (old_spte.Dirty == 1)         |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as
"volatile" if it can be updated out of mmu-lock (see
spte_has_volatile_bits()); this means the spte is always updated
atomically in this case.

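A condensed sketch of the resulting logic in mmu_spte_clear_track_bits()
(the helper names follow the real code; details are trimmed)::

	u64 old_spte = *sptep;

	if (!spte_has_volatile_bits(old_spte))
		/* Nothing can change under us: a plain write is fine. */
		__update_clear_spte_fast(sptep, 0ull);
	else
		/* The spte may change out of mmu-lock: use xchg. */
		old_spte = __update_clear_spte_slow(sptep, 0ull);

	if (is_accessed_spte(old_spte))
		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
	if (is_dirty_spte(old_spte))
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
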
3) Flushing TLBs due to spte updates

If an spte is updated from writable to read-only, we should flush all
TLBs; otherwise rmap_write_protect will find a read-only spte, even
though the writable spte might still be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock
on the fast page fault path. In order to easily audit the path, we check
in mmu_spte_update() whether TLBs need to be flushed for this reason,
since it is the common function for updating a spte (present -> present).

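A condensed sketch of that check in mmu_spte_update() (the helpers are
from the real code; the rest of the function is trimmed)::

	/*
	 * A stale writable TLB entry may exist if the spte could be made
	 * writable locklessly and is now losing the W bit; report that a
	 * remote TLB flush is needed.
	 */
	if (spte_can_locklessly_be_made_writable(old_spte) &&
	    !is_writable_pte(new_spte))
		flush = true;
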
Since the spte is "volatile" if it can be updated out of mmu-lock, we
always update the spte atomically, so the race caused by fast page fault
can be avoided. See the comments in spte_has_volatile_bits() and
mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT
A/D bits. In this case, when the KVM MMU notifier is called to track
accesses to a page (via kvm_mmu_notifier_clear_flush_young), it marks the
PTE as not-present by clearing the RWX bits in the PTE and storing the
original R & X bits in some unused/ignored bits. In addition, the
SPTE_SPECIAL_MASK is also set on the PTE (using the ignored bit 62). When
the VM tries to access the page later on, a fault is generated and the
fast page fault mechanism described above is used to atomically restore
the PTE to a Present state. The W bit is not saved when the PTE is marked
for access tracking, and during restoration to the Present state, the W
bit is set depending on whether or not it was a write access. If it
wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty bit tracking mechanism
described above.

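For illustration, marking an spte for access tracking looks roughly like
this (a condensed sketch of mark_spte_for_access_track(); the mask and
shift names approximate the real ones)::

	u64 mark_spte_for_access_track(u64 spte)
	{
		/* Save the R/X bits in unused/ignored high bits... */
		spte |= (spte & shadow_acc_track_saved_bits_mask) <<
			shadow_acc_track_saved_bits_shift;
		/* ...and clear RWX so the next access faults. */
		spte &= ~shadow_acc_track_mask;
		return spte;
	}
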
3. Reference
------------

:Name:		kvm_lock
:Type:		mutex
:Arch:		any
:Protects:	- vm_list

:Name:		kvm_count_lock
:Type:		raw_spinlock_t
:Arch:		any
:Protects:	- hardware virtualization enable/disable
:Comment:	'raw' because hardware enabling/disabling must be atomic with
		respect to migration.

:Name:		kvm_arch::tsc_write_lock
:Type:		raw_spinlock
:Arch:		x86
:Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
		- tsc offset in vmcb
:Comment:	'raw' because updating the tsc offsets must not be preempted.

:Name:		kvm->mmu_lock
:Type:		spinlock_t
:Arch:		any
:Protects:	- shadow page/shadow tlb entry
:Comment:	It is a spinlock since it is used in the mmu notifier.

:Name:		kvm->srcu
:Type:		srcu lock
:Arch:		any
:Protects:	- kvm->memslots
		- kvm->buses
:Comment:	The srcu read lock must be held while accessing memslots (e.g.
		when using gfn_to_* functions) and while accessing in-kernel
		MMIO/PIO address->device structure mappings (kvm->buses).
		The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
		if it is needed by multiple functions.

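For example, a reader that looks up a memslot wraps the access in a
standard SRCU read-side critical section (a minimal sketch)::

	int idx = srcu_read_lock(&kvm->srcu);

	/* gfn_to_memslot() and the other gfn_to_* helpers are safe here. */
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

	srcu_read_unlock(&kvm->srcu, idx);
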
:Name:		blocked_vcpu_on_cpu_lock
:Type:		spinlock_t
:Arch:		x86
:Protects:	blocked_vcpu_on_cpu
:Comment:	This is a per-CPU lock and it is used for VT-d
		posted-interrupts. When VT-d posted-interrupts are supported
		and the VM has assigned devices, we put the blocked vCPU on
		the list blocked_vcpu_on_cpu, protected by
		blocked_vcpu_on_cpu_lock. When VT-d hardware issues a wakeup
		notification event (because an external interrupt from an
		assigned device has arrived), we find the vCPU on the list
		and wake it up.