=================
KVM VCPU Requests
=================

Overview
========

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity.  For example, a thread may request a VCPU to flush
its TLB with a VCPU request.  The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request.  This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
into it.

VCPU Kicks
----------

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance.  To do so, an IPI is sent, forcing
a guest mode exit.  However, a VCPU thread may not be in guest mode at the
time of the kick.  Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take.  All three actions
are listed below:

1) Send an IPI.  This forces a guest mode exit.
2) Wake a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues.  Waking them removes the threads from
   the waitqueues, allowing the threads to run again.  This behavior
   may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
3) Do nothing.  When the VCPU is not in guest mode and the VCPU thread is
   not sleeping, then there is nothing to do.

VCPU Mode
---------

VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
guest is running in guest mode or not, as well as some specific
outside guest mode states.  The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements").  The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_, could
also be used, e.g. ::

  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction.  The first 8 bits are reserved for architecture
independent requests; all additional bits are available for architecture
dependent requests.

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request.

KVM_REQ_MMU_RELOAD

  When shadow page tables are used and memory slots are removed it's
  necessary to inform each VCPU to completely refresh the tables.  This
  request is used for that.

KVM_REQ_PENDING_TIMER

  This request may be made from a timer handler run on the host on behalf
  of a VCPU.  It informs the VCPU thread to inject a timer interrupt.

KVM_REQ_UNHALT

  This request may be made from the KVM common function kvm_vcpu_block(),
  which is used to emulate an instruction that causes a CPU to halt until
  one of an architecture-specific set of events and/or interrupts is
  received (determined by checking kvm_arch_vcpu_runnable()).  When that
  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
  in contrast to when kvm_vcpu_block() returns due to any other reason,
  such as a pending signal, which does not indicate the VCPU's halt
  emulation should stop, and therefore does not make the request.

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops.  This is because only the lower 8 bits are used to represent the
request's number.  The upper bits are used as flags.  Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode.  That is, sleeping VCPUs do not need
  to be awakened for these requests.  Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  then the caller will wait for each VCPU to acknowledge its IPI before
  proceeding.  This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait.  This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP.  See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.

VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request.  This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit.  Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it.  See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.

Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request.  We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode.  This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI.  One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU.  With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check.  This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts.  This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.

IPI Reduction
-------------

As only one IPI is needed to get a VCPU to check for any/all requests,
they may be coalesced.  This is easily done by having the first
IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE.
The transitional state, EXITING_GUEST_MODE, is used for this purpose.

Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE.  For example, one case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts.  To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.

Request-less VCPU Kicks
-----------------------

Because the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, it's clear that request-less
VCPU kicks are almost never correct.  Without the assurance that a non-IPI
generating kick will still result in an action by the receiving VCPU, as
the final kvm_request_pending() check does for request-accompanying kicks,
the kick may not do anything useful at all.  If, for instance, a
request-less kick was made to a VCPU that was just about to set its mode
to IN_GUEST_MODE, meaning no IPI is sent, then the VCPU thread may
continue its entry without actually having done whatever it was the kick
was meant to initiate.

One exception is x86's posted interrupt mechanism.  In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken.  One reason to do so is to provide
architectures a function where requests may be checked if necessary.

Clearing Requests
-----------------

Generally it only makes sense for the receiving VCPU thread to clear a
request.  However, when the requesting thread and the receiving VCPU
thread execute serially, such as when they are the same thread, or when
they use some form of concurrency control to temporarily execute
synchronously, then it's possible to know that the request may be cleared
immediately, rather than waiting for the receiving VCPU thread to handle
the request in VCPU RUN.  The only current examples of this are
kvm_vcpu_block() calls made by VCPUs to block themselves.  A possible
side-effect of that call is to make the KVM_REQ_UNHALT request, which may
then be cleared immediately when the VCPU returns from the call.

References
==========

.. [atomic-ops] Documentation/core-api/atomic_ops.rst
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/