=================
KVM VCPU Requests
=================

Overview
========

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity. For example, a thread may request a VCPU to flush
its TLB with a VCPU request. The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm.
   */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request. This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
into it.

VCPU Kicks
----------

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance. To do so, an IPI is sent, forcing
a guest mode exit. However, a VCPU thread may not be in guest mode at the
time of the kick. Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take. All three actions
are listed below:

1) Send an IPI. This forces a guest mode exit.
2) Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues. Waking them removes the threads from
   the waitqueues, allowing the threads to run again. This behavior
   may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
3) Nothing. When the VCPU is not in guest mode and the VCPU thread is not
   sleeping, then there is nothing to do.

VCPU Mode
---------

VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
guest is running in guest mode or not, as well as some specific
outside guest mode states. The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements"). The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_ could
also be used, e.g.
::

  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction. The first 8 bits are reserved for architecture
independent requests; all additional bits are available for architecture
dependent requests.

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request.

KVM_REQ_MMU_RELOAD

  When shadow page tables are used and memory slots are removed it's
  necessary to inform each VCPU to completely refresh the tables. This
  request is used for that.

KVM_REQ_PENDING_TIMER

  This request may be made from a timer handler run on the host on behalf
  of a VCPU. It informs the VCPU thread to inject a timer interrupt.

KVM_REQ_UNHALT

  This request may be made from the KVM common function kvm_vcpu_block(),
  which is used to emulate an instruction that causes a CPU to halt until
  one of an architecture-specific set of events and/or interrupts is
  received (determined by checking kvm_arch_vcpu_runnable()). When that
  event or interrupt arrives kvm_vcpu_block() makes the request. This is
  in contrast to when kvm_vcpu_block() returns due to any other reason,
  such as a pending signal, which does not indicate the VCPU's halt
  emulation should stop, and therefore does not make the request.

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops. This is because only the lower 8 bits are used to represent the
request's number. The upper bits are used as flags. Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode. That is, sleeping VCPUs do not need
  to be awakened for these requests. Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  then the caller will wait for each VCPU to acknowledge its IPI before
  proceeding. This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait. This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.

VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request. This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit. Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it. See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.

Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request. We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode. This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI. One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU.
With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check. This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts. This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.

IPI Reduction
-------------

As only one IPI is needed to get a VCPU to check for any/all requests,
then they may be coalesced.
This is easily done by having the first IPI
sending kick also change the VCPU mode to something !IN_GUEST_MODE. The
transitional state, EXITING_GUEST_MODE, is used for this purpose.

Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts. To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.

Request-less VCPU Kicks
-----------------------

As the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, then it's clear that
request-less VCPU kicks are almost never correct. Without the assurance
that a non-IPI generating kick will still result in an action by the
receiving VCPU, as the final kvm_request_pending() check does for
request-accompanying kicks, then the kick may not do anything useful at
all.
If, for instance, a request-less kick was made to a VCPU that was
just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
the VCPU thread may continue its entry without actually having done
whatever it was the kick was meant to initiate.

One exception is x86's posted interrupt mechanism. In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken. One reason to do so is to provide
architectures a function where requests may be checked if necessary.

Clearing Requests
-----------------

Generally it only makes sense for the receiving VCPU thread to clear a
request. However, in some circumstances, such as when the requesting
thread and the receiving VCPU thread are executed serially (e.g. when
they are the same thread, or when they use some form of concurrency
control to temporarily execute synchronously), it's possible to know
that the request may be cleared immediately, rather than waiting for the
receiving VCPU thread to handle the request in VCPU RUN. The only current
examples of this are kvm_vcpu_block() calls made by VCPUs to block
themselves. A possible side-effect of that call is to make the
KVM_REQ_UNHALT request, which may then be cleared immediately when the
VCPU returns from the call.

References
==========

.. [atomic-ops] Documentation/core-api/atomic_ops.rst
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/