.. SPDX-License-Identifier: GPL-2.0

===========================
The KVM halt polling system
===========================

The KVM halt polling system provides a feature within KVM whereby the latency
of a guest can, under some circumstances, be reduced by polling in the host
for some time period after the guest has elected to no longer run by ceding.
That is, when a guest vcpu has ceded, or in the case of powerpc when all of the
vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions
before giving up the cpu to the scheduler in order to let something else run.

Polling provides a latency advantage in cases where the guest can be run again
very quickly by at least saving us a trip through the scheduler, normally on
the order of a few microseconds, although performance benefits are workload
dependent. In the event that no wakeup source arrives during the polling
interval or some other task on the runqueue is runnable the scheduler is
invoked. Thus halt polling is especially useful on workloads with very short
wakeup periods where the time spent halt polling is minimised and the time
savings of not invoking the scheduler are distinguishable.
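
The poll-or-schedule decision described above can be pictured with a small
user-space model. This is only an illustrative sketch in Python, not the kernel
code: the function name block_vcpu and its arguments are hypothetical, and the
real host polls in a busy loop for wakeup conditions rather than knowing the
wakeup time in advance.

```python
def block_vcpu(wakeup_after_ns, poll_ns, other_task_runnable=False):
    """Model one halt of a vcpu and report how it ended.

    wakeup_after_ns: time until the wakeup source arrives (known up front
                     only in this model).
    poll_ns:         current halt polling interval for this vcpu.
    Returns "polled" if the wakeup arrived while polling, else "scheduled".
    """
    # Polling ceases at once if another task is runnable, and a zero
    # interval means there is no polling at all.
    if other_task_runnable or poll_ns == 0:
        return "scheduled"
    # The wakeup arrived within the polling interval: no scheduler trip.
    if wakeup_after_ns <= poll_ns:
        return "polled"
    # The polling interval expired without a wakeup: give up the cpu.
    return "scheduled"
```

In this model a wakeup that arrives after 5000 ns with a 10000 ns polling
interval is caught by polling, while the same wakeup with another runnable task
on the cpu goes through the scheduler.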

The generic halt polling code is implemented in:

    virt/kvm/kvm_main.c: kvm_vcpu_block()

The powerpc kvm-hv specific case is implemented in:

    arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()

Halt Polling Interval
=====================

The maximum time for which to poll before invoking the scheduler, referred to
as the halt polling interval, is increased and decreased based on the perceived
effectiveness of the polling in an attempt to limit pointless polling.
This value is stored in either the vcpu struct:

    kvm_vcpu->halt_poll_ns

or in the case of powerpc kvm-hv, in the vcore struct:

    kvmppc_vcore->halt_poll_ns

Thus this is a per vcpu (or vcore) value.

During polling if a wakeup source is received within the halt polling interval,
the interval is left unchanged. In the event that a wakeup source isn't
received during the polling interval (and thus schedule is invoked) there are
two options, either the polling interval and total block time[0] were less than
the global max polling interval (see module params below), or the total block
time was greater than the global max polling interval.

In the event that both the polling interval and total block time were less than
the global max polling interval then the polling interval can be increased in
the hope that next time during the longer polling interval the wakeup source
will be received while the host is polling and the latency benefits will be
received. The polling interval is grown in the function grow_halt_poll_ns(): it
is multiplied by the module parameter halt_poll_ns_grow, and when growing from
zero it is first set to the module parameter halt_poll_ns_grow_start.

In the event that the total block time was greater than the global max polling
interval then the host will never poll for long enough (limited by the global
max) to wake up during the polling interval so it may as well be shrunk in order
to avoid pointless polling. The polling interval is shrunk in the function
shrink_halt_poll_ns() and is divided by the module parameter
halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0.

It is worth noting that this adjustment process attempts to home in on some
steady state polling interval but will only really do a good job for wakeups
which come at an approximately constant rate, otherwise there will be constant
adjustment of the polling interval.

[0] total block time:
		      the time between when the halt polling function is
		      invoked and a wakeup source received (irrespective of
		      whether the scheduler is invoked within that function).
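
The adjustment rules of this section can be sketched as a user-space model.
This is a simplified illustration in Python under stated assumptions, not the
kernel implementation: the helper names mirror the kernel functions named
above, but adjust_halt_poll_ns and all argument names are hypothetical. The
defaults for grow, grow_start and shrink are those listed in the module
parameter table, and capping the grown value at halt_poll_ns follows the
table's description of that parameter as the ceiling.

```python
def grow_halt_poll_ns(poll_ns, halt_poll_ns, grow=2, grow_start=10_000):
    """Multiply the interval by grow, starting from grow_start."""
    new = poll_ns * grow
    if new < grow_start:
        new = grow_start            # initial value when growing from zero
    return min(new, halt_poll_ns)   # the global max is the ceiling


def shrink_halt_poll_ns(poll_ns, shrink=0):
    """Divide the interval by shrink, or reset it to 0 if shrink == 0."""
    if shrink == 0:
        return 0
    return poll_ns // shrink


def adjust_halt_poll_ns(poll_ns, block_ns, halt_poll_ns,
                        grow=2, grow_start=10_000, shrink=0):
    """Return the next polling interval after one block of block_ns."""
    if block_ns <= poll_ns:
        return poll_ns              # wakeup arrived while polling: unchanged
    if block_ns > halt_poll_ns:
        # Polling could never have been long enough: shrink.
        return shrink_halt_poll_ns(poll_ns, shrink)
    if poll_ns < halt_poll_ns:
        # A longer interval might catch the next wakeup: grow.
        return grow_halt_poll_ns(poll_ns, halt_poll_ns, grow, grow_start)
    return poll_ns
```

For example, with a hypothetical global max of 200000 ns and a guest that
always blocks for 50000 ns, the modelled interval grows from 0 to 10000, 20000,
40000 and 80000 ns and then stays at 80000 ns, because every later wakeup
arrives while the host is polling. A single block of 300000 ns then resets it
to 0 with the default halt_poll_ns_shrink of 0.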

Module Parameters
=================

The kvm module has 4 tunable module parameters to adjust the global max
polling interval as well as the rate at which the polling interval is grown and
shrunk. These variables are defined in include/linux/kvm_host.h and as module
parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the
powerpc kvm-hv case.

+-----------------------+---------------------------+-------------------------+
|Module Parameter       | Description               | Default Value           |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns           | The global max polling    | KVM_HALT_POLL_NS_DEFAULT|
|                       | interval which defines    | (per arch value)        |
|                       | the ceiling value of the  |                         |
|                       | polling interval for      |                         |
|                       | each vcpu.                |                         |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns_grow      | The value by which the    | 2                       |
|                       | halt polling interval is  |                         |
|                       | multiplied in the         |                         |
|                       | grow_halt_poll_ns()       |                         |
|                       | function.                 |                         |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns_grow_start| The initial value to grow | 10000                   |
|                       | to from zero in the       |                         |
|                       | grow_halt_poll_ns()       |                         |
|                       | function.                 |                         |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns_shrink    | The value by which the    | 0                       |
|                       | halt polling interval is  |                         |
|                       | divided in the            |                         |
|                       | shrink_halt_poll_ns()     |                         |
|                       | function.                 |                         |
+-----------------------+---------------------------+-------------------------+

These module parameters can be set from the sysfs files in:

    /sys/module/kvm/parameters/

Note: these module parameters are system-wide values and cannot be tuned on a
      per vm basis.

Further Notes
=============

- Care should be taken when setting the halt_poll_ns module parameter as a large value
  has the potential to drive the cpu usage to 100% on a machine which would be almost
  entirely idle otherwise. This is because even if a guest has wakeups during which very
  little work is done and which are quite far apart, if the period is shorter than the
  global max polling interval (halt_poll_ns) then the host will always poll for the
  entire block time and thus cpu utilisation will go to 100%.

- Halt polling essentially presents a trade-off between power usage and latency and
  the module parameters should be used to tune the balance between the two. Idle cpu
  time is essentially converted to host kernel time with the aim of decreasing latency
  when entering the guest.

- Halt polling will only be conducted by the host when no other tasks are runnable on
  that cpu, otherwise the polling will cease immediately and schedule will be invoked to
  allow that other task to run. Thus this doesn't allow a guest to cause a denial of
  service of the cpu.
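
Before tuning, the current values of the module parameters described above can
be read back from user space. The following is a minimal sketch in Python; the
function name read_halt_poll_params and the PARAMS tuple are hypothetical
helpers, and the default directory is the one given in the Module Parameters
section. Parameters that are missing (for example when the kvm module is not
loaded) are simply skipped.

```python
from pathlib import Path

# Parameter names as listed in the module parameter table.
PARAMS = (
    "halt_poll_ns",
    "halt_poll_ns_grow",
    "halt_poll_ns_grow_start",
    "halt_poll_ns_shrink",
)


def read_halt_poll_params(base="/sys/module/kvm/parameters"):
    """Return {name: value} for each halt polling parameter found in base."""
    values = {}
    for name in PARAMS:
        path = Path(base) / name
        if path.is_file():          # skip parameters that are not exposed
            values[name] = int(path.read_text().strip())
    return values
```

Calling read_halt_poll_params() on a host with the kvm module loaded returns a
dictionary of the current system-wide values. Writing a new value to one of
these files changes it for every vm on the host and normally requires root.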