.. SPDX-License-Identifier: GPL-2.0

===========================
The KVM halt polling system
===========================

The KVM halt polling system provides a feature within KVM whereby the latency
of a guest can, under some circumstances, be reduced by polling in the host
for some time period after the guest has elected to no longer run by ceding.
That is, when a guest vcpu has ceded, or in the case of powerpc when all of the
vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions
before giving up the cpu to the scheduler in order to let something else run.

Polling provides a latency advantage in cases where the guest can be run again
very quickly, by at least saving a trip through the scheduler, normally on the
order of a few microseconds, although performance benefits are workload
dependent. In the event that no wakeup source arrives during the polling
interval, or some other task on the runqueue is runnable, the scheduler is
invoked. Thus halt polling is especially useful on workloads with very short
wakeup periods, where the time spent halt polling is minimised and the time
saved by not invoking the scheduler is measurable.
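
As an illustration only, the poll-then-schedule behaviour described above can
be modelled in user space. This is a minimal sketch under stated assumptions,
not the kernel's kvm_vcpu_block(): halt_poll(), enum poll_result and the
example callbacks are hypothetical names, the wakeup check and the "is another
task runnable" check are passed in as callbacks, and CLOCK_MONOTONIC stands in
for the kernel's clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical outcome of one halt poll: the wakeup arrived while polling,
 * or the poller gave up and would hand the cpu to the scheduler. */
enum poll_result { POLL_WOKEN, POLL_SCHEDULE };

typedef bool (*check_fn)(void *ctx);

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Spin for up to interval_ns (checking at least once) for a wakeup.
 * Polling stops early if another task becomes runnable, so a halted
 * guest cannot keep that task off the cpu. */
static enum poll_result halt_poll(check_fn wakeup_pending,
				  check_fn other_task_runnable,
				  void *ctx, uint64_t interval_ns)
{
	uint64_t stop = now_ns() + interval_ns;

	do {
		if (wakeup_pending(ctx))
			return POLL_WOKEN;	/* scheduler trip avoided */
		if (other_task_runnable(ctx))
			break;			/* let the other task run */
	} while (now_ns() < stop);

	return POLL_SCHEDULE;
}

/* Example callbacks: a wakeup that "arrives" after *ctx checks, and
 * constant answers for the runnable check. */
static bool countdown_wakeup(void *ctx)
{
	int *remaining = ctx;

	return --*remaining <= 0;
}

static bool never(void *ctx) { (void)ctx; return false; }
static bool always(void *ctx) { (void)ctx; return true; }
```

For example, with n set to 3, halt_poll(countdown_wakeup, never, &n,
1000000000) returns POLL_WOKEN after three checks, whereas passing always as
the runnable check makes it return POLL_SCHEDULE on the first iteration.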

The generic halt polling code is implemented in:

	virt/kvm/kvm_main.c: kvm_vcpu_block()

The powerpc kvm-hv specific case is implemented in:

	arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()

Halt Polling Interval
=====================

The maximum time for which to poll before invoking the scheduler, referred to
as the halt polling interval, is increased and decreased based on the perceived
effectiveness of the polling in an attempt to limit pointless polling.
This value is stored either in the vcpu struct:

	kvm_vcpu->halt_poll_ns

or, in the case of powerpc kvm-hv, in the vcore struct:

	kvmppc_vcore->halt_poll_ns

It is therefore a per-vcpu (or per-vcore) value.

During polling, if a wakeup source is received within the halt polling
interval, the interval is left unchanged. If a wakeup source isn't received
during the polling interval (and thus the scheduler is invoked), there are two
cases: either the polling interval and total block time[0] were both less than
the global max polling interval (see Module Parameters below), or the total
block time was greater than the global max polling interval.
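
The three outcomes above can be summarised as a small decision function. This
is a sketch: classify_block() and enum poll_adjust are hypothetical names, all
times are in nanoseconds, and the case where the total block time exactly
equals the global max, which the text does not cover, is treated here as
"leave unchanged".

```c
#include <stdint.h>

/* Hypothetical labels for what happens to a vcpu's polling interval. */
enum poll_adjust { HP_KEEP, HP_GROW, HP_SHRINK };

/*
 * interval_ns: the vcpu's current halt polling interval
 * block_ns:    total block time, from entering the halt path to the wakeup
 * max_ns:      the global max polling interval (halt_poll_ns)
 */
static enum poll_adjust classify_block(uint64_t interval_ns, uint64_t block_ns,
				       uint64_t max_ns)
{
	if (block_ns <= interval_ns)
		return HP_KEEP;		/* woken while still polling */
	if (block_ns > max_ns)
		return HP_SHRINK;	/* even max_ns of polling would miss it */
	if (interval_ns < max_ns && block_ns < max_ns)
		return HP_GROW;		/* a longer poll might catch it */
	return HP_KEEP;			/* boundary case: block_ns == max_ns */
}
```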

In the event that both the polling interval and total block time were less than
the global max polling interval, the polling interval can be increased in the
hope that, next time, the wakeup source will arrive during the longer polling
interval while the host is still polling, and the latency benefit will be
realised. The polling interval is grown in the function grow_halt_poll_ns(): it
is multiplied by the module parameter halt_poll_ns_grow, and when growing from
zero it starts at halt_poll_ns_grow_start.
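
A minimal sketch of the growth step, assuming the parameters are passed in
explicitly rather than read from the kvm module: grow_interval() is a
hypothetical name, not the kernel's grow_halt_poll_ns(). Treating a grow factor
of 0 as "leave the interval unchanged", and applying halt_poll_ns_grow_start as
a floor after multiplying, are assumptions of this sketch; clamping the result
to the global max follows from halt_poll_ns being the ceiling on every vcpu's
polling interval.

```c
#include <stdint.h>

/*
 * Grow a polling interval after a wakeup that arrived too late to be
 * caught, but early enough that a longer poll (up to max_ns) could have.
 */
static uint64_t grow_interval(uint64_t interval_ns, unsigned int grow,
			      uint64_t grow_start_ns, uint64_t max_ns)
{
	uint64_t val;

	if (grow == 0)
		return interval_ns;	/* assumption: 0 disables growth */

	val = interval_ns * grow;	/* multiply by halt_poll_ns_grow */
	if (val < grow_start_ns)
		val = grow_start_ns;	/* growing from 0 starts here */
	if (val > max_ns)
		val = max_ns;		/* halt_poll_ns is the ceiling */
	return val;
}
```

With the default halt_poll_ns_grow of 2 and halt_poll_ns_grow_start of 10000,
repeated growth from zero gives 10000, 20000, 40000, ... ns until the global
max is reached.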

In the event that the total block time was greater than the global max polling
interval, the host will never poll for long enough (limited by the global max)
to catch the wakeup during the polling interval, so the interval may as well be
shrunk in order to avoid pointless polling. The polling interval is shrunk in
the function shrink_halt_poll_ns(): it is divided by the module parameter
halt_poll_ns_shrink, or set to 0 if halt_poll_ns_shrink is 0.
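
The shrink rule is simple enough to state directly in code. In this sketch
shrink_interval() is a hypothetical stand-in for shrink_halt_poll_ns(), and the
interval is passed in and returned rather than stored in a vcpu struct.

```c
#include <stdint.h>

/* Shrink a polling interval after a block longer than the global max. */
static uint64_t shrink_interval(uint64_t interval_ns, unsigned int shrink)
{
	if (shrink == 0)
		return 0;		/* default: stop polling altogether */
	return interval_ns / shrink;	/* divide by halt_poll_ns_shrink */
}
```

With the default halt_poll_ns_shrink of 0, a single block longer than the
global max drops the vcpu straight back to not polling at all, and the interval
has to grow again from halt_poll_ns_grow_start.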

It is worth noting that this adjustment process attempts to home in on some
steady-state polling interval, but will only really do a good job for wakeups
which come at an approximately constant rate; otherwise there will be constant
adjustment of the polling interval.
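
The effect can be seen by feeding a sequence of total block times through the
adjustment rules. The sketch below is a self-contained model, not kernel code:
update_interval(), simulate() and the hp_* tunables are hypothetical, the grow
and shrink values are the defaults from the module parameter table, and a
ceiling of 200000 ns is an arbitrary example because the real halt_poll_ns
default is a per-arch value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunables: table defaults plus an example ceiling. */
static unsigned int hp_grow = 2;
static uint64_t hp_grow_start_ns = 10000;
static unsigned int hp_shrink = 0;
static uint64_t hp_max_ns = 200000;

/* Apply one adjustment after a halt whose total block time was block_ns. */
static uint64_t update_interval(uint64_t interval, uint64_t block_ns)
{
	uint64_t val;

	if (block_ns <= interval)
		return interval;	/* woken while polling: keep */
	if (block_ns > hp_max_ns)	/* polling can never catch it: shrink */
		return hp_shrink ? interval / hp_shrink : 0;
	if (block_ns == hp_max_ns)
		return interval;	/* boundary case not covered by the text */

	val = interval * hp_grow;	/* a longer poll may catch it: grow */
	if (val < hp_grow_start_ns)
		val = hp_grow_start_ns;
	return val > hp_max_ns ? hp_max_ns : val;
}

/* Run a sequence of block times through the adaptation, recording the
 * polling interval after each step in out[]. */
static void simulate(const uint64_t *blocks, size_t n, uint64_t *out)
{
	uint64_t interval = 0;

	for (size_t i = 0; i < n; i++) {
		interval = update_interval(interval, blocks[i]);
		out[i] = interval;
	}
}
```

With a constant block time of 30000 ns the interval settles at 40000 ns after
three steps (10000, 20000, 40000) and then stays there, because every later
wakeup arrives while the host is polling. Alternating block times of 30000 ns
and 500000 ns instead make it oscillate between 10000 and 0, so the short
wakeups are never caught.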

[0] total block time:
		      the time between when the halt polling function is
		      invoked and when a wakeup source is received (irrespective
		      of whether the scheduler is invoked within that function).

Module Parameters
=================

The kvm module has four tunable module parameters to adjust the global max
polling interval, as well as the rate at which the polling interval is grown
and shrunk. These variables are defined in include/linux/kvm_host.h and as
module parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in
the powerpc kvm-hv case.

+-------------------------+---------------------------+--------------------------+
| Module Parameter        | Description               | Default Value            |
+=========================+===========================+==========================+
| halt_poll_ns            | The global max polling    | KVM_HALT_POLL_NS_DEFAULT |
|                         | interval, which defines   | (per-arch value, in ns)  |
|                         | the ceiling value of the  |                          |
|                         | polling interval for      |                          |
|                         | each vcpu.                |                          |
+-------------------------+---------------------------+--------------------------+
| halt_poll_ns_grow       | The value by which the    | 2                        |
|                         | halt polling interval is  |                          |
|                         | multiplied in             |                          |
|                         | grow_halt_poll_ns().      |                          |
+-------------------------+---------------------------+--------------------------+
| halt_poll_ns_grow_start | The initial value to grow | 10000 (ns)               |
|                         | to from zero in           |                          |
|                         | grow_halt_poll_ns().      |                          |
+-------------------------+---------------------------+--------------------------+
| halt_poll_ns_shrink     | The value by which the    | 0                        |
|                         | halt polling interval is  |                          |
|                         | divided in                |                          |
|                         | shrink_halt_poll_ns().    |                          |
+-------------------------+---------------------------+--------------------------+

These module parameters can be set through the sysfs files in:

	/sys/module/kvm/parameters/

Note: these module parameters are system-wide values and cannot be tuned on a
per-VM basis.
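
Because these are ordinary sysfs files, they can be read and written with
plain file I/O (cat and echo from a root shell work equally well). The helpers
below are a hypothetical sketch: read_kvm_param() and write_kvm_param() are not
part of any kernel or library API, the directory is an argument so that the
helpers can be exercised outside /sys, and writing under
/sys/module/kvm/parameters/ normally requires root and affects every VM on the
host.

```c
#include <stdio.h>

/* Build "<dir>/<name>", failing rather than truncating the path. */
static int param_path(char *buf, size_t len, const char *dir, const char *name)
{
	int n = snprintf(buf, len, "%s/%s", dir, name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read a numeric parameter such as halt_poll_ns. Returns 0 on success. */
static int read_kvm_param(const char *dir, const char *name, unsigned long *val)
{
	char path[256];
	FILE *f;
	int ok;

	if (param_path(path, sizeof(path), dir, name))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%lu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Write a numeric parameter. stdio buffers the data, so an error from the
 * kernel rejecting the value is reported when the file is closed. */
static int write_kvm_param(const char *dir, const char *name, unsigned long val)
{
	char path[256];
	FILE *f;
	int ok;

	if (param_path(path, sizeof(path), dir, name))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", val) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

On a KVM host, read_kvm_param("/sys/module/kvm/parameters", "halt_poll_ns",
&v) reads the current global max polling interval into v.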

Further Notes
=============

- Care should be taken when setting the halt_poll_ns module parameter, as a
  large value has the potential to drive the cpu usage to 100% on a machine
  which would otherwise be almost entirely idle. This is because even if a
  guest's wakeups are quite far apart and very little work is done during
  them, if the period between them is shorter than the global max polling
  interval (halt_poll_ns), the host will always poll for the entire block time
  and cpu utilisation will thus go to 100%.

- Halt polling essentially presents a trade-off between power usage and
  latency, and the module parameters should be used to tune the balance
  between the two. Idle cpu time is essentially converted to host kernel time
  with the aim of decreasing latency when entering the guest.

- Halt polling will only be conducted by the host when no other tasks are
  runnable on that cpu; otherwise the polling will cease immediately and the
  scheduler will be invoked to allow that other task to run. Thus halt polling
  does not allow a guest to mount a denial-of-service attack on the cpu.
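
The 100% figure in the first note can be made concrete with an idealised
steady-state model. This is an assumption-laden sketch, not a measurement:
steady_host_busy() is hypothetical, it assumes a guest that wakes every
period_ns and works for work_ns before halting again, it ignores the transient
while the interval is still growing or shrinking, and it treats an idle gap
exactly equal to halt_poll_ns as not polled.

```c
#include <stdint.h>

/*
 * Idealised fraction of a host cpu kept busy by one vcpu that wakes every
 * period_ns, runs for work_ns, then halts until the next wakeup.
 * max_ns is the global max polling interval (halt_poll_ns).
 */
static double steady_host_busy(uint64_t work_ns, uint64_t period_ns,
			       uint64_t max_ns)
{
	uint64_t idle_ns;

	if (work_ns >= period_ns)
		return 1.0;		/* the guest never halts */

	idle_ns = period_ns - work_ns;
	if (idle_ns < max_ns)
		return 1.0;		/* interval grows to cover the whole gap */

	/* gap too long: the interval shrinks to 0, so only guest work counts */
	return (double)work_ns / (double)period_ns;
}
```

Under this model, with halt_poll_ns at 200000 ns, a guest doing 10 us of work
every 100 us keeps its cpu 100% busy, while the same work every 1 ms costs only
about 1%.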