==============================
Running nested guests with KVM
==============================

Nesting is the ability to run a guest inside another guest (which
can be KVM-based or use a different hypervisor).  The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (Logical PARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc.) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.19, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
Virtualization.  Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest.  Alternatively, for better live migration compatibility, use a
named CPU model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest
with accelerated KVM.


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter, i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature.  With QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems but may fail once guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while
  they think they're running nested KVM.  This confuses "nested Virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).
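
The "KVM on KVM" check above can be partly scripted.  The following is a
minimal sketch, not a definitive test: the helper name ``nested_state`` is
hypothetical, and the ``kvm_amd`` and ``kvm`` sysfs paths are assumed to
mirror the ``kvm_intel`` path shown earlier.  Note that a present
``/dev/kvm`` in L1 only shows that KVM is available there; it does not
prove that a given L2 guest was actually started with KVM acceleration.

```shell
#!/bin/sh
# Sketch: report whether KVM nesting looks enabled, and whether /dev/kvm
# exists on the machine where this script runs.

# Print "enabled", "disabled" or "unknown" for one "nested" parameter file.
# (Hypothetical helper; the file path is passed in by the caller.)
nested_state() {
    param_file=$1
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 0
    fi
    case "$(cat "$param_file")" in
        Y|y|1) echo "enabled" ;;
        N|n|0) echo "disabled" ;;
        *)     echo "unknown" ;;
    esac
}

# On L0: look at the "nested" parameter of each candidate KVM module.
# Modules that are not loaded simply report "unknown".
for module in kvm_intel kvm_amd kvm; do
    echo "$module nested: $(nested_state "/sys/module/$module/parameters/nested")"
done

# On L1: without the /dev/kvm device node, KVM acceleration is not
# available to guests started from this machine.
if [ -c /dev/kvm ]; then
    echo "/dev/kvm: present"
else
    echo "/dev/kvm: missing"
fi
```

Running it once on L0 and once inside L1, and attaching both outputs to
the bug report, makes it easier to tell which kind of nesting is in use.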

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0
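
Gathering the generic items by hand on each level is error-prone, so a
small script may help.  The following is a sketch under stated
assumptions: the function names ``save_output`` and
``collect_nested_info`` and the output layout are hypothetical, the QEMU
binary is assumed to be called ``qemu-kvm`` as in the examples above, and
``virsh --version`` is used here as one way to obtain the libvirt
version.  Architecture-specific items (``x86info -a``, ``dmidecode``,
``/proc/sysinfo``) and the QEMU command-lines are left out and would
still need to be added manually.

```shell
#!/bin/sh
# Sketch: save the generic bug-report information for one level (L0 or L1)
# into a directory, one text file per item.

# Run a command and store its output in "<dir>/<name>.txt".  A missing
# command or a failing command is noted in the file instead of aborting,
# so that one unavailable tool does not stop the whole collection.
save_output() {
    so_dir=$1
    so_name=$2
    shift 2
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$so_dir/$so_name.txt" 2>&1 ||
            echo "(command failed: $*)" >> "$so_dir/$so_name.txt"
    else
        echo "(command not found: $1)" > "$so_dir/$so_name.txt"
    fi
}

# Usage: collect_nested_info LEVEL BASE_DIR
# LEVEL is a label such as L0 or L1; files end up in BASE_DIR/LEVEL/.
collect_nested_info() {
    level=$1
    level_dir=$2/$level
    mkdir -p "$level_dir" || return 1

    save_output "$level_dir" kernel-version  uname -r
    save_output "$level_dir" qemu-version    qemu-kvm --version  # assumed binary name
    save_output "$level_dir" libvirt-version virsh --version
    save_output "$level_dir" cpuinfo         cat /proc/cpuinfo
    save_output "$level_dir" lscpu           lscpu
    save_output "$level_dir" dmesg           dmesg               # may need root
}
```

For example, after sourcing the functions, run
``collect_nested_info L0 ./nested-report`` on the bare metal host and
``collect_nested_info L1 ./nested-report`` inside the guest hypervisor,
then attach both directories to the report.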