==============================
Running nested guests with KVM
==============================

Nesting is the ability to run a guest inside another guest (the inner
guest can be KVM-based or use a different hypervisor).  The
straightforward example is a KVM guest that in turn runs on a KVM guest
(the rest of this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (Logical PARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a cloud
  provider, nested KVM lets you rent a single, large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``) often run their own
  VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.19, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.


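As a rough sketch of the verification step for either vendor, the
following POSIX shell function prints the ``nested`` value of whichever
vendor module is loaded.  The function name ``nested_status`` and the
``SYSFS_MODULE_DIR`` override are illustrative assumptions, not part of
KVM; only the ``/sys/module/<module>/parameters/nested`` path layout is
taken from the steps above::

    # Hypothetical helper: report the "nested" parameter of the loaded
    # KVM vendor module.  SYSFS_MODULE_DIR is an assumed override that
    # exists only to make the sketch easy to try out.
    nested_status() {
        dir=${SYSFS_MODULE_DIR:-/sys/module}
        for mod in kvm_intel kvm_amd; do
            param="$dir/$mod/parameters/nested"
            if [ -r "$param" ]; then
                echo "$mod: nested=$(cat "$param")"
                return 0
            fi
        done
        echo "neither kvm_intel nor kvm_amd appears to be loaded" >&2
        return 1
    }

Calling ``nested_status`` on L0 should report ``Y`` or ``1`` when
nesting is enabled; a non-zero exit status means no vendor module was
found.
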
Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
newer, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
Virtualization.  Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).


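The three checks above can be combined into one loop.  This is a
minimal sketch for Intel hosts; the function name and its optional
directory argument are hypothetical and serve only to illustrate the
idea::

    # Hypothetical helper: print each nested-related kvm_intel parameter.
    # The optional first argument overrides the parameter directory
    # (an assumption made for illustration only).
    show_kvm_intel_params() {
        dir=${1:-/sys/module/kvm_intel/parameters}
        for p in enable_shadow_vmcs enable_apicv ept; do
            if [ -r "$dir/$p" ]; then
                echo "$p=$(cat "$dir/$p")"
            else
                echo "$p=unavailable"
            fi
        done
    }

Run ``show_kvm_intel_params`` on L0; any parameter reported as ``N`` or
``unavailable`` is worth a closer look when L2 performance is poor.
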
Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest.  Alternatively, for better live migration compatibility, use a
named CPU model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest
with accelerated KVM.


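Before starting an L2 guest, it can help to confirm from inside L1 that
the virtualization extension was actually exposed to it.  A minimal
sketch, assuming the usual x86 ``flags`` line in ``/proc/cpuinfo``
(``vmx`` on Intel, ``svm`` on AMD); the function name and its optional
file argument are hypothetical::

    # Hypothetical helper: succeed if the CPU flags list vmx or svm.
    # The optional first argument is an alternative cpuinfo file,
    # accepted only so the sketch can be tried on saved output.
    has_virt_flag() {
        cpuinfo=${1:-/proc/cpuinfo}
        grep -E -q '^flags.*[[:space:]](vmx|svm)([[:space:]]|$)' "$cpuinfo"
    }

For example, ``has_virt_flag && echo "nesting-capable"`` run inside L1
prints the message only when one of the two flags is visible to the
guest.
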
Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter, i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature.  With QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems, but may fail once L2 guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation, or what QEMU calls "TCG", while
  they think they're running nested KVM.  This confuses "nested virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

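One quick sanity check for the second point is to confirm, inside L1,
that ``/dev/kvm`` exists and is accessible.  The sketch below is only a
necessary condition, not proof that L2 runs on KVM: the L2 QEMU
command-line still has to request KVM acceleration.  The function name
and its optional device argument are hypothetical::

    # Hypothetical helper: succeed if the KVM device node is a readable
    # and writable character device.  The optional first argument is an
    # alternative device path, accepted for illustration only.
    kvm_device_usable() {
        dev=${1:-/dev/kvm}
        [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
    }

If ``kvm_device_usable`` fails inside L1, the L2 guest cannot be using
nested KVM and is most likely running under TCG.
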
Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1

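Part of the list above can be gathered with a small script, run once on
L0 and once on L1.  This is a sketch under assumptions: the function
names, the output file names and the default directory
``nested-report`` are all hypothetical, and it covers only the kernel
version, ``/proc/cpuinfo``, ``lscpu`` and ``dmesg`` (libvirt and QEMU
versions and command-lines still need to be added by hand)::

    # Hypothetical helper: run a command and save its output; if the
    # command fails or is missing, note that in the file instead of
    # aborting the whole collection.
    capture() {
        outfile=$1
        shift
        "$@" > "$outfile" 2>&1 ||
            echo "(command failed or unavailable: $*)" >> "$outfile"
    }

    # Hypothetical helper: collect the generic items into a directory
    # (first argument, default "nested-report").
    collect_nested_info() {
        out=${1:-nested-report}
        mkdir -p "$out" || return 1
        capture "$out/kernel-version.txt" uname -r
        capture "$out/cpuinfo.txt" cat /proc/cpuinfo
        capture "$out/lscpu.txt" lscpu
        capture "$out/dmesg.txt" dmesg    # may require root
    }

For example, ``collect_nested_info report-L0`` on the host and
``collect_nested_info report-L1`` in the guest hypervisor produce two
directories that can be attached to a bug report.
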
x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0