.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to run guest operating systems easily and efficiently. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability: running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels: the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

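As a rough illustration of checking this setting from user space, the sketch
below reads the module parameter back through sysfs. It assumes the parameter
is exposed at /sys/module/kvm_intel/parameters/nested and that its text starts
with "Y"/"N" or "1"/"0"; the function names are made up for this example and
are not part of KVM::

	#include <assert.h>
	#include <stdio.h>

	/*
	 * Interpret the text of the "nested" parameter.
	 * Returns 1 if enabled, 0 if disabled, -1 if the text is not recognised.
	 */
	int parse_nested_param(const char *text)
	{
		if (text == NULL)
			return -1;

		switch (text[0]) {
		case 'Y': case 'y': case '1':
			return 1;
		case 'N': case 'n': case '0':
			return 0;
		default:
			return -1;
		}
	}

	/*
	 * Read the parameter file at @path (assumed sysfs location, see above).
	 * Returns -1 if the file cannot be read, e.g. kvm-intel is not loaded.
	 */
	int nested_vmx_enabled(const char *path)
	{
		char buf[16] = "";
		FILE *f;

		assert(path != NULL);

		f = fopen(path, "r");
		if (f == NULL)
			return -1;
		if (fgets(buf, sizeof(buf), f) == NULL)
			buf[0] = '\0';
		fclose(f);

		return parse_nested_param(buf);
	}

A caller would pass "/sys/module/kvm_intel/parameters/nested" as the path and
treat a return value of -1 as "unknown" rather than "disabled".
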
No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     - ``-cpu host``          (emulated CPU has all features of the real CPU)

     - ``-cpu qemu64,+vmx``   (add just the vmx feature to a named CPU type)


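Once one of these options is in effect, software running in L1 can see the
VMX feature through CPUID leaf 1, where bit 5 of ECX is the VMX flag. The
sketch below shows one way such a check might look. The function and macro
names are invented for this example, and the CPUID helper assumes a GCC or
Clang toolchain that provides <cpuid.h>::

	#include <assert.h>
	#include <stdint.h>

	/* CPUID.1:ECX bit 5 is the VMX feature flag. */
	#define CPUID_1_ECX_VMX (UINT32_C(1) << 5)

	static_assert(CPUID_1_ECX_VMX == 0x20, "VMX flag is ECX bit 5");

	/* Test the VMX bit in an ECX value obtained from CPUID leaf 1. */
	int ecx_has_vmx(uint32_t ecx)
	{
		return (ecx & CPUID_1_ECX_VMX) != 0;
	}

	#if defined(__x86_64__) || defined(__i386__)
	#include <cpuid.h>

	/* Query the running CPU; returns 0 if leaf 1 is unavailable. */
	int cpu_reports_vmx(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
			return 0;
		return ecx_has_vmx(ecx);
	}
	#endif

Run inside an L1 guest, cpu_reports_vmx() would be expected to return 0 with
the default qemu64 CPU type and 1 with either option above, provided nested
VMX is enabled on the host.
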
ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested in the
internals of this structure. This is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};


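Because the structure is packed and its layout matters for live migration, it
can help to pin field offsets down with compile-time checks. The following is
a minimal user-space sketch of that idea, not the kernel's own code: it copies
only the leading fields of struct vmcs12 into a hypothetical struct
vmcs12_prefix, spells the packed attribute in its GCC/Clang form, and asserts
the offsets that follow from the field sizes listed above::

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>

	typedef uint32_t u32;
	typedef uint64_t u64;

	/* Leading fields of struct vmcs12 only; the rest is omitted here. */
	struct __attribute__((packed)) vmcs12_prefix {
		u32 revision_id;	/* user-visible, per the Intel spec */
		u32 abort;		/* user-visible, per the Intel spec */
		u32 launch_state;
		u32 padding[7];
		u64 io_bitmap_a;
		u64 io_bitmap_b;
	};

	/* The two user-visible fields occupy the first eight bytes. */
	static_assert(offsetof(struct vmcs12_prefix, revision_id) == 0,
		      "revision_id must come first");
	static_assert(offsetof(struct vmcs12_prefix, abort) == 4,
		      "abort must follow revision_id");

	/* 3 * u32 + 7 * u32 of padding = 40 bytes before the first u64. */
	static_assert(offsetof(struct vmcs12_prefix, launch_state) == 8,
		      "launch_state offset changed");
	static_assert(offsetof(struct vmcs12_prefix, io_bitmap_a) == 40,
		      "io_bitmap_a offset changed");
	static_assert(offsetof(struct vmcs12_prefix, io_bitmap_b) == 48,
		      "io_bitmap_b offset changed");

If a field were inserted or resized ahead of io_bitmap_a, the build would fail
at the corresponding assertion, which is the point at which VMCS12_REVISION
would also need to change.
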
Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.