.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
---------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.

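
In these terms, a guest can only act as L1 if L0 exposes the VMX CPU feature
to it. The following is a minimal sketch, not taken from KVM, of how code
running inside such a guest might check for that feature. It relies on the
architectural fact that CPUID leaf 1 reports VMX in bit 5 of ECX, and on the
``__get_cpuid()`` helper from the ``<cpuid.h>`` header shipped with GCC and
Clang. The names ``ecx_reports_vmx`` and ``cpu_reports_vmx`` are hypothetical
and exist only for this illustration.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* __get_cpuid(), provided by GCC and Clang */
#endif

/* CPUID leaf 1 reports VMX support in bit 5 of ECX. */
#define CPUID_1_ECX_VMX (UINT32_C(1) << 5)

/* Hypothetical helper: decode the VMX bit from a raw ECX value. */
static int ecx_reports_vmx(uint32_t ecx)
{
	return (ecx & CPUID_1_ECX_VMX) != 0;
}

/*
 * Hypothetical helper: returns 1 if the CPU this code runs on advertises
 * VMX, 0 if it does not, and -1 if CPUID leaf 1 cannot be queried (or the
 * code was not built for x86).
 */
static int cpu_reports_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return ecx_reports_vmx(ecx);
#else
	return -1;
#endif
}
```

A result of 1 only says that the (virtual) CPU advertises VMX. It does not
show which individual VMX features are available to the guest hypervisor.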


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     -cpu host  (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx  (add just the vmx feature to a named CPU type)


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure.
Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.
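
Because an unnoticed layout change is what breaks migration, it can help to
pin field offsets at compile time. The following is a minimal sketch of that
idea, not the kernel's actual code. It assumes a C11 compiler with GCC-style
attributes, and the excerpt ``struct vmcs12_head`` and the macro
``CHECK_HEAD_OFFSET`` are hypothetical names that cover only the first few
fields of struct vmcs12.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* Hypothetical excerpt: only the leading fields of struct vmcs12. */
struct __attribute__((packed)) vmcs12_head {
	u32 revision_id;	/* user-visible, per the Intel spec */
	u32 abort;		/* user-visible, per the Intel spec */
	u32 launch_state;
	u32 padding[7];
	u64 io_bitmap_a;
	u64 io_bitmap_b;
};

/* Fail the build if a field moves away from its expected offset. */
#define CHECK_HEAD_OFFSET(field, off) \
	static_assert(offsetof(struct vmcs12_head, field) == (off), \
		      "offset of " #field " changed; the revision must change too")

CHECK_HEAD_OFFSET(revision_id, 0);
CHECK_HEAD_OFFSET(abort, 4);
CHECK_HEAD_OFFSET(launch_state, 8);
CHECK_HEAD_OFFSET(io_bitmap_a, 40);
CHECK_HEAD_OFFSET(io_bitmap_b, 48);
```

With checks of this kind, inserting or resizing a field ahead of io_bitmap_a
stops the build instead of silently producing a layout that an older KVM
version would misread.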

::

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.