.. SPDX-License-Identifier: GPL-2.0

=================================
The PPC KVM paravirtual interface
=================================

The basic execution principle by which KVM on PowerPC works is to run all kernel
space code in PR=1, which is user space. This way we trap all privileged
instructions and can emulate them accordingly.

Unfortunately, that is also its downfall. Quite a few privileged instructions
needlessly return us to the hypervisor even though they could be handled
differently.

This is what the PPC PV interface helps with. It takes privileged instructions
and transforms them into unprivileged ones with some help from the hypervisor.
This cuts down virtualization costs by about 50% on some of my benchmarks.

The code for that interface can be found in arch/powerpc/kernel/kvm*

Querying for existence
======================

To find out if we're running on KVM or not, we leverage the device tree. When
Linux is running on KVM, a node /hypervisor exists. That node contains a
compatible property with the value "linux,kvm".

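The check on the compatible property can be sketched in plain C. A device tree
compatible property is a list of NUL-terminated strings packed back to back,
so the sketch below walks that list looking for "linux,kvm". The function name
``compatible_contains`` is hypothetical and not taken from the kernel sources;
in-kernel code would more likely rely on the existing OF helpers such as
``of_find_node_by_path()`` and ``of_device_is_compatible()``.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `want` is one of the NUL-terminated strings stored in the
 * raw property bytes `prop` of length `len`.
 * Hypothetical helper, for illustration only.
 */
static bool compatible_contains(const char *prop, size_t len, const char *want)
{
    size_t want_len = strlen(want);
    size_t pos = 0;

    while (pos < len) {
        /* Find the end of the current entry without running past `len`. */
        const char *end = memchr(prop + pos, '\0', len - pos);
        size_t n = end ? (size_t)(end - (prop + pos)) : len - pos;

        if (n == want_len && memcmp(prop + pos, want, n) == 0)
            return true;
        pos += n + 1;   /* skip the entry and its terminating NUL */
    }
    return false;
}
```

From user space, the raw bytes would typically come from reading
/proc/device-tree/hypervisor/compatible, assuming the device tree is exported
there.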

Once you have determined that you're running under a PV-capable KVM, you can use
hypercalls as described below.

KVM hypercalls
==============

Inside the device tree's /hypervisor node there's a property called
'hypercall-instructions'. This property contains at most 4 opcodes that make
up the hypercall. To issue a hypercall, execute these instructions.

The parameters are as follows:

        ========        ================        ================
        Register        IN                      OUT
        ========        ================        ================
        r0              -                       volatile
        r3              1st parameter           Return code
        r4              2nd parameter           1st output value
        r5              3rd parameter           2nd output value
        r6              4th parameter           3rd output value
        r7              5th parameter           4th output value
        r8              6th parameter           5th output value
        r9              7th parameter           6th output value
        r10             8th parameter           7th output value
        r11             hypercall number        8th output value
        r12             -                       volatile
        ========        ================        ================

Hypercall definitions are shared in generic code, so the same hypercall numbers
apply to x86 and powerpc alike, with the exception that on powerpc each KVM
hypercall number also needs to be ORed with the KVM vendor code, which is
(42 << 16).

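As a small illustration of the ORing described above, the following sketch
builds the value that would be placed in r11. The helper name
``kvm_hcall_token`` is hypothetical, and the numeric value used for
KVM_HC_PPC_MAP_MAGIC_PAGE is an assumption that should be checked against the
generic kvm_para.h header before relying on it.

```c
#include <stdint.h>

#define KVM_VENDOR_ID              42u  /* vendor code stated in the text */
#define KVM_HC_PPC_MAP_MAGIC_PAGE  4u   /* assumed value, verify in kvm_para.h */

/* Combine a generic KVM hypercall number with the KVM vendor code. */
static uint32_t kvm_hcall_token(uint32_t nr)
{
    return (KVM_VENDOR_ID << 16) | nr;
}
```

With these assumptions, ``kvm_hcall_token(KVM_HC_PPC_MAP_MAGIC_PAGE)``
evaluates to 0x2a0004.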

Return codes can be as follows:

        ====            =========================
        Code            Meaning
        ====            =========================
        0               Success
        12              Hypercall not implemented
        <0              Error
        ====            =========================

The magic page
==============

To enable communication between the hypervisor and the guest, there is a shared
page that contains parts of the supervisor-visible register state. The guest can
map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.

Once this hypercall has been issued, the guest always gets the magic page mapped
at the desired location. The first parameter indicates the effective address
when the MMU is enabled. The second parameter indicates the address in real
mode, if applicable to the target. For now, we always map the page to -4096.
This way we can access it using absolute load and store instructions. The
following instruction reads the first field of the magic page::

        ld      rX, -4096(0)

The interface is designed to be extensible should there be a need later to add
additional registers to the magic page. If you add fields to the magic page,
also define a new hypercall feature to indicate that the host can give you more
registers. Make use of the additional features only if the host supports them.

The magic page layout is described by struct kvm_vcpu_arch_shared
in arch/powerpc/include/asm/kvm_para.h.

Magic page features
===================

When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
a second return value is passed to the guest. This second return value contains
a bitmap of available features inside the magic page.

The following enhancements to the magic page are currently available:

  ============================  =======================================
  KVM_MAGIC_FEAT_SR             Maps SR registers r/w in the magic page
  KVM_MAGIC_FEAT_MAS0_TO_SPRG7  Maps MASn, ESR, PIR and high SPRGs
  ============================  =======================================

For enhanced features in the magic page, please check for the existence of the
feature before using them!

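Checking the feature bitmap before touching an enhanced field could look like
the sketch below. The bit positions assigned to the two feature names are
assumptions made for illustration (the text above does not give them), and
``magic_page_has`` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only; check the kernel headers. */
#define KVM_MAGIC_FEAT_SR             (1u << 0)
#define KVM_MAGIC_FEAT_MAS0_TO_SPRG7  (1u << 1)

/*
 * `features` is the second return value of KVM_HC_PPC_MAP_MAGIC_PAGE.
 * Return true only if every bit in `feat` is advertised by the host.
 */
static bool magic_page_has(uint32_t features, uint32_t feat)
{
    return (features & feat) == feat;
}
```

A guest would then guard each use of an enhanced field, for example by patching
segment register accesses only when ``magic_page_has(features,
KVM_MAGIC_FEAT_SR)`` is true.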

Magic page flags
================

In addition to features that indicate whether a host is capable of a particular
feature, we also have a channel for a guest to tell the host whether it's capable
of something. This is what we call "flags".

Flags are passed to the host in the low 12 bits of the effective address (the
first hypercall parameter).

The following flags are currently available for a guest to expose:

  =============================  ==============================================================
  Flag                           Meaning
  =============================  ==============================================================
  MAGIC_PAGE_FLAG_NOT_MAPPED_NX  Guest handles NX bits correctly with respect to the magic page
  =============================  ==============================================================

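Because the magic page is page aligned, the low 12 bits of its effective
address are free to carry flags. The sketch below shows how the parameter
could be assembled and taken apart again. The helper names are hypothetical,
and the bit position used for MAGIC_PAGE_FLAG_NOT_MAPPED_NX is an assumption.

```c
#include <stdint.h>

#define MAGIC_PAGE_EA                  ((uint64_t)-4096)  /* mapping address from the text */
#define MAGIC_PAGE_FLAG_MASK           0xfffull           /* low 12 bits carry the flags */
#define MAGIC_PAGE_FLAG_NOT_MAPPED_NX  (1ull << 0)        /* assumed bit position */

/* Build the effective-address parameter: page address plus guest flags. */
static uint64_t magic_page_param(uint64_t ea, uint64_t flags)
{
    return (ea & ~MAGIC_PAGE_FLAG_MASK) | (flags & MAGIC_PAGE_FLAG_MASK);
}

/* The receiving side can split the parameter back into its two parts. */
static uint64_t magic_page_base(uint64_t param)
{
    return param & ~MAGIC_PAGE_FLAG_MASK;
}

static uint64_t magic_page_flags(uint64_t param)
{
    return param & MAGIC_PAGE_FLAG_MASK;
}
```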

MSR bits
========

The MSR contains bits that require hypervisor intervention and bits that do
not require direct hypervisor intervention because they only get interpreted
when entering the guest or don't have any impact on the hypervisor's behavior.

The following bits are safe to be set inside the guest:

  - MSR_EE
  - MSR_RI

If any other bit in the MSR changes, please still use mtmsr(d).

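The rule above amounts to a mask test on the bits that differ between the old
and the new MSR value. A minimal sketch, with ``msr_update_is_safe`` as a
hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_EE  0x8000ull  /* external interrupt enable */
#define MSR_RI  0x0002ull  /* recoverable interrupt */

/*
 * Return true if going from old_msr to new_msr changes nothing except
 * MSR_EE and/or MSR_RI, i.e. the update needs no trap to the hypervisor.
 */
static bool msr_update_is_safe(uint64_t old_msr, uint64_t new_msr)
{
    uint64_t changed = old_msr ^ new_msr;

    return (changed & ~(MSR_EE | MSR_RI)) == 0;
}
```

When the function returns false, the guest falls back to a real mtmsr(d) and
lets the hypervisor emulate it.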

Patched instructions
====================

The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
respectively on 32 bit systems with an added offset of 4 to account for big
endianness.

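To make the "ld" versus "lwz" transformation concrete, the sketch below encodes
a load from a magic page field with base register 0. On a 32 bit guest it
switches the opcode to "lwz" and adds 4, so that the low word of the big-endian
64 bit field is read. ``encode_magic_load`` and ``field_offset`` are
hypothetical names; real field offsets come from struct kvm_vcpu_arch_shared
and are not assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

#define INST_LD     0xe8000000u  /* ld  RT, DS(RA) */
#define INST_LWZ    0x80000000u  /* lwz RT, D(RA)  */
#define MAGIC_PAGE  (-4096)      /* mapping address from the text */

/*
 * Encode a load of the magic page field at `field_offset` into register `rt`.
 * RA is left as 0, so the displacement is used as an absolute address.
 */
static uint32_t encode_magic_load(unsigned int rt, int field_offset,
                                  bool guest_is_64bit)
{
    int32_t addr = MAGIC_PAGE + field_offset;
    uint32_t op = INST_LD;

    if (!guest_is_64bit) {
        op = INST_LWZ;
        addr += 4;      /* low word of a big-endian 64 bit field */
    }
    /* RT lives in bits 21..25; the 16 bit displacement stays word aligned. */
    return op | ((rt & 0x1fu) << 21) | ((uint32_t)addr & 0xfffcu);
}
```

For register r5 and the first field, this yields 0xe8a0f000 for a 64 bit guest
and 0x80a0f004 for a 32 bit guest.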

The following is a list of mappings the Linux kernel performs when running as a
guest. Implementing any of those mappings is optional, as the instruction traps
also act on the shared page. So calling privileged instructions still works as
before.

======================= ================================
From                    To
======================= ================================
mfmsr   rX              ld      rX, magic_page->msr
mfsprg  rX, 0           ld      rX, magic_page->sprg0
mfsprg  rX, 1           ld      rX, magic_page->sprg1
mfsprg  rX, 2           ld      rX, magic_page->sprg2
mfsprg  rX, 3           ld      rX, magic_page->sprg3
mfsrr0  rX              ld      rX, magic_page->srr0
mfsrr1  rX              ld      rX, magic_page->srr1
mfdar   rX              ld      rX, magic_page->dar
mfdsisr rX              lwz     rX, magic_page->dsisr

mtmsr   rX              std     rX, magic_page->msr
mtsprg  0, rX           std     rX, magic_page->sprg0
mtsprg  1, rX           std     rX, magic_page->sprg1
mtsprg  2, rX           std     rX, magic_page->sprg2
mtsprg  3, rX           std     rX, magic_page->sprg3
mtsrr0  rX              std     rX, magic_page->srr0
mtsrr1  rX              std     rX, magic_page->srr1
mtdar   rX              std     rX, magic_page->dar
mtdsisr rX              stw     rX, magic_page->dsisr

tlbsync                 nop

mtmsrd  rX, 0           b       <special mtmsr section>
mtmsr   rX              b       <special mtmsr section>

mtmsrd  rX, 1           b       <special mtmsrd section>

[Book3S only]
mtsrin  rX, rY          b       <special mtsrin section>

[BookE only]
wrteei  [0|1]           b       <special wrteei section>
======================= ================================

Some instructions require more logic to determine what's going on than a load
or store instruction can deliver. To enable patching of those, we keep some
RAM around into which we can translate instructions on the fly. What happens is
the following:

        1) copy emulation code to memory
        2) patch that code to fit the emulated instruction
        3) patch that code to return to the original pc + 4
        4) patch the original instruction to branch to the new code

That way we can inject an arbitrary amount of code as replacement for a single
instruction. This allows us to check for pending interrupts when setting EE=1,
for example.

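Steps 3 and 4 both come down to encoding a relative "b" instruction between two
addresses. The sketch below shows one way to do that, including the range check
that a relative branch requires. ``encode_branch`` is a hypothetical helper, not
the kernel's actual patching routine.

```c
#include <stdbool.h>
#include <stdint.h>

#define INST_B       0x48000000u  /* b target (AA=0, LK=0) */
#define INST_B_MASK  0x03fffffcu  /* word-aligned 26 bit displacement */
#define INST_B_MAX   0x01ffffff   /* furthest forward reach in bytes */

/*
 * Encode "b" from address `from` to address `to` into *inst.
 * Returns false if the target is unaligned or out of branch range.
 */
static bool encode_branch(uint64_t from, uint64_t to, uint32_t *inst)
{
    int64_t dist = (int64_t)(to - from);

    if (dist > INST_B_MAX || dist < -(int64_t)INST_B_MAX - 1 || (dist & 3))
        return false;

    *inst = INST_B | ((uint32_t)dist & INST_B_MASK);
    return true;
}
```

For a patched instruction at ``pc`` and emulation code whose final slot sits at
``tail``, step 3 would use ``encode_branch(tail, pc + 4, &ret)`` and step 4
would use ``encode_branch(pc, code_start, &jump)``, where ``tail`` and
``code_start`` are illustrative names for addresses inside the copied code.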

Hypercall ABIs in KVM on PowerPC
=================================

1) KVM hypercalls (ePAPR)

This is the ePAPR-compliant hypercall implementation (mentioned above). Even
generic hypercalls are implemented here, like the ePAPR idle hcall. These are
available on all targets.

2) PAPR hypercalls

PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
them are handled in the kernel, some are handled in user space. This is only
available on book3s_64.

3) OSI hypercalls

Mac-on-Linux is another user of KVM on PowerPC, which has had its own hypercall
interface since long before KVM. This is supported to maintain compatibility.
All these hypercalls get forwarded to user space. This is only useful on
book3s_32, but can be used with book3s_64 as well.