.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

In addition to the convenience of letting the device use application virtual
addresses, SVA does not require pinning pages for DMA.
PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to handle application page faults in much the same way
as the CPU does. For more information please refer to the PCIe
specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. The IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use PRI to request that the virtual address be paged into the
CPU page tables. The device must then use ATS again in order to fetch the
translation before use.

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization than hard
partitioning of resources, which could result in underutilization. In order
to allow the hardware to distinguish the context for which work is being
executed through the SWQ interface, SIOV uses the Process Address Space
ID (PASID), which is a 20-bit number defined by the PCI-SIG.

The PASID value is encoded in all transactions from the device. This allows
the IOMMU to track I/O at per-PASID granularity in addition to using the PCIe
Requester ID (RID), which is the Bus/Device/Function.
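
The two identifiers above can be illustrated with a minimal C sketch. It
assumes the conventional PCIe packing of Bus/Device/Function into 16 bits
(8-bit bus, 5-bit device, 3-bit function) and the 20-bit PASID width stated
above. The helper names and the dma_context structure are hypothetical and
exist only for illustration; they are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_BITS 20u
#define PASID_MAX  ((1u << PASID_BITS) - 1u)   /* 0xFFFFF */

/* Pack Bus (8 bits), Device (5 bits), Function (3 bits) into a 16-bit RID. */
static inline uint16_t rid_from_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)(((uint16_t)bus << 8) | ((dev & 0x1Fu) << 3) | (fn & 0x07u));
}

/* A PASID is valid only if it fits in 20 bits. */
static inline bool pasid_is_valid(uint32_t pasid)
{
    return pasid <= PASID_MAX;
}

/* Hypothetical key: one address space (PASID) on one function (RID). */
struct dma_context {
    uint16_t rid;
    uint32_t pasid;
};

static inline bool dma_context_equal(struct dma_context a, struct dma_context b)
{
    return a.rid == b.rid && a.pasid == b.pasid;
}
```

Two transactions from the same function but with different PASIDs compare
as different contexts, which is what per-PASID tracking means in this sketch.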


ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a
completion record, and the PASID (process address space ID) of the current
process.

ENQCMD works with non-posted semantics and carries back a status indicating
whether the command was accepted by hardware. This allows the submitter to
know whether the submission needs to be retried, or whether other
device-specific mechanisms to implement fairness or ensure forward progress
should be provided.

ENQCMD is the glue that ensures applications can directly submit commands
to the hardware and also permits hardware to be aware of application context
to perform I/O operations via use of PASID.
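
The accept-or-retry status described above suggests a bounded retry loop on
the submitter side. The following is a minimal sketch of that idea only. It
does not execute the real instruction: the enqcmd_fn callback is a
hypothetical stand-in for one ENQCMD execution, and mock_enqcmd is an
invented device model so the loop can be exercised without hardware.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one ENQCMD execution: returns true if the
 * device accepted the descriptor, false if the submission should be retried.
 */
typedef bool (*enqcmd_fn)(void *portal, const void *desc);

/* Try up to max_attempts times; return the attempt that succeeded, or -1. */
static int submit_with_retry(enqcmd_fn enqcmd, void *portal,
                             const void *desc, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (enqcmd(portal, desc))
            return attempt;
        /* A real submitter might pause or back off here. */
    }
    return -1;
}

/* Mock device for illustration: rejects until *portal counts down to zero. */
static bool mock_enqcmd(void *portal, const void *desc)
{
    int *busy_slots = portal;

    (void)desc;
    if (*busy_slots > 0) {
        (*busy_slots)--;
        return false;
    }
    return true;
}
```

Bounding the number of attempts is a design choice of this sketch; what to
do after repeated rejection is left to device-specific policy.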

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB. This will force any
  future access by the device to this virtual address to participate in
  ATS. If the IOMMU responds that the page is not present, the device
  requests that the page be paged in via the PCIe PRI protocol before
  performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.
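
As an illustration of what the MSR holds, here is a minimal sketch of
encoding and decoding its value. The layout is an assumption not stated in
this document: the PASID in bits 19:0 and a valid flag in bit 31, which is
how the Linux MSR definitions are believed to describe it. The helper names
are hypothetical, and nothing here reads or writes a real MSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 19:0 hold the PASID, bit 31 is the valid bit. */
#define PASID_MASK       0xFFFFFull
#define PASID_MSR_VALID  (1ull << 31)

/* Build the MSR value for a PASID and mark it valid. */
static inline uint64_t pasid_msr_encode(uint32_t pasid)
{
    return ((uint64_t)pasid & PASID_MASK) | PASID_MSR_VALID;
}

/* True if the MSR value carries a usable PASID. */
static inline bool pasid_msr_is_valid(uint64_t msr)
{
    return (msr & PASID_MSR_VALID) != 0;
}

/* Extract the 20-bit PASID field. */
static inline uint32_t pasid_msr_pasid(uint64_t msr)
{
    return (uint32_t)(msr & PASID_MASK);
}
```

An all-zero value decodes as "no PASID" under this layout, which matches a
cleared MSR.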

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
from this process.  When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in the IOMMU with the process address space used by the CPU (e.g. the
%cr3 register in x86).

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, and thus the same MSR value.

The PASID is cleared when a process is created. The PASID allocation and MSR
programming may occur long after a process and its threads have been created.
One thread must call iommu_sva_bind_device() to allocate the PASID for the
process. If a thread uses ENQCMD without the MSR first being populated, a #GP
will be raised. The kernel will update the PASID MSR with the PASID for all
threads in the process. A single process PASID can be used simultaneously
with multiple devices since they all share the same address space.

One thread can call iommu_sva_unbind_device() to free the allocated PASID.
The kernel will clear the PASID MSR for all threads belonging to the process.

New threads inherit the MSR value from the parent.
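
The lifecycle rules above can be summarized in a small user-space toy model.
This is a sketch for reasoning about the rules, not kernel code: every name
in it is hypothetical, the value 0 is used here merely as this model's
marker for a cleared MSR, and the bind and unbind functions only imitate the
effects attributed above to iommu_sva_bind_device() and
iommu_sva_unbind_device().

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 8
#define PASID_NONE  0u   /* models a cleared MSR in this sketch */

struct toy_process {
    uint32_t pasid;                    /* one PASID per process */
    int      nr_threads;
    uint32_t thread_msr[MAX_THREADS];  /* per-thread copy of the PASID MSR */
};

/* A new process starts with one thread and no PASID. */
static void toy_process_init(struct toy_process *p)
{
    p->pasid = PASID_NONE;
    p->nr_threads = 1;
    p->thread_msr[0] = PASID_NONE;
}

/* New threads inherit the MSR value from the parent thread. */
static int toy_thread_create(struct toy_process *p, int parent)
{
    if (p->nr_threads >= MAX_THREADS)
        return -1;
    p->thread_msr[p->nr_threads] = p->thread_msr[parent];
    return p->nr_threads++;
}

/* Models the bind step: allocate once, then update every thread's MSR. */
static void toy_bind(struct toy_process *p, uint32_t new_pasid)
{
    if (p->pasid == PASID_NONE)
        p->pasid = new_pasid;
    for (int t = 0; t < p->nr_threads; t++)
        p->thread_msr[t] = p->pasid;
}

/* Models the unbind step: free the PASID and clear every thread's MSR. */
static void toy_unbind(struct toy_process *p)
{
    p->pasid = PASID_NONE;
    for (int t = 0; t < p->nr_threads; t++)
        p->thread_msr[t] = PASID_NONE;
}

/* ENQCMD is allowed only with a populated MSR; false models the #GP case. */
static bool toy_enqcmd_allowed(const struct toy_process *p, int thread)
{
    return p->thread_msr[thread] != PASID_NONE;
}
```

In this model a thread created before the bind cannot submit until some
thread binds, and a thread created after the bind can submit immediately
because it inherits the populated value.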

Relationships
=============

 * Each process has many threads, but only one PASID.
 * Devices have a limited number (tens to thousands) of hardware workqueues.
   The device driver manages allocating hardware workqueues.
 * A single mmap() maps a single hardware workqueue as a "portal" and
   each portal maps down to a single workqueue.
 * For each device with which a process interacts, there must be
   one or more mmap()'d portals.
 * Many threads within a process can share a single portal to access
   a single device.
 * Multiple processes can separately mmap() the same portal, in
   which case they still share one device hardware workqueue.
 * The single process-wide PASID is used by all threads to interact
   with all devices.  There is not, for instance, a PASID for each
   thread or each thread<->device pair.

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it. Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it
with POSIX Shared Memory and Secure Virtual Machines, which were terms
already in circulation.

* What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
PASID is included in all transactions between the platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
there is a separate hardware instance required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. With
Shared Work Queues (SWQ), the hardware also manages the queue depth, so
consumers don't need to track it. If there is no space to accept
a command, the device will return an error indicating retry.

A user should check the Deferrable Memory Write (DMWr) capability on the
device and only submit ENQCMD when the device supports it. In the new DMWr
PCIe terminology, devices need to support DMWr completer capability. In
addition, all switch ports must support DMWr routing, which must be enabled
by the PCIe subsystem, much like how PCIe atomic operations are managed.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full-blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each one is required
to be an almost fully functional interface to software, supporting the
traditional BARs, space for interrupts via MSI-X, and its own register layout.
Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization. SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device.  This allows device
hardware to optimize device resource creation, and resources can grow
dynamically on demand. SR-IOV creation and management is very static in
nature. Consult the references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.

* Does memory need to be pinned?

When devices support SVA, and platform hardware such as the IOMMU supports
such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support: the device asks the IOMMU to look up an address before
use via Address Translation Service (ATS) requests.  If the mapping exists
but there is no page allocated by the OS, IOMMU hardware returns that no
mapping exists.

The device then requests that the virtual address be mapped via the Page
Request Interface (PRI). Once the OS has successfully completed the mapping,
it returns the response back to the device. The device then requests the
translation again and continues.

The IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings from
the OS.
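
The ATS and PRI sequence described in this answer can be sketched as a toy
user-space model: translate, request the page on a miss, then translate
again. Everything in it is hypothetical, including the structure, the
function names and the frame numbers the toy OS hands out. It models only
the ordering of the steps, not real IOMMU or PCIe behavior.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 16

struct toy_iommu {
    bool     present[NPAGES];   /* is a page allocated behind the mapping? */
    uint64_t pfn[NPAGES];       /* physical frame for each virtual page */
    int      page_requests;     /* number of PRI requests serviced */
};

/* ATS lookup: succeeds only when the OS has a page behind the mapping. */
static bool toy_ats_translate(const struct toy_iommu *m, unsigned vpn,
                              uint64_t *pfn)
{
    if (vpn >= NPAGES || !m->present[vpn])
        return false;
    *pfn = m->pfn[vpn];
    return true;
}

/* PRI: the toy OS pages the address in and responds to the device. */
static bool toy_pri_request(struct toy_iommu *m, unsigned vpn)
{
    if (vpn >= NPAGES)
        return false;
    m->present[vpn] = true;
    m->pfn[vpn] = 0x1000u + vpn;  /* arbitrary frame chosen by the toy OS */
    m->page_requests++;
    return true;
}

/* Device-side access: ATS, then PRI on a miss, then ATS again. */
static bool toy_device_access(struct toy_iommu *m, unsigned vpn,
                              uint64_t *pfn)
{
    if (toy_ats_translate(m, vpn, pfn))
        return true;
    if (!toy_pri_request(m, vpn))
        return false;
    return toy_ats_translate(m, vpn, pfn);
}
```

A second access to the same page succeeds on the first ATS lookup and issues
no further page request, which is the reason no pinning is needed up front.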

References
==========

VT-d:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf