.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

Besides letting the device use application virtual addresses directly,
SVA removes the need to pin pages for DMA.
PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to function much the same way as the CPU handling
application page-faults. For more information please refer to the PCIe
specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use the PRI in order to request the virtual address to be paged into the
CPU page tables. The device must use ATS again in order to fetch the
translation before use.

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization than hard
partitioning resources, which can leave them underutilized. In order to
allow the hardware to distinguish the context for which work is being
executed in the hardware by the SWQ interface, SIOV uses Process Address Space
ID (PASID), which is a 20-bit number defined by the PCIe SIG.

The PASID value is encoded in all transactions from the device. This allows the
IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
Requester ID (RID), which is the Bus/Device/Function.


ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a completion
record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and returns a status indicating
whether the command was accepted by hardware. This allows the submitter to
know whether the submission needs to be retried, or whether other
device-specific mechanisms to implement fairness or ensure forward progress
should be provided.
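
The retry behaviour described above can be sketched in plain C. This is a
minimal model only, not driver or application code: ``submit_fn``,
``submit_with_retry()`` and ``fake_submit()`` are hypothetical names, and the
simulated queue stands in for a real device portal so that the example runs
without ENQCMD-capable hardware.

```c
#include <assert.h>
#include <stdint.h>

/* 64-byte work descriptor; the real layout is device specific. */
struct work_desc {
    uint8_t bytes[64];
};

/*
 * Hypothetical submit hook standing in for ENQCMD on a mapped portal.
 * Returns 0 if the device accepted the descriptor, nonzero for "retry".
 */
typedef int (*submit_fn)(void *portal, const struct work_desc *desc);

/* Try up to max_tries times; return the number of attempts used, or -1. */
static int submit_with_retry(submit_fn submit, void *portal,
                             const struct work_desc *desc, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (submit(portal, desc) == 0)
            return attempt;
        /* A real submitter would pause or back off here. */
    }
    return -1;
}

/* Simulated shared work queue that is full for the first few submissions. */
struct fake_swq {
    int full_rounds;    /* submissions to reject before accepting */
    int accepted;       /* descriptors accepted so far */
};

static int fake_submit(void *portal, const struct work_desc *desc)
{
    struct fake_swq *q = portal;

    (void)desc;
    if (q->full_rounds > 0) {
        q->full_rounds--;
        return 1;       /* queue full: caller should retry */
    }
    q->accepted++;
    return 0;
}

/* The queue rejects twice, so the third attempt is the one accepted. */
int demo_submit(void)
{
    struct fake_swq q = { .full_rounds = 2, .accepted = 0 };
    struct work_desc desc = { { 0 } };
    int attempts = submit_with_retry(fake_submit, &q, &desc, 5);

    assert(q.accepted == 1);
    return attempts;
}
```

Bounding the number of attempts is one possible policy; whether to spin, back
off, or fall back to another submission path is left to the device-specific
mechanisms mentioned above.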

ENQCMD is the glue that ensures applications can directly submit commands
to the hardware and also permits hardware to be aware of application context
to perform I/O operations via use of PASID.

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB. This will force any
  future access by the device to this virtual address to participate in
  ATS. If the IOMMU responds that the page is not present, the device
  requests the page to be paged in via the PCIe PRI protocol before
  performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.
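
As an illustration of what the MSR holds, the sketch below packs and unpacks
an IA32_PASID value in user-space C. The 20-bit PASID width comes from this
document; placing the PASID in bits 19:0 with a valid flag in bit 31 is an
assumption taken from Intel's published MSR definition, and the helper names
are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PASID_WIDTH      20                         /* PASID is a 20-bit number */
#define PASID_MASK       ((1u << PASID_WIDTH) - 1)  /* bits 19:0 */
#define PASID_MSR_VALID  (1ull << 31)               /* assumed valid flag, bit 31 */

/* Build the MSR value for a PASID and mark it valid. */
static uint64_t pasid_msr_encode(uint32_t pasid)
{
    return (uint64_t)(pasid & PASID_MASK) | PASID_MSR_VALID;
}

/* True if the MSR value carries a usable PASID. */
static bool pasid_msr_is_valid(uint64_t msr)
{
    return (msr & PASID_MSR_VALID) != 0;
}

/* Extract the 20-bit PASID field. */
static uint32_t pasid_msr_pasid(uint64_t msr)
{
    return (uint32_t)(msr & PASID_MASK);
}

/* Round-trip a PASID through the encoding; a cleared MSR is not valid. */
int demo_pasid_msr(void)
{
    uint64_t msr = pasid_msr_encode(0x12345);

    assert(pasid_msr_is_valid(msr));
    assert(pasid_msr_pasid(msr) == 0x12345);
    assert(!pasid_msr_is_valid(0));
    return 0;
}
```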

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests
from this process. When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in the IOMMU with the process page-table root used by the CPU (e.g. the
%cr3 register on x86).

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, thus the same MSR value.

PASID is cleared when a process is created. The PASID allocation and MSR
programming may occur long after a process and its threads have been created.
One thread must call iommu_sva_bind_device() to allocate the PASID for the
process. If a thread uses ENQCMD without the MSR first being populated, a #GP
will be raised. The kernel will update the PASID MSR with the PASID for all
threads in the process. A single process PASID can be used simultaneously
with multiple devices since they all share the same address space.
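
The bind step and the #GP case above can be pictured with a toy model. Nothing
here is kernel code: ``toy_process``, ``toy_bind()`` and ``toy_enqcmd_tag()``
are hypothetical names, the per-thread field merely imitates the MSR, and the
valid flag in bit 31 is an assumption about the MSR layout.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_THREADS      4
#define PASID_MASK       0xfffffu       /* 20-bit PASID */
#define PASID_MSR_VALID  (1ull << 31)   /* assumed valid flag */

struct toy_thread {
    uint64_t pasid_msr;                 /* imitates the per-thread MSR */
};

struct toy_process {
    uint32_t pasid;                     /* one PASID for the whole process */
    int nthreads;
    struct toy_thread threads[MAX_THREADS];
};

/* Models the bind step: record the PASID and update every thread's MSR. */
static void toy_bind(struct toy_process *p, uint32_t pasid)
{
    p->pasid = pasid & PASID_MASK;
    for (int i = 0; i < p->nthreads; i++)
        p->threads[i].pasid_msr = p->pasid | PASID_MSR_VALID;
}

/* Models ENQCMD tagging: returns the PASID, or -1 to stand for a #GP. */
static int64_t toy_enqcmd_tag(const struct toy_thread *t)
{
    if (!(t->pasid_msr & PASID_MSR_VALID))
        return -1;
    return (int64_t)(t->pasid_msr & PASID_MASK);
}

/* Before the bind every submission faults; afterwards all threads share 42. */
int demo_pasid_management(void)
{
    struct toy_process p = { .pasid = 0, .nthreads = 3 };

    assert(toy_enqcmd_tag(&p.threads[1]) == -1);
    toy_bind(&p, 42);
    for (int i = 0; i < p.nthreads; i++)
        assert(toy_enqcmd_tag(&p.threads[i]) == 42);
    return 0;
}
```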

One thread can call iommu_sva_unbind_device() to free the allocated PASID.
The kernel will clear the PASID MSR for all threads belonging to the process.

New threads inherit the MSR value from the parent.

Relationships
=============

 * Each process has many threads, but only one PASID.
 * Devices have a limited number (~10's to 1000's) of hardware workqueues.
   The device driver manages allocating hardware workqueues.
 * A single mmap() maps a single hardware workqueue as a "portal" and
   each portal maps down to a single workqueue.
 * For each device with which a process interacts, there must be
   one or more mmap()'d portals.
 * Many threads within a process can share a single portal to access
   a single device.
 * Multiple processes can separately mmap() the same portal, in
   which case they still share one device hardware workqueue.
 * The single process-wide PASID is used by all threads to interact
   with all devices. There is not, for instance, a PASID for each
   thread or each thread<->device pair.

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it.
Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it with
POSIX Shared Memory and Secure Virtual Machines, which were terms already in
circulation.

* What is a PASID?

A Process Address Space ID (PASID) is an identifier carried in a PCIe-defined
Transaction Layer Packet (TLP) prefix. A PASID is a 20-bit number allocated
and managed by the OS. PASID is included in all transactions between the
platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
there is a separate hardware instance required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. The
hardware also manages the queue depth for Shared Work Queues (SWQ), and
consumers don't need to track queue depth. If there is no space to accept
a command, the device will return an error indicating retry.

A user should check Deferrable Memory Write (DMWr) capability on the device
and only submit ENQCMD when the device supports it. In the new DMWr PCIe
terminology, devices need to support DMWr completer capability.
In addition,
all switch ports must support DMWr routing, and it must be enabled by
the PCIe subsystem, much like how PCIe atomic operations are managed.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each one is required to
be an almost fully functional interface to software, supporting the traditional
BARs, space for interrupts via MSI-X, and its own register layout.
Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization.
SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device. This allows device
hardware to optimize device resource creation and lets those resources grow
dynamically on demand. SR-IOV creation and management is very static in
nature. Consult references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.

* Does memory need to be pinned?

When devices support SVA along with platform hardware such as IOMMU
supporting such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support - Device requests the IOMMU to look up an address before
use via Address Translation Service (ATS) requests. If the mapping exists
but there is no page allocated by the OS, IOMMU hardware returns that no
mapping exists.

Device requests the virtual address to be mapped via Page Request
Interface (PRI). Once the OS has successfully completed the mapping, it
returns the response back to the device. The device requests again for
a translation and continues.

IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings from
the OS.
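
The translate, fault, and retry sequence above can be followed in a small
simulation. It is a deliberately simplified model under stated assumptions: a
boolean per page stands in for the CPU page tables, and ``ats_translate()``,
``pri_page_request()`` and ``device_access()`` are hypothetical names rather
than real IOMMU or PCIe interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Toy address space: present[vpn] says whether the OS has a page mapped. */
struct toy_mm {
    bool present[NPAGES];
};

/* Models an ATS request: succeeds only if the page is currently present. */
static bool ats_translate(const struct toy_mm *mm, unsigned int vpn)
{
    return vpn < NPAGES && mm->present[vpn];
}

/* Models a PRI request: the OS pages the address in if it is a legal one. */
static bool pri_page_request(struct toy_mm *mm, unsigned int vpn)
{
    if (vpn >= NPAGES)
        return false;       /* invalid address: request is rejected */
    mm->present[vpn] = true;
    return true;
}

/*
 * Device-side access: translate, fall back to a page request on a miss,
 * then translate again before use. Returns 0 on success, -1 on failure.
 */
static int device_access(struct toy_mm *mm, unsigned int vpn,
                         unsigned int *page_requests)
{
    if (ats_translate(mm, vpn))
        return 0;
    if (!pri_page_request(mm, vpn))
        return -1;
    (*page_requests)++;
    return ats_translate(mm, vpn) ? 0 : -1;
}

/* First touch needs one page request; the second touch needs none. */
int demo_ats_pri(void)
{
    struct toy_mm mm = { { false } };
    unsigned int requests = 0;

    assert(device_access(&mm, 3, &requests) == 0 && requests == 1);
    assert(device_access(&mm, 3, &requests) == 0 && requests == 1);
    assert(device_access(&mm, 99, &requests) == -1);
    return 0;
}
```

No page is pinned at any point in the model: a page only becomes present when
the device actually touches it.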

References
==========

VT-D:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf