==================================
VFIO - "Virtual Function I/O" [1]_
==================================

Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment. In other words, this allows
safe [2]_, non-privileged, userspace drivers.

Why do we want that? Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance. From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers [3]_.

Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace. Examples include network adapters (often non-TCP/IP based)
and compute accelerators.
Prior to VFIO, these drivers had to either
go through the full development cycle to become a proper upstream
driver, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.

The VFIO driver framework intends to unify these, replacing the
KVM PCI-specific device assignment code and providing a more
secure, more featureful userspace driver environment than UIO.

Groups, Devices, and IOMMUs
---------------------------

Devices are the main target of any I/O driver. Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA. Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment,
as allowing a device read-write access to system memory imposes the
greatest risk to the overall system integrity.

To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (i.e. solving the addressing problems of devices
with limited address spaces). With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.

This isolation is not always at the granularity of a single device,
though. Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger
multi-function enclosure. While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU. Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU. Topology
can also play a factor in terms of hiding devices. A PCIe-to-PCI
bridge masks the devices behind it, making transactions appear as if
they came from the bridge itself. Obviously IOMMU design plays a major
factor as well.

Therefore, while for the most part an IOMMU may have device-level
granularity, any system is susceptible to reduced granularity. The
IOMMU API therefore supports a notion of IOMMU groups. A group is
a set of devices which is isolatable from all other devices in the
system. Groups are therefore the unit of ownership used by VFIO.

While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity.
In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables) and to the user (programming only
a single set of translations). For this reason, VFIO makes use of
a container class, which may hold one or more groups. A container
is created by simply opening the /dev/vfio/vfio character device.

On its own, the container provides little functionality, with all
but a couple of version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality. To do this, the user first needs to identify the
group associated with the desired device. This can be done using
the sysfs links described in the example below. By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device). TBD - interface
for disabling driver probing/locking a device.

Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file. If desired and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container. If a group fails to set to a container
with existing groups, a new empty container will need to be used
instead.

With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.

The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.

VFIO Usage Example
------------------

Assume user wants to access PCI device 0000:06:0d.0::

    $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
    ../../../../kernel/iommu_groups/26

This device is therefore in IOMMU group 26.
This device is on the
pci bus, therefore the user will make use of vfio-pci to manage the
group::

    # modprobe vfio-pci

Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group::

    $ lspci -n -s 0000:06:0d.0
    06:0d.0 0401: 1102:0002 (rev 08)
    # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
    # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

Now we need to look at what other devices are in the group to free
it for use by VFIO::

    $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
    total 0
    lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
        ../../../../devices/pci0000:00/0000:00:1e.0
    lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
        ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
    lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
        ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above. Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).

The final step is to provide the user with access to the group if
unprivileged operation is desired (note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system)::

    # chown user:user /dev/vfio/26

The user now has full access to all the devices and the iommu for this
group and can access them as follows::

    int container, group, device, i;
    struct vfio_group_status group_status =
                                    { .argsz = sizeof(group_status) };
    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

    /* Create a new container */
    container = open("/dev/vfio/vfio", O_RDWR);

    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
            /* Unknown API version */

    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
            /* Doesn't support the IOMMU driver we want.
             */

    /* Open the group */
    group = open("/dev/vfio/26", O_RDWR);

    /* Test the group is viable and available */
    ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
            /* Group is not viable (i.e., not all devices bound for vfio) */

    /* Add the group to the container */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable the IOMMU model we want */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Get additional IOMMU info */
    ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

    /* Allocate some space and setup a DMA mapping */
    dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    dma_map.size = 1024 * 1024;
    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Get a file descriptor for the device */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

    /* Test and setup the device */
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    for (i = 0; i < device_info.num_regions; i++) {
            struct vfio_region_info reg = { .argsz = sizeof(reg) };

            reg.index = i;

            ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

            /* Setup mappings... read/write offsets, mmaps
             * For PCI devices, config space is a region */
    }

    for (i = 0; i < device_info.num_irqs; i++) {
            struct vfio_irq_info irq = { .argsz = sizeof(irq) };

            irq.index = i;

            ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

            /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
    }

    /* Gratuitous device reset and go... */
    ioctl(device, VFIO_DEVICE_RESET);

VFIO User API
-------------

Please see include/linux/vfio.h for complete API documentation.

VFIO bus driver API
-------------------

VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
into VFIO core.
When devices are bound and unbound to the driver,
the driver should call vfio_register_group_dev() and
vfio_unregister_group_dev() respectively::

    void vfio_init_group_dev(struct vfio_device *device,
                             struct device *dev,
                             const struct vfio_device_ops *ops,
                             void *device_data);
    int vfio_register_group_dev(struct vfio_device *device);
    void vfio_unregister_group_dev(struct vfio_device *device);

The driver should embed the vfio_device in its own structure and call
vfio_init_group_dev() to pre-configure it before going to registration.
vfio_register_group_dev() indicates to the core to begin tracking the
iommu_group of the specified dev and register the dev as owned by a VFIO bus
driver. Once vfio_register_group_dev() returns it is possible for userspace to
start accessing the driver, thus the driver should ensure it is completely
ready before calling it.

The driver provides an ops structure for callbacks
similar to a file operations structure::

    struct vfio_device_ops {
            int     (*open)(void *device_data);
            void    (*release)(void *device_data);
            ssize_t (*read)(void *device_data, char __user *buf,
                            size_t count, loff_t *ppos);
            ssize_t (*write)(void *device_data, const char __user *buf,
                             size_t size, loff_t *ppos);
            long    (*ioctl)(void *device_data, unsigned int cmd,
                             unsigned long arg);
            int     (*mmap)(void *device_data, struct vm_area_struct *vma);
    };

Each function is passed the device_data that was originally registered
in the vfio_register_group_dev() call above. This allows the bus driver
an easy place to store its opaque, private data. The open/release
callbacks are issued when a new file descriptor is created for a
device (via VFIO_GROUP_GET_DEVICE_FD). The ioctl interface provides
a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.


PPC64 sPAPR implementation note
-------------------------------

This implementation has some specifics:

1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   container is supported, as an IOMMU table is allocated at boot time,
   one table per IOMMU group, which is a Partitionable Endpoint (PE)
   (a PE is often a PCI domain, but not always).

   Newer systems (POWER8 with IODA2) have an improved hardware design
   which removes this limitation and allows multiple IOMMU groups per
   VFIO container.

2) The hardware supports so-called DMA windows - the PCI address ranges
   within which DMA transfer is allowed; any attempt to access address
   space outside the window leads to the whole PE being isolated.

3) PPC64 guests are paravirtualized but not fully emulated. There is an API
   to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
   currently there is no way to reduce the number of calls. In order to make
   things faster, the map/unmap handling has been implemented in real mode,
   which provides excellent performance but has limitations such as the
   inability to do locked pages accounting in real time.

4) According to the sPAPR specification, a Partitionable Endpoint (PE) is
   an I/O subtree that can be treated as a unit for the purposes of
   partitioning and error recovery.
   A PE may be a single or multi-function IOA (IO Adapter), a
   function of a multi-function IOA, or multiple IOAs (possibly including
   switch and bridge structures above the multiple IOAs). PPC64 guests
   detect PCI errors and recover from them via EEH RTAS services, which
   work on the basis of additional ioctl commands.

   So 4 additional ioctls have been added:

   VFIO_IOMMU_SPAPR_TCE_GET_INFO
       returns the size and the start of the DMA window on the PCI bus.

   VFIO_IOMMU_ENABLE
       enables the container. The locked pages accounting
       is done at this point. This lets the user first learn what
       the DMA window is and adjust the rlimit before doing any real job.

   VFIO_IOMMU_DISABLE
       disables the container.

   VFIO_EEH_PE_OP
       provides an API for EEH setup, error detection and recovery.

   The code flow from the example above should be slightly changed::

       struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };

       .....

       /* Add the group to the container */
       ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

       /* Enable the IOMMU model we want */
       ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);

       /* Get additional sPAPR IOMMU info */
       struct vfio_iommu_spapr_tce_info spapr_iommu_info;
       ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);

       if (ioctl(container, VFIO_IOMMU_ENABLE))
               /* Cannot enable container, may be low rlimit */

       /* Allocate some space and setup a DMA mapping */
       dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

       dma_map.size = 1024 * 1024;
       dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
       dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

       /* Check here if .iova/.size are within DMA window from spapr_iommu_info */
       ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

       /* Get a file descriptor for the device */
       device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

       ....

       /* Gratuitous device reset and go...
        */
       ioctl(device, VFIO_DEVICE_RESET);

       /* Make sure EEH is supported */
       ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);

       /* Enable the EEH functionality on the device */
       pe_op.op = VFIO_EEH_PE_ENABLE;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       /* It is suggested to create an additional data struct to
        * represent a PE, and put child devices belonging to the same
        * IOMMU group into the PE instance for later reference.
        */

       /* Check the PE's state and make sure it's in functional state */
       pe_op.op = VFIO_EEH_PE_GET_STATE;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       /* Save device state using pci_save_state().
        * EEH should be enabled on the specified device.
        */

       ....

       /* Inject EEH error, which is expected to be caused by a 32-bit
        * config load.
        */
       pe_op.op = VFIO_EEH_PE_INJECT_ERR;
       pe_op.err.type = EEH_ERR_TYPE_32;
       pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR;
       pe_op.err.addr = 0ul;
       pe_op.err.mask = 0ul;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       ....

       /* When 0xFF's are returned from reading PCI config space or IO
        * BARs of the PCI device, check the PE's state to see if it has
        * been frozen.
        */
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       /* Wait for pending PCI transactions to complete and don't
        * produce any more PCI traffic from/to the affected PE until
        * recovery is finished.
        */

       /* Enable IO for the affected PE and collect logs. Usually, the
        * standard part of PCI config space and AER registers are dumped
        * as logs for further analysis.
        */
       pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       /*
        * Issue PE reset: hot or fundamental reset. Usually, hot reset
        * is enough. However, the firmware of some PCI adapters would
        * require fundamental reset.
        */
       pe_op.op = VFIO_EEH_PE_RESET_HOT;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);
       pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       /* Configure the PCI bridges for the affected PE */
       pe_op.op = VFIO_EEH_PE_CONFIGURE;
       ioctl(container, VFIO_EEH_PE_OP, &pe_op);

       /* Restore the state we saved at initialization time.
        * pci_restore_state() is good enough as an example.
        */

       /* Hopefully, the error is recovered successfully. Now, you can
        * resume PCI traffic to/from the affected PE.
        */

       ....

5) There is v2 of the SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
   VFIO_IOMMU_DISABLE and implements 2 new ioctls:
   VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
   (which are unsupported in v1 IOMMU).

   PPC64 paravirtualized guests generate a lot of map/unmap requests,
   and the handling of those includes pinning/unpinning pages and updating
   the mm::locked_vm counter to make sure we do not exceed the rlimit.
   The v2 IOMMU splits accounting and pinning into separate operations:

   - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
     ioctls receive a user space address and size of the block to be pinned.
     Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected
     to be called with the exact address and size used for registering
     the memory block. The userspace is not expected to call these often.
     The ranges are stored in a linked list in a VFIO container.

   - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
     IOMMU table and do not do pinning; instead these check that the
     userspace address is from a pre-registered range.

   This separation helps in optimizing DMA for guests.

6) The sPAPR specification allows guests to have additional DMA window(s)
   on a PCI bus with a variable page size. Two ioctls have been added to
   support this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
   The platform has to support the functionality or an error will be
   returned to userspace. The existing hardware supports up to 2 DMA
   windows: one is 2GB long, uses 4K pages and is called the "default
   32bit window"; the other can be as big as the entire RAM, can use a
   different page size, and is optional - guests create those at run-time
   if the guest driver supports 64bit DMA.

   VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
   a number of TCE table levels (if a TCE table is going to be big enough
   and the kernel may not be able to allocate enough physically contiguous
   memory). It creates a new window in the available slot and returns the
   bus address where the new window starts. Due to hardware limitations,
   the user space cannot choose the location of DMA windows.

   VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
   and removes it.

-------------------------------------------------------------------------------

.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
   initial implementation by Tom Lyon while at Cisco. We've since
   outgrown the acronym, but it's catchy.

.. [2] "safe" also depends upon a device being "well behaved". It's
   possible for multi-function devices to have backdoors between
   functions and even for single function devices to have alternative
   access to things like PCI config space through MMIO registers.
   To guard
   against the former we can include additional precautions in the
   IOMMU driver to group multi-function PCI devices together
   (iommu=group_mf). The latter we can't prevent, but the IOMMU should
   still provide isolation. For PCI, SR-IOV Virtual Functions are the
   best indicator of "well behaved", as these are designed for
   virtualization usage models.

.. [3] As always there are trade-offs to virtual machine device
   assignment that are beyond the scope of VFIO. It's expected that
   future IOMMU technologies will reduce some, but maybe not all, of
   these trade-offs.

.. [4] In this case the device is below a PCI bridge, so transactions
   from either function of the device are indistinguishable to the iommu::

       -[0000:00]-+-1e.0-[06]--+-0d.0
                  \-0d.1

       00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)