===================================================
PCI Express I/O Virtualization Resource on PowerNV
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them. The first two sections describe the concept of Partitionable
Endpoints and the implementation on P8 (IODA2). The next two sections talk
about considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, etc.,
but that's not critical.
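The per-PE state can be pictured as a small table indexed by PE number. The
following is a minimal, purely illustrative C sketch of that idea; the
structure, names and layout are invented for this document and do not
reflect the real hardware table format or the kernel's data structures::

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PES 256     /* IODA2: up to 256 PEs per PHB */

    /* Hypothetical, simplified view of the per-PE "frozen" state pair. */
    struct pe_state {
            bool mmio_frozen;   /* set by HW on error, cleared independently */
            bool dma_frozen;    /* set together with mmio_frozen on error    */
    };

    static struct pe_state pe_table[NUM_PES];

    /* On an error, HW freezes both directions for the offending PE. */
    static void pe_freeze(unsigned int pe)
    {
            pe_table[pe].mmio_frozen = true;
            pe_table[pe].dma_frozen  = true;
    }

    int main(void)
    {
            pe_freeze(42);
            printf("PE 42: MMIO frozen=%d DMA frozen=%d\n",
                   pe_table[42].mmio_frozen, pe_table[42].dma_frozen);
            return 0;
    }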
The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory, but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered. There's a PE# in the interrupt controller
      descriptor table as well, which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound. That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space. There is one M32
    window and sixteen M64 windows. They have different characteristics.
    First, what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be a naturally aligned
    power of two in size. The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value. This is typically used to generate
        32-bit PCIe accesses. We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff. (Note: the top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there. The M32 logic
        ignores that, however, and will forward in that space if we try.)

      * It is divided into 256 segments of equal size. A table in the chip
        maps each segment to a PE#. That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity. For a 2GB window,
        the segment granularity is 2GB/256 = 8MB (see the sketch below).

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV). We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs,
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.
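    As a concrete illustration of the M32 segment arithmetic above, here is
    a minimal sketch assuming a 2GB window at 0x8000_0000 and an invented
    array standing in for the segment-to-PE# table in the chip; this is the
    address math only, not the kernel's implementation::

        #include <stdint.h>
        #include <stdio.h>

        #define M32_SEGMENTS 256

        /* Example: a 2GB M32 window forwarding 0x8000_0000..0xffff_ffff. */
        static const uint64_t m32_base = 0x80000000ULL;
        static const uint64_t m32_size = 0x80000000ULL;      /* 2GB */

        /* Hypothetical stand-in for the per-segment PE# table in the chip. */
        static uint8_t m32_segment_to_pe[M32_SEGMENTS];

        static int m32_addr_to_pe(uint64_t addr)
        {
                uint64_t seg_size = m32_size / M32_SEGMENTS; /* 2GB/256 = 8MB */

                if (addr < m32_base || addr >= m32_base + m32_size)
                        return -1;                           /* outside the window */
                return m32_segment_to_pe[(addr - m32_base) / seg_size];
        }

        int main(void)
        {
                m32_segment_to_pe[3] = 7;  /* pretend segment 3 belongs to PE#7 */
                printf("PE# for 0x81900000 = %d\n",
                       m32_addr_to_pe(0x81900000ULL));
                return 0;
        }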
    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus). There is a way to also set the top 14
        bits, which are not conveyed by PowerBus, but we don't use this.

      * Can be configured to be segmented. When not segmented, we can
        specify the PE# for the entire window. When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#. The segment number *is* the PE#.

      * Support overlaps. If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB, ignoring the space
    for the M32, which comes out of a different "reserve"). We configure it
    as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned,
      because the addresses we use directly determine the PE# (see the
      sketch at the end of this section). We then update the M32 PE# for
      the devices that use both 32-bit and 64-bit spaces, or assign the
      remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#. There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children). So we do it in
      SW. We lose a bit of effectiveness of EEH in that case, but that's
      the best we found. So when any of the PEs freezes, we freeze the
      other ones for that "domain". We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs. It would make
    sense, but we haven't done it yet.
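    The segmented M64 behaviour boils down to very simple address math,
    sketched below with made-up window parameters (illustrative only, not
    the kernel's code): because there is no lookup table, the PE# for an
    address is just its segment index within the window, which is why PE#s
    can only be allocated after the 64-bit addresses are assigned::

        #include <stdint.h>
        #include <stdio.h>

        #define M64_SEGMENTS 256

        /* Example: a segmented M64 window covering an assumed 64GB aperture. */
        static const uint64_t m64_base = 0x3fc000000000ULL;  /* invented base */
        static const uint64_t m64_size = 64ULL << 30;        /* 64GB */

        /* In a segmented M64 window the segment number *is* the PE#. */
        static int m64_addr_to_pe(uint64_t addr)
        {
                uint64_t seg_size = m64_size / M64_SEGMENTS; /* 64GB/256 = 256MB */

                if (addr < m64_base || addr >= m64_base + m64_size)
                        return -1;                           /* not in this window */
                return (addr - m64_base) / seg_size;
        }

        int main(void)
        {
                /* A BAR placed at base + 3 * 256MB therefore lands in PE#3. */
                printf("PE# = %d\n",
                       m64_addr_to_pe(m64_base + 3 * (256ULL << 20)));
                return 0;
        }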
3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs). Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual. For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them. For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses. The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs. For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs. Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.
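  In other words, the VF(n) BAR layout is plain arithmetic. The sketch
  below is purely illustrative, reusing the example of eight VFs with a
  1MB VF BAR0 and an assumed base address; it shows how BAR0 of each VF is
  derived from the single VF BAR0 register in the PF's SR-IOV Capability::

      #include <stdint.h>
      #include <stdio.h>

      /* Example values from the text: 8 VFs enabled, 1MB VF BAR0 in the PF. */
      #define NUM_VFS      8
      #define VF_BAR0_SIZE (1ULL << 20)              /* 1MB per VF */

      /* Base programmed into VF BAR0 of the PF SR-IOV Capability (assumed). */
      static const uint64_t vf_bar0_base = 0x100000000ULL;

      /* BAR0 of VF n is a 1MB slice of the contiguous 8MB VF(n) BAR0 space. */
      static uint64_t vf_bar0_addr(unsigned int n)
      {
              return vf_bar0_base + (uint64_t)n * VF_BAR0_SIZE;
      }

      int main(void)
      {
              unsigned int n;

              for (n = 0; n < NUM_VFS; n++)
                      printf("VF%u BAR0 at 0x%llx\n", n,
                             (unsigned long long)vf_bar0_addr(n));
              return 0;
      }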
  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments. The finest granularity possible is a 256MB
    window with 1MB segments. VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window. Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size. If
    they are different sizes, the entire window has to be small enough that
    the segment size matches the smallest VF BAR, which means larger VF
    BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
    to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility. A VF with multiple BARs would have to be in a "domain" of
    multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into 256
    equally-sized segments, and the segment number is the PE#. But if we
    use several M64 windows, they can be set to different base addresses
    and different segment sizes. If we have VFs that each have a 1MB BAR
    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
    another M64 window to assign 32MB segments.

  Finally, the plan is to use M64 windows for SR-IOV, as described in more
  detail in the remainder of this section and the next. For a given VF BAR,
  we need to effectively reserve the entire 256 segments (256 * VF BAR
  size) and position the VF BAR to start at the beginning of a free range
  of segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE to each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  ranges to PE#s. Each M64 window defines one MMIO range, and this range is
  divided into 256 segments, with each segment corresponding to one PE.

  We decided to leverage these M64 windows to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments; if
  total_VFs is less than 256, we have the situation shown in Figure 1.0,
  where segments [total_VFs, 255] of the M64 window may map to some MMIO
  range on other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices. Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1]. There's nothing in hardware that
  responds to segments [total_VFs, 255].
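  The sizing consequence of Figure 1.1 can be stated in one line of
  arithmetic; the sketch below is illustrative only, with total_VFs and
  the individual VF BAR size chosen as example inputs::

      #include <stdint.h>
      #include <stdio.h>

      #define M64_SEGMENTS 256

      int main(void)
      {
              unsigned int total_vfs = 16;         /* example: VFs the PF offers */
              uint64_t vf_bar_size = 1ULL << 20;   /* example: 1MB per-VF BAR    */

              /* What the VF(n) BAR space itself needs (Figure 1.0)... */
              uint64_t natural  = (uint64_t)total_vfs * vf_bar_size;
              /* ...versus what we reserve: all 256 segments (Figure 1.1). */
              uint64_t reserved = (uint64_t)M64_SEGMENTS * vf_bar_size;

              printf("needed %llu MB, reserved %llu MB, extra %llu MB\n",
                     (unsigned long long)(natural >> 20),
                     (unsigned long long)(reserved >> 20),
                     (unsigned long long)((reserved - natural) >> 20));
              return 0;
      }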
4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments. That means there aren't as many
available segments for adjusting the base of the VF(n) BAR space.
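To make the last two paragraphs concrete, here is a small illustrative
sketch (the numbers are invented examples) of how many PE# choices remain
for VF0 inside the 256 reserved segments; segs_per_vf is 1 in the ideal
case where the segment size equals the individual VF BAR size::

    #include <stdio.h>

    #define M64_SEGMENTS 256

    /*
     * Following the text: with 256 segments reserved and the VF(n) BAR space
     * consuming num_vfs * segs_per_vf of them, this many choices remain for
     * the segment (and therefore the PE#) where VF0 starts.
     */
    static int vf0_pe_choices(unsigned int num_vfs, unsigned int segs_per_vf)
    {
            unsigned int used = num_vfs * segs_per_vf;

            return used >= M64_SEGMENTS ? 0 : (int)(M64_SEGMENTS - used);
    }

    int main(void)
    {
            /* Ideal case: segment size == VF BAR size, 16 VFs. */
            printf("choices (16 VFs, 1 seg/VF):  %d\n", vf0_pe_choices(16, 1));
            /* Each VF BAR spans 4 segments: fewer places to slide the base. */
            printf("choices (16 VFs, 4 segs/VF): %d\n", vf0_pe_choices(16, 4));
            return 0;
    }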