===================================================
PCI Express I/O Virtualization Resource on PowerNV
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles these
requirements. The first two sections describe the concept of Partitionable
Endpoints and their implementation on P8 (IODA2). The next two sections
discuss considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1s. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical here.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered.  There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one M32
    window and sixteen M64 windows.  They have different characteristics.
    First what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be a naturally aligned
    power of two in size.  The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to generate
        32-bit PCIe accesses.  We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
        reserved for MSIs but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there.  The M32 logic,
        however, ignores that and will forward in that space if we try.)

      * Is divided into 256 segments of equal size.  A table in the chip
        maps each segment to a PE#.  That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity.  For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus).  There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB; ignore the space
    for the M32, which comes out of a different "reserve").  We configure
    it as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces or assign the remaining PE# to 32-bit only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of effectiveness of EEH in that case, but that's
      the best we found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of
      "master PE" which is the one used for DMA, MSIs, etc., and "secondary
      PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay over specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs.  It would make
    sense, but we haven't done it yet.
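
The outbound segment arithmetic described in this section can be sketched as
follows.  This is an illustrative Python model, not kernel code: the function
names and the ``m32_table`` list are assumptions made for the sketch.  Only
the 256-segment split, the M32 lookup table, and the rule that the segment
number is the PE# in a segmented M64 window come from the text.

.. code-block:: python

   NUM_SEGMENTS = 256  # segments per window on IODA2, as stated above


   def segment_size(window_size):
       """Size in bytes of one segment of a window of the given size."""
       return window_size // NUM_SEGMENTS


   def segment_index(addr, window_base, window_size):
       """Index of the segment that contains addr inside the window."""
       if not window_base <= addr < window_base + window_size:
           raise ValueError("address is outside the window")
       return (addr - window_base) // segment_size(window_size)


   def m32_pe(addr, window_base, window_size, m32_table):
       """M32: a per-segment table (a hypothetical list here) gives the PE#."""
       return m32_table[segment_index(addr, window_base, window_size)]


   def m64_segmented_pe(addr, window_base, window_size):
       """Segmented M64: there is no table; the segment number is the PE#."""
       return segment_index(addr, window_base, window_size)

For a 2GB M32 window, ``segment_size(2 << 30)`` gives the 8MB granularity
quoted above.  The sketch also shows why PE# allocation has to follow address
assignment for M64: the only input to ``m64_segmented_pe()`` is the address.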

3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.  For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses.  The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs.  Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.
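
  The VF BAR layout described above can be written out as a small Python
  sketch.  The function names and the example base address are hypothetical
  and chosen only for illustration; the arithmetic follows the eight-VF, 1MB
  example in the text.

  .. code-block:: python

     def vf_bar_region_size(vf_bar_size, num_vfs):
         """Total size of the VF(n) BAR space behind one VF BAR."""
         return vf_bar_size * num_vfs


     def vf_bar_address(vf_bar_base, vf_bar_size, vf_index):
         """Address of the BAR belonging to VF number vf_index."""
         return vf_bar_base + vf_index * vf_bar_size


     def is_valid_vf_bar_base(vf_bar_base, vf_bar_size):
         """The base only needs to be aligned to a single VF BAR size."""
         return vf_bar_base % vf_bar_size == 0

  With a 1MB VF BAR and eight VFs, ``vf_bar_region_size(1 << 20, 8)`` is 8MB,
  yet a base that is only 1MB-aligned still passes
  ``is_valid_vf_bar_base()``.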

  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.  VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window.  Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.  If
    they are different sizes, the entire window has to be small enough that
    the segment size matches the smallest VF BAR, which means larger VF
    BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
    to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain" of
    multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into 256
    equally-sized segments, and the segment number is the PE#.  But if we
    use several M64 windows, they can be set to different base addresses
    and different segment sizes.  If we have VFs that each have a 1MB BAR
    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
    another M64 window to assign 32MB segments.
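
  As a rough illustration of the last strategy, the Python sketch below
  computes the window size needed so that one segment equals one VF BAR.  The
  helper name is hypothetical; the 256-segment split, the 256MB minimum M64
  size and the power-of-two requirement are taken from the text above.

  .. code-block:: python

     NUM_SEGMENTS = 256
     M64_MIN_WINDOW = 256 << 20  # M64 windows must be at least 256MB


     def m64_window_size_for(vf_bar_size):
         """Window size whose segment size equals one VF BAR."""
         size = vf_bar_size * NUM_SEGMENTS
         if size < M64_MIN_WINDOW:
             raise ValueError("window would be below the 256MB minimum")
         if size & (size - 1):
             raise ValueError("window size must be a power of two")
         return size

  For the example in the list, a 1MB VF BAR needs a 256MB window and a 32MB
  VF BAR needs an 8GB window, so the two BARs end up in two differently
  sized M64 windows.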

  Finally, the plan is to use M64 windows for SR-IOV, as described in more
  detail below.  For a given VF BAR, we need to effectively reserve the
  entire 256 segments (256 * VF BAR size) and position the VF BAR to start
  at the beginning of a free range of segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  range to PE#.  Each M64 window defines one MMIO range and this range is
  divided into 256 segments, with each segment corresponding to one PE.

  We decided to leverage M64 windows to map VFs to individual PEs, since
  SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so a segmented window contains 256 segments.  If
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range on
  other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
  responds to segments [total_VFs, 255].
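
  The difference between what is reserved in software and what responds in
  hardware can be made concrete with a short Python sketch.  The function
  names are illustrative assumptions; the segment ranges are the ones given
  in the paragraph above.

  .. code-block:: python

     NUM_SEGMENTS = 256


     def reserved_size(vf_bar_size):
         """Space reserved in software: all 256 segments of the window."""
         return NUM_SEGMENTS * vf_bar_size


     def responding_segments(total_vfs):
         """Segments backed by a real VF: [0, total_VFs - 1]."""
         return range(0, total_vfs)


     def unused_segments(total_vfs):
         """Segments reserved but not backed by hardware: [total_VFs, 255]."""
         return range(total_vfs, NUM_SEGMENTS)

  For a device with 64 VFs and 1MB VF BARs, 256MB is reserved while only the
  first 64MB is backed by VFs; the other 192 segments exist solely to keep
  other devices out of the window.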

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window.  But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
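
A minimal Python sketch of this relationship, assuming the segment size
equals the VF BAR size, is shown below.  The names ``first_pe`` and
``shifted_vf_bar_base`` are hypothetical and do not refer to kernel symbols;
the sketch only checks that the shifted VF(n) BAR space still fits inside the
256 reserved segments.

.. code-block:: python

   NUM_SEGMENTS = 256


   def vf_pe_numbers(first_pe, num_vfs):
       """PE# of each VF when VF0 sits in segment first_pe: PE(x+n)."""
       if first_pe < 0 or first_pe + num_vfs > NUM_SEGMENTS:
           raise ValueError("VF(n) BAR space would leave the reserved window")
       return [first_pe + n for n in range(num_vfs)]


   def shifted_vf_bar_base(window_base, vf_bar_size, first_pe):
       """VF BAR value that places VF0 in segment first_pe of the window."""
       return window_base + first_pe * vf_bar_size

Choosing ``first_pe`` is therefore the same as choosing the VF BAR value,
which is only possible because the PCI core reserved more space than the
VF(n) BAR space strictly needs.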

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR.  That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
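
The segment count in this case can be worked through with a short Python
sketch.  The function names are illustrative assumptions, and the sketch
assumes each VF BAR starts on a segment boundary.

.. code-block:: python

   def segments_per_vf(vf_bar_size, seg_size):
       """Number of segments needed to cover one VF BAR (rounded up)."""
       return -(-vf_bar_size // seg_size)


   def segments_consumed(num_vfs, vf_bar_size, seg_size):
       """Segments used by the whole VF(n) BAR space: numVFs * n."""
       return num_vfs * segments_per_vf(vf_bar_size, seg_size)


   def vf_pe_range(first_segment, vf_index, vf_bar_size, seg_size):
       """Range of PE#s (segment numbers) that one VF ends up in."""
       n = segments_per_vf(vf_bar_size, seg_size)
       start = first_segment + vf_index * n
       return range(start, start + n)

With 4MB VF BARs and 1MB segments, eight VFs consume 32 segments and each VF
spans four PEs, which leaves fewer free segments for moving the base.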