.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
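
The receive path described above can be modeled in a few lines.  The
following is a minimal sketch, not the kernel implementation: the
Datapath class, toy_extract_key(), and the "miss"/"hit" return values
are all invented for illustration, and the toy key covers only the
input port and the Ethernet type.

```python
# Toy model of the receive path; every name here is illustrative.

def toy_extract_key(in_port, packet):
    """Build a hashable flow key from metadata plus the Ethernet type."""
    eth_type = int.from_bytes(packet[12:14], "big")
    return (("in_port", in_port), ("eth_type", eth_type))


class Datapath:
    def __init__(self):
        self.flow_table = {}   # flow key -> list of actions
        self.upcalls = []      # (key, packet) pairs queued for userspace

    def receive(self, in_port, packet):
        key = toy_extract_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No matching flow: hand the packet and its key to userspace.
            self.upcalls.append((key, packet))
            return "miss"
        for action in actions:
            action(packet)
        return "hit"


sent = []                                   # (vport, packet) pairs "transmitted"
dp = Datapath()
pkt = bytes(12) + b"\x08\x00" + b"payload"  # zeroed MACs, Ethertype 0x0800

first = dp.receive(1, pkt)                  # miss: queued for userspace
key, _ = dp.upcalls[0]
dp.flow_table[key] = [lambda p: sent.append((2, p))]  # userspace installs a flow
second = dp.receive(1, pkt)                 # hit: handled without an upcall
```

After userspace installs a flow for the key it received with the first
packet, the second packet of the same type no longer leaves the
datapath.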


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)
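
The three cases above reduce to one question: did userspace parse any
field that the kernel did not?  The sketch below assumes that a flow
key can be modeled as a dict from attribute name to value;
plan_flow_setup() and its return values are hypothetical names, and
the parenthetical exception in the last case is not modeled.

```python
def plan_flow_setup(kernel_key, user_key):
    """Return ("install", key_to_use) or ("forward", None).

    Keys are modeled as dicts mapping attribute names to values.
    """
    extra_in_userspace = set(user_key) - set(kernel_key)
    if extra_in_userspace:
        # Userspace parsed deeper than the kernel: forward by hand.
        return ("forward", None)
    # Same fields, or the kernel parsed deeper: install with the kernel's key.
    return ("install", kernel_key)


shallow = {"in_port": 1, "eth_type": 0x86DD}
deep = {"in_port": 1, "eth_type": 0x86DD, "ipv6": "(...)"}

same = plan_flow_setup(shallow, dict(shallow))
kernel_knows_more = plan_flow_setup(kernel_key=deep, user_key=shallow)
userspace_knows_more = plan_flow_setup(kernel_key=shallow, user_key=deep)
```

In both "install" outcomes the key handed back is the kernel-provided
one, which is what keeps the second case safe.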

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
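
The informal notation is regular enough to generate mechanically.  As
a rough sketch, assume a key is held as a list of (name, value) pairs
in which a dict stands for named arguments and a list stands for
nested attributes.  That in-memory layout and the format_key() helper
are assumptions made for this example, not part of the Netlink
interface.

```python
def format_key(attrs):
    """Render a list of (name, value) pairs in the informal notation."""
    parts = []
    for name, value in attrs:
        if isinstance(value, dict):      # named arguments, e.g. src=..., dst=...
            inner = ", ".join(f"{k}={v}" for k, v in value.items())
        elif isinstance(value, list):    # nested attributes
            inner = format_key(value)
        else:                            # single unnamed argument
            inner = str(value)
        parts.append(f"{name}({inner})")
    return ", ".join(parts)


key = [
    ("in_port", 1),
    ("eth_type", "0x0800"),
    ("tcp", {"src": 49163, "dst": 80}),
]
text = format_key(key)
nested_text = format_key([("encap", [("eth_type", "0x0800")])])
```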


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact-match flows.  Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't-care bit, which matches either a '1' or a '0' bit
of an incoming packet.  Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that the userspace program needs to process.
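
The mask semantics amount to comparing only the bits that the mask
selects.  A minimal sketch, treating a single 32-bit IPv4 destination
address as the entire key (masked_match() and ip() are illustrative
helpers, and the addresses are arbitrary):

```python
import ipaddress


def masked_match(packet_value, key, mask):
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_value & mask) == (key & mask)


def ip(text):
    """Convert dotted-quad text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


key = ip("172.18.0.52")
wide_mask = ip("255.255.0.0")        # low 16 bits are don't-care bits
exact_mask = ip("255.255.255.255")   # every bit must match

in_prefix = masked_match(ip("172.18.9.9"), key, wide_mask)
out_of_prefix = masked_match(ip("172.19.0.52"), key, wide_mask)
near_miss = masked_match(ip("172.18.0.53"), key, exact_mask)
```

With the wider mask, one installed flow covers every destination in
172.18.0.0/16, where exact matching would need one flow per address
seen.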

Support for the mask Netlink attribute is optional for both the kernel and the
userspace program.  The kernel can ignore the mask attribute, installing an
exact-match flow, or reduce the number of don't-care bits in the kernel to fewer
than what was specified by the userspace program.  In this case, variations in
bits that the kernel does not implement will simply result in additional flow
setups.  The kernel module will also work with userspace programs that neither
support nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed.  There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed.  This provides a handle to
identify the flow for all future operations.  However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined.  It is the
responsibility of the userspace program to ensure that any incoming packet
can match at most one flow, wildcarded or not.  The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them.  However, this behavior may change in future versions.
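
A userspace program can check for overlap itself before installing a
flow.  Two masked flows can both match some packet exactly when their
keys agree on every bit that both masks care about.  The sketch below
models keys and masks as plain integers; may_overlap() is a
hypothetical helper and says nothing about how the kernel's own
best-effort detection works.

```python
def may_overlap(key_a, mask_a, key_b, mask_b):
    """True if some packet could match both masked flows."""
    common = mask_a & mask_b             # bits that both flows match exactly
    return ((key_a ^ key_b) & common) == 0


# No bits in common: a packet can satisfy both flows at once.
disjoint_masks = may_overlap(0b1010, 0b1100, 0b0001, 0b0011)

# Same mask, different values under it: no packet matches both.
different_values = may_overlap(0b1000, 0b1100, 0b0100, 0b1100)

# The second flow is a more specific version of the first.
subset = may_overlap(0b1000, 0b1000, 0b1010, 0b1111)
```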


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
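
As a rough model of the two kinds of handle, the sketch below indexes
each flow either by its UFID, when one is supplied, or by its original
key.  The FlowTable class, the action strings, and the UFID value are
invented for illustration; the real table layout is internal to the
kernel module.

```python
class FlowTable:
    """Toy flow table: the handle is the UFID if given, else the key."""

    def __init__(self):
        self._by_handle = {}

    def install(self, key, actions, ufid=None):
        handle = ufid if ufid is not None else key
        self._by_handle[handle] = {"key": key, "actions": actions}
        return handle

    def lookup(self, handle):
        return self._by_handle.get(handle)

    def delete(self, handle):
        return self._by_handle.pop(handle, None) is not None


table = FlowTable()
key = (("in_port", 1), ("eth_type", 0x0800))
ufid = bytes.fromhex("00112233445566778899aabbccddeeff")  # arbitrary example

table.install(key, ["output:2"], ufid=ufid)
by_ufid = table.lookup(ufid)
by_key = table.lookup(key)     # not indexed by key once a UFID was given
deleted = table.delete(ufid)
```

In this model a flow installed with a UFID cannot be found through its
key, mirroring the statement that the kernel need not index such flows
by the original flow key.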


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences, so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype, then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...)))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
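
The difference between the two layouts shows up when an application
that predates VLAN support filters out the attributes it does not
recognize.  The sketch below uses a made-up in-memory form, a list of
(name, value) pairs; KNOWN_TO_OLD_APP and old_app_view() are
hypothetical and stand in for such an application.

```python
# Attributes understood by a hypothetical application written before
# "vlan" and "encap" existed.
KNOWN_TO_OLD_APP = {"eth", "eth_type", "ip", "tcp"}


def old_app_view(attrs):
    """Keep only the top-level attributes the old application recognizes."""
    return {name: value for name, value in attrs if name in KNOWN_TO_OLD_APP}


inner = [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Naive layout: inner headers sit at the top level next to "vlan".
naive = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0})] + inner

# Nested layout: inner headers are hidden inside "encap".
nested = [
    ("eth", "..."),
    ("eth_type", 0x8100),
    ("vlan", {"vid": 10, "pcp": 0}),
    ("encap", inner),
]

naive_view = old_app_view(naive)     # looks like an untagged IP/TCP flow
nested_view = old_app_view(nested)   # looks like an unknown 0x8100 frame
```

With the naive layout the old application sees what appears to be
plain IP traffic; with the nested layout it sees only an Ethernet type
it does not parse further, which is the behavior it had before VLAN
support was added.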

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
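
The role of the CFI bit can be seen by building the 16-bit TCI value by
hand.  This is a minimal sketch that assumes the standard 802.1Q
layout (3-bit PCP, 1-bit CFI, 12-bit VID); vlan_attr() and describe()
are illustrative helpers, not kernel functions.

```python
VLAN_CFI = 0x1000   # bit 12 of the TCI, set to mark a tag as present


def vlan_attr(vid, pcp):
    """TCI value for a well-formed tag, with the CFI bit forced on."""
    return (pcp << 13) | VLAN_CFI | vid


def describe(tci):
    """Interpret the TCI carried in a vlan flow key attribute."""
    if tci == 0:
        return "missing or malformed TCI"
    return f"vid={tci & 0x0FFF}, pcp={tci >> 13}"


priority_zero_tag = vlan_attr(vid=0, pcp=0)   # a legitimate all-zero tag
vlan_10 = vlan_attr(vid=10, pcp=3)
```

Because a legitimate tag with VID 0 and PCP 0 is encoded as 0x1000, the
value 0 remains free to mean that the TCI could not be read.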

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
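
The last rule is what lets userspace treat a key as an opaque byte
string.  As a small sketch, the function below counts upcalls per key
using the raw bytes as a dictionary key, without parsing them.  The
byte strings are made-up placeholders, not real Netlink attribute
encodings, and count_upcalls() is a hypothetical helper.

```python
from collections import Counter


def count_upcalls(raw_keys):
    """Count upcalls per flow key, using the raw bytes as the identity."""
    return Counter(bytes(k) for k in raw_keys)


key_a = bytes.fromhex("0800030001000000")   # placeholder bytes
key_b = bytes.fromhex("0800030002000000")   # placeholder bytes

counts = count_upcalls([key_a, key_b, bytearray(key_a)])
```

This works only because identical flows always arrive with identical
bytes; if the kernel varied the attribute order between upcalls, the
same flow would be counted under several keys.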