.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, much of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware vs. software
implementation. In this process, hardware specifics are neglected. In
reality those details can have lots of meaning and should be exposed in some
standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like
the Level Path Compression trie (LPC-trie) in hardware.

In many situations, analyzing a system failure based solely on the kernel's
dump may not be enough. By combining this data with complementary
information about the underlying hardware, debugging can be made easier;
additionally, the information can be useful when debugging performance
issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM can be divided
into regions. Complex TC filters can have multiple rules with different
priorities and different lookup keys. Hardware TCAM regions, on the other
hand, have a predefined lookup key. Offloading the TC filter rules using the
TCAM engine can result in multiple TCAM regions being interconnected in a
chain (which may affect the data path latency). In response to a new TC
filter, new tables should be created describing those regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.
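
As a rough illustration of how the three objects relate, the following
Python sketch models them as plain data classes. It is not the kernel's
representation: every class and attribute name is hypothetical and chosen
only for readability, and the example addresses are documentation values.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple


    @dataclass(frozen=True)
    class Header:
        """Names a packet format and the fields inside it."""
        name: str
        fields: Tuple[str, ...]


    @dataclass
    class Entry:
        """One row of a table: match values, action values and a counter."""
        index: int
        match_values: Dict[str, object]
        action_values: Dict[str, object]
        counter: int = 0


    @dataclass
    class Table:
        """A hardware block, described by what it matches and acts on."""
        name: str
        size: int
        matches: List[str]
        actions: List[str]
        counters_enabled: bool = False
        entries: List[Entry] = field(default_factory=list)


    # Hypothetical instances, for demonstration only.
    ipv4 = Header("ipv4", ("src_addr", "dst_addr"))
    local_host = Table(
        name="local_host",
        size=4096,
        matches=["meta.rif_port", "ipv4.dst_addr"],
        actions=["ethernet.daddr"],
        counters_enabled=True,
    )
    local_host.entries.append(
        Entry(
            index=0,
            match_values={"meta.rif_port": 3, "ipv4.dst_addr": "192.0.2.1"},
            action_values={"ethernet.daddr": "00:00:5e:00:53:01"},
        )
    )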

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

As in P4, headers and fields are used to describe a table's behavior. There
is a slight difference between the standard protocol headers and specific
ASIC metadata. The protocol headers should be declared in the ``devlink``
core API. ASIC metadata, on the other hand, is driver specific and should be
defined in the driver. Additionally, each driver-specific devlink
documentation file should document the driver-specific ``dpipe`` headers it
implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported, because this is exactly the process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
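
The three match primitives can be sketched in Python as simple predicates
over integer field values. This is only a model of the semantics described
above, not driver code. The ``FieldRef`` tuple, the convention that header
index 0 is the outermost header, and the inclusive range bounds are all
assumptions made for the example.

.. code:: python

    import ipaddress
    from typing import NamedTuple


    class FieldRef(NamedTuple):
        """Identifies one field; header_index separates tunneled copies."""
        header_id: str
        field_id: str
        header_index: int = 0  # assumed: 0 is the outermost header


    def field_exact(value: int, key: int) -> bool:
        """Exact match on a field."""
        return value == key


    def field_exact_mask(value: int, key: int, mask: int) -> bool:
        """Exact match after masking both the value and the key."""
        return (value & mask) == (key & mask)


    def field_range(value: int, low: int, high: int) -> bool:
        """Match on a range (bounds assumed inclusive)."""
        return low <= value <= high


    outer_dst = FieldRef("ipv4", "dst_addr", 0)
    inner_dst = FieldRef("ipv4", "dst_addr", 1)
    packet = {
        outer_dst: int(ipaddress.IPv4Address("10.1.2.3")),
        inner_dst: int(ipaddress.IPv4Address("192.0.2.7")),
    }

    key = int(ipaddress.IPv4Address("10.1.0.0"))
    mask = 0xFFFF0000  # upper 16 bits, as a /16 entry would use

    outer_hit = field_exact_mask(packet[outer_dst], key, mask)
    inner_hit = field_exact_mask(packet[inner_dst], key, mask)

Here ``outer_hit`` is true and ``inner_hit`` is false: the same header and
field IDs resolve to different values once the header index is taken into
account.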

Action
------

Similar to match, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.
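
A minimal Python sketch of these four actions, applied to a toy packet
object, is given here. The ``Packet`` class and the choice to push and pop
at the outermost position are assumptions made for illustration; real
hardware operates on the packet buffer and its metadata.

.. code:: python

    from typing import Dict, List


    class Packet:
        """Toy packet: an ordered header stack plus named integer fields."""

        def __init__(self, headers: List[str], fields: Dict[str, int]) -> None:
            self.headers = list(headers)  # outermost header first
            self.fields = dict(fields)


    def field_modify(pkt: Packet, name: str, value: int) -> None:
        """Overwrite the field value."""
        pkt.fields[name] = value


    def field_inc(pkt: Packet, name: str, amount: int = 1) -> None:
        """Increment the field value."""
        pkt.fields[name] += amount


    def push_header(pkt: Packet, header: str) -> None:
        """Add a header at the outermost position."""
        pkt.headers.insert(0, header)


    def pop_header(pkt: Packet) -> str:
        """Remove and return the outermost header."""
        return pkt.headers.pop(0)


    pkt = Packet(["ipv4", "tcp"], {"meta.adj_index": 0, "meta.hops": 0})
    field_modify(pkt, "meta.adj_index", 42)
    field_inc(pkt, "meta.hops")
    push_header(pkt, "ethernet")
    removed = pop_header(pkt)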

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified by an index, and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content,
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit, the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
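
The "list of hash tables" idea can be sketched in Python as one dictionary
per prefix length, probed from the longest prefix downward. This is a
simplified IPv4-only model, not the Spectrum implementation: the ``HashLpm``
class is hypothetical, only populated prefix lengths are probed, and the
returned probe count merely stands in for search depth.

.. code:: python

    import ipaddress
    from typing import Dict, Optional, Tuple


    class HashLpm:
        """LPM as a chain of exact-match hash tables, one per prefix length."""

        def __init__(self) -> None:
            # tables[prefix_len] maps a masked address to the next stage.
            self.tables: Dict[int, Dict[int, str]] = {}

        def add_route(self, prefix: str, next_stage: str) -> None:
            net = ipaddress.IPv4Network(prefix)
            table = self.tables.setdefault(net.prefixlen, {})
            table[int(net.network_address)] = next_stage

        def lookup(self, addr: str) -> Optional[Tuple[str, int]]:
            """Return (next_stage, tables_probed), or None on a full miss."""
            value = int(ipaddress.IPv4Address(addr))
            probes = 0
            # Start at the longest prefix; on a miss, fall through to the
            # next shorter one.
            for length in sorted(self.tables, reverse=True):
                probes += 1
                mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
                hit = self.tables[length].get(value & mask)
                if hit is not None:
                    return hit, probes
            return None


    lpm = HashLpm()
    lpm.add_route("10.1.2.3/32", "local_host")
    lpm.add_route("10.1.0.0/16", "local_host")
    lpm.add_route("10.0.0.0/8", "adjacency")

    first = lpm.lookup("10.1.2.3")    # hits the /32 table on the first probe
    second = lpm.lookup("10.1.9.9")   # misses /32, hits /16
    third = lpm.lookup("10.200.0.1")  # misses /32 and /16, hits /8

The growing probe count across the three lookups mirrors the remark above
that search depth affects data path latency.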

Local Host
----------

In the case of local routes, the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface ID with the
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }
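
A small Python model of this table follows, keyed by the pair of RIF and
destination address, with a per-entry hit counter. The ``LocalHostTable``
class and the rule that a counter increments only on a hit while counters
are enabled are assumptions made for the sketch.

.. code:: python

    from typing import Dict, Optional, Tuple

    Key = Tuple[int, str]  # (meta.rif_port, ipv4.dst_addr)


    class LocalHostTable:
        """Exact-match table resolving (RIF, destination IP) to a MAC."""

        def __init__(self, counters_enabled: bool = True) -> None:
            self.counters_enabled = counters_enabled
            self.entries: Dict[Key, str] = {}
            self.counters: Dict[Key, int] = {}

        def add(self, rif_port: int, dst_addr: str, mac: str) -> None:
            key = (rif_port, dst_addr)
            self.entries[key] = mac
            self.counters[key] = 0

        def lookup(self, rif_port: int, dst_addr: str) -> Optional[str]:
            key = (rif_port, dst_addr)
            mac = self.entries.get(key)
            if mac is not None and self.counters_enabled:
                self.counters[key] += 1
            return mac


    hosts = LocalHostTable()
    hosts.add(3, "192.0.2.1", "00:00:5e:00:53:01")
    resolved = hosts.lookup(3, "192.0.2.1")
    wrong_rif = hosts.lookup(4, "192.0.2.1")  # same IP, different RIF: miss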

Adjacency
---------

In the case of remote routes, this table performs ECMP. The LPM lookup
results in an ECMP group size and an index that serves as a global offset
into this table. Concurrently, a hash of the packet is generated. Based on
the ECMP group size and the packet's hash, a local offset is generated.
Multiple LPM entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
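
The offset arithmetic can be sketched in Python as follows. The text above
does not say how the local offset is derived from the hash and the group
size; the modulo used here is one plausible choice and is an assumption, as
are the function names and the use of CRC32 as a stand-in for the hardware
hash.

.. code:: python

    import zlib


    def packet_hash(src: str, dst: str, sport: int, dport: int) -> int:
        """Stand-in flow hash (CRC32); the real hash is hardware specific."""
        return zlib.crc32(f"{src}|{dst}|{sport}|{dport}".encode())


    def adjacency_slot(adj_index: int, adj_group_size: int,
                       pkt_hash: int) -> int:
        """Global offset from LPM plus a local offset within the group."""
        if adj_group_size <= 0:
            raise ValueError("ECMP group size must be positive")
        local_offset = pkt_hash % adj_group_size  # assumed reduction
        return adj_index + local_offset


    flow_hash = packet_hash("10.0.0.1", "10.1.2.3", 1234, 80)
    slot = adjacency_slot(adj_index=100, adj_group_size=4,
                          pkt_hash=flow_hash)

Whatever the hash value, ``slot`` falls within the group's four entries
starting at index 100, and packets of the same flow always select the same
entry.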

ERIF
----

If the egress RIF and destination MAC have been resolved by previous tables,
this table performs multiple operations such as TTL decrement and MTU check.
Then the forward/drop decision is made and the port L3 statistics are
updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
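
The following Python sketch models the checks and the statistics update for
a single egress RIF. It is illustrative only: the ``Erif`` class, the rule
that a packet with a TTL of 1 or less is dropped, and the layout of the
statistics keyed by packet type and verdict are assumptions, not a
description of the hardware.

.. code:: python

    from collections import Counter
    from dataclasses import dataclass, field
    from typing import Tuple


    @dataclass
    class Erif:
        """One egress router interface with per-type L3 statistics."""
        mtu: int
        stats: Counter = field(default_factory=Counter)

        def process(self, ttl: int, length: int,
                    pkt_type: str) -> Tuple[bool, int]:
            """Return (forward, ttl_out) and update the statistics."""
            # Assumed policy: drop when the TTL cannot be decremented
            # further or when the packet exceeds the interface MTU.
            forward = ttl > 1 and length <= self.mtu
            verdict = "forward" if forward else "drop"
            self.stats[(pkt_type, verdict)] += 1
            return forward, (ttl - 1 if forward else ttl)


    erif = Erif(mtu=1500)
    ok = erif.process(ttl=64, length=1000, pkt_type="unicast")
    expired = erif.process(ttl=1, length=1000, pkt_type="unicast")
    too_big = erif.process(ttl=64, length=9000, pkt_type="multicast")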