.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, much of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality those details can carry a lot of meaning and should be exposed in
some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm which in many cases differs
greatly from the hardware implementation.
The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like
the Level Path Compression trie (LPC-trie) in hardware.

In many situations trying to analyze a system failure solely based on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, this debugging can be made
easier; additionally, the information can be useful when debugging
performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it's quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key.
Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created describing those
regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.
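
As a rough illustration of the distinction above, the following Python
sketch models headers and fields identified by enumeration, with a flag
separating protocol headers from driver-specific ASIC metadata and an
optional mapping of a field to a kernel object. All class names, ID values
and the choice of ``rif_port`` as the ifindex-mapped field are assumptions
made for the example; this is not the devlink API.

.. code:: python

    from dataclasses import dataclass
    from enum import IntEnum


    class HeaderId(IntEnum):
        # Hypothetical IDs; real ones come from devlink and the driver.
        ETHERNET = 0
        IPV4 = 1
        DRIVER_META = 2


    class FieldMapping(IntEnum):
        NONE = 0     # raw hardware value
        IFINDEX = 1  # value can be read as a net device ifindex


    @dataclass(frozen=True)
    class Field:
        field_id: int
        name: str
        mapping: FieldMapping = FieldMapping.NONE


    @dataclass(frozen=True)
    class Header:
        header_id: HeaderId
        name: str
        fields: tuple
        is_global: bool  # True: protocol header, False: driver metadata


    def qualified_names(header):
        """Return "header.field" names, as written in the table examples."""
        return [f"{header.name}.{field.name}" for field in header.fields]


    ipv4 = Header(HeaderId.IPV4, "ipv4", (Field(0, "dst_addr"),),
                  is_global=True)
    meta = Header(HeaderId.DRIVER_META, "meta",
                  (Field(0, "rif_port", FieldMapping.IFINDEX),
                   Field(1, "adj_index")),
                  is_global=False)

    print(qualified_names(meta))

A tool reading such a description could translate a ``rif_port`` value into
a net device name, while leaving unmapped fields such as ``adj_index`` as
plain numbers.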

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported because this is exactly the process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).

Action
------

Similar to matches, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.
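
The three primitive match types and the shape of an entry can be sketched in
Python as follows. This is only a minimal model under stated assumptions:
the class layout, the inclusive range bounds, the convention that header
index 0 is the outermost header, and the numeric IDs are all hypothetical
and do not reflect the kernel's data structures.

.. code:: python

    from dataclasses import dataclass, field


    @dataclass(frozen=True)
    class MatchValue:
        kind: str              # field_exact, field_exact_mask, field_range
        header_id: int
        field_id: int
        value: int
        mask: int = 0          # used by field_exact_mask only
        high: int = 0          # upper bound, used by field_range only
        header_index: int = 0  # assumed: 0 is the outermost header


    def hits(mv, field_value):
        """Return True if field_value satisfies the match value mv."""
        if mv.kind == "field_exact":
            return field_value == mv.value
        if mv.kind == "field_exact_mask":
            return (field_value & mv.mask) == (mv.value & mv.mask)
        if mv.kind == "field_range":
            return mv.value <= field_value <= mv.high  # inclusive, assumed
        raise ValueError(f"unknown match kind: {mv.kind}")


    @dataclass
    class Entry:
        index: int
        match_values: list
        action_values: dict = field(default_factory=dict)
        counter: int = 0


    # 10.1.0.0/16 written as a masked exact match on a hypothetical field.
    route = MatchValue("field_exact_mask", header_id=1, field_id=0,
                       value=0x0A010000, mask=0xFFFF0000)
    entry = Entry(index=0, match_values=[route],
                  action_values={"meta.adj_index": 4})

    # 10.1.255.7 hits the entry, 10.2.0.1 does not.
    for dst in (0x0A01FF07, 0x0A020001):
        if all(hits(mv, dst) for mv in entry.match_values):
            entry.counter += 1

The example also shows why an LPM match type is unnecessary: a route of a
given prefix length is just a masked exact match, and the ordering between
prefix lengths is expressed by the tables themselves.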

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface id with the
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In case of remote routes this table does the ECMP. The LPM lookup results in
an ECMP group size and an index that serves as a global offset into this
table. Concurrently a hash of the packet is generated. Based on the ECMP
group size and the packet's hash a local offset is generated. Multiple LPM
entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }

ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs multiple operations such as TTL decrease and MTU
check. Then the forward/drop decision is taken and the port L3 statistics
are updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
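
To make the interaction between the LPM, local host and adjacency tables
more concrete, the following Python sketch walks a destination address
through them. It is a simplified software model, not the hardware's
behavior: the table contents, the addresses, and in particular the use of a
modulo to derive the local ECMP offset from the packet hash are assumptions
chosen for the example.

.. code:: python

    import ipaddress

    # Hypothetical contents. Each LPM hash table holds routes of a single
    # prefix length, keyed by (virtual router id, masked destination).
    LPM_TABLES = {
        32: {},
        24: {(0, "192.0.2.0"): {"rif_port": 7}},  # directly connected
        16: {(0, "10.1.0.0"): {"adj_index": 4, "adj_group_size": 2}},
    }
    LOCAL_HOST = {(7, "192.0.2.10"): "00:00:5e:00:53:01"}
    ADJACENCY = {4: ("00:00:5e:00:53:0a", 11),
                 5: ("00:00:5e:00:53:0b", 12)}


    def lpm_lookup(vr_id, dst):
        """Try the longest prefix first, move to shorter ones on a miss."""
        for prefix_len in sorted(LPM_TABLES, reverse=True):
            net = ipaddress.ip_network(f"{dst}/{prefix_len}", strict=False)
            key = (vr_id, str(net.network_address))
            entry = LPM_TABLES[prefix_len].get(key)
            if entry is not None:
                return entry
        return None


    def resolve(vr_id, dst, packet_hash):
        """Return (destination MAC, egress RIF), or None on an LPM miss."""
        entry = lpm_lookup(vr_id, dst)
        if entry is None:
            return None
        if "rif_port" in entry:
            # Local route: the local host table supplies the MAC address.
            rif = entry["rif_port"]
            return LOCAL_HOST.get((rif, dst)), rif
        # Remote route: global offset plus a local offset from the hash.
        # The modulo is an assumption, not the hardware's real function.
        local = packet_hash % entry["adj_group_size"]
        return ADJACENCY[entry["adj_index"] + local]

In this model ``resolve(0, "192.0.2.10", 0)`` misses the /32 table, hits the
/24 table and is completed by the local host table, while
``resolve(0, "10.1.2.3", 5)`` hits the /16 table and selects the second
member of the adjacency group. The number of tables visited before a hit
corresponds to the search depth that affects data path latency.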