==================
IP over InfiniBand
==================

  The ib_ipoib driver is an implementation of the IP over InfiniBand
  protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
  working group.  It is a "native" implementation in the sense of
  setting the interface type to ARPHRD_INFINIBAND and the hardware
  address length to 20 (earlier proprietary implementations
  masqueraded to the kernel as ethernet interfaces).

Partitions and P_Keys
=====================

  When the IPoIB driver is loaded, it creates one interface for each
  port using the P_Key at index 0.  To create an interface with a
  different P_Key, write the desired P_Key into the main interface's
  /sys/class/net/<intf name>/create_child file.  For example::

    echo 0x8001 > /sys/class/net/ib0/create_child

  This will create an interface named ib0.8001 with P_Key 0x8001.  To
  remove a subinterface, use the "delete_child" file::

    echo 0x8001 > /sys/class/net/ib0/delete_child

  The P_Key for any interface is given by the "pkey" file, and the
  main interface for a subinterface is given by the "parent" file.

  Child interfaces can also be created and deleted through IPoIB's
  rtnl_link_ops; children created either way behave the same.
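
  As a rough sketch of the naming and of the rtnl_link_ops route, the
  helpers below derive a child interface name and print (without
  running) an iproute2 command for the same child.  The function names
  are hypothetical, and the "<parent>.<4 hex digits>" pattern is an
  assumption taken only from the ib0.8001 example above.

```shell
#!/bin/sh
# Hypothetical helpers; these names are illustrative, not part of IPoIB.

# Print the expected child interface name for a parent and a P_Key,
# assuming the "<parent>.<4 hex digits>" pattern seen in ib0.8001.
ipoib_child_name() {
    printf '%s.%04x\n' "$1" "$(($2))"
}

# Print (do not run) an iproute2 command that would create the same
# child through rtnl_link_ops.
ipoib_child_add_cmd() {
    printf 'ip link add link %s name %s type ipoib pkey %s\n' \
        "$1" "$(ipoib_child_name "$1" "$2")" "$2"
}
```

  For instance, "ipoib_child_add_cmd ib0 0x8001" prints a command that
  can be reviewed and then run as root; "ip link delete ib0.8001"
  would remove the child again.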

Datagram vs Connected modes
===========================

  The IPoIB driver supports two modes of operation: datagram and
  connected.  The mode is set and read through an interface's
  /sys/class/net/<intf name>/mode file.
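
  A minimal sketch of reading and setting the mode from a shell
  follows.  The function names and the SYSFS_NET override (useful for
  trying the helpers against a scratch directory) are hypothetical;
  the values "datagram" and "connected" are the two modes named above.

```shell
#!/bin/sh
# Hypothetical helpers. SYSFS_NET is an illustrative override so the
# functions can be exercised without touching a real interface.
: "${SYSFS_NET:=/sys/class/net}"

# Print the current mode of interface $1.
ipoib_get_mode() {
    cat "$SYSFS_NET/$1/mode"
}

# Set the mode of interface $1 to $2, rejecting anything other than
# the two mode names.
ipoib_set_mode() {
    case "$2" in
        datagram|connected)
            printf '%s\n' "$2" > "$SYSFS_NET/$1/mode" ;;
        *)
            echo "mode must be datagram or connected" >&2
            return 1 ;;
    esac
}
```

  On a real system this would be used as "ipoib_set_mode ib0 connected"
  followed by "ipoib_get_mode ib0", run as root.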

  In datagram mode, the IB UD (Unreliable Datagram) transport is used
  and so the interface MTU is equal to the IB L2 MTU minus the
  IPoIB encapsulation header (4 bytes).  For example, in a typical IB
  fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
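
  The same arithmetic can be written as a small shell helper.  The
  function and variable names are hypothetical; only the 4-byte
  encapsulation header comes from the text above.

```shell
#!/bin/sh
# IPoIB encapsulation header length in bytes (from the text above).
IPOIB_ENCAP_LEN=4

# Print the datagram-mode IPoIB MTU for an IB L2 MTU given as $1.
ipoib_ud_mtu() {
    echo "$(( $1 - IPOIB_ENCAP_LEN ))"
}
```

  With a 2K fabric MTU, "ipoib_ud_mtu 2048" prints 2044, and a 4K
  fabric MTU gives 4092.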

  In connected mode, the IB RC (Reliable Connected) transport is used.
  Connected mode takes advantage of the connected nature of the IB
  transport and allows an MTU up to the maximal IP packet size of 64K,
  which reduces the number of IP packets needed for handling large UDP
  datagrams, TCP segments, etc. and improves performance for large
  messages.

  In connected mode, the interface's UD QP is still used for multicast
  and communication with peers that don't support connected mode. In
  this case, RX emulation of ICMP PMTU packets is used to cause the
  networking stack to use the smaller UD MTU for these neighbours.

Stateless offloads
==================

  If the IB HW supports IPoIB stateless offloads, IPoIB advertises
  TCP/IP checksum and/or Large Send (LSO) offloading capability to the
  network stack.

  Large Receive (LRO) offloading is also implemented and may be turned
  on/off using ethtool calls.  Currently LRO is supported only for
  checksum offload capable devices.

  Stateless offloads are supported only in datagram mode.
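
  As an illustration of the ethtool side, the sketch below extracts
  the LRO state from "ethtool -k" output.  The function name is
  hypothetical, and the parsing assumes the feature is reported on a
  line of the form "large-receive-offload: on" or
  "large-receive-offload: off [fixed]".

```shell
#!/bin/sh
# Hypothetical helper: read "ethtool -k <iface>" output on stdin and
# print the LRO state ("on" or "off"); prints nothing if the feature
# line is absent.
lro_state() {
    sed -n 's/^[[:space:]]*large-receive-offload:[[:space:]]*\([a-z]*\).*/\1/p'
}

# Typical use on a real interface (requires ethtool):
#   ethtool -k ib0 | lro_state
#   ethtool -K ib0 lro on
```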

Interrupt moderation
====================

  If the underlying IB device supports CQ event moderation, one can
  use ethtool to set interrupt mitigation parameters and thus reduce
  the overhead incurred by handling interrupts.  The main code path of
  IPoIB doesn't use events for TX completion signaling so only RX
  moderation is supported.
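
  A hedged sketch of what such an ethtool call might look like: the
  helper below validates its arguments and prints (without running)
  an "ethtool -C" command that sets only RX parameters.  The function
  name is hypothetical, and any concrete values passed to it are
  placeholders rather than recommendations.

```shell
#!/bin/sh
# Hypothetical helper: print an ethtool command that sets RX interrupt
# moderation on interface $1 to $2 microseconds and $3 frames.
ipoib_rx_coalesce_cmd() {
    for n in "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "usecs and frames must be non-negative integers" >&2
                return 1 ;;
        esac
    done
    printf 'ethtool -C %s rx-usecs %s rx-frames %s\n' "$1" "$2" "$3"
}
```

  The printed command can be inspected and then run as root; the
  current settings can be shown with "ethtool -c ib0".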

Debugging Information
=====================

  By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
  to 'y', tracing messages are compiled into the driver.  They are
  turned on by setting the module parameters debug_level and
  mcast_debug_level to 1.  These parameters can be controlled at
  runtime through files in /sys/module/ib_ipoib/parameters/.
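
  For example, both tracing levels could be switched together with a
  helper along these lines.  The function name and the IPOIB_PARAMS
  override are hypothetical; the sketch assumes the parameters appear
  in the standard "parameters" directory of the module's sysfs entry.

```shell
#!/bin/sh
# Hypothetical helper. IPOIB_PARAMS is an illustrative override so the
# function can be tried against a scratch directory.
: "${IPOIB_PARAMS:=/sys/module/ib_ipoib/parameters}"

# Write level $1 (0 or 1) to both debug_level and mcast_debug_level.
ipoib_set_debug() {
    for p in debug_level mcast_debug_level; do
        printf '%s\n' "$1" > "$IPOIB_PARAMS/$p" || return 1
    done
}
```

  Running "ipoib_set_debug 1" as root turns the messages on, and
  "ipoib_set_debug 0" turns them off again.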

  CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
  virtual filesystem.  By mounting this filesystem, for example with::

    mount -t debugfs none /sys/kernel/debug

  it is possible to get statistics about multicast groups from the
  files /sys/kernel/debug/ipoib/ib0_mcg and so on.
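
  A small sketch for dumping these files for every interface at once
  is shown below.  The function name and the IPOIB_DEBUGFS override
  are hypothetical, and the "*_mcg" glob is an assumption generalised
  from the ib0_mcg name above.

```shell
#!/bin/sh
# Hypothetical helper. IPOIB_DEBUGFS is an illustrative override so
# the function can be tried against a scratch directory.
: "${IPOIB_DEBUGFS:=/sys/kernel/debug/ipoib}"

# Print each multicast group statistics file, preceded by its name.
ipoib_dump_mcg() {
    for f in "$IPOIB_DEBUGFS"/*_mcg; do
        [ -e "$f" ] || continue     # glob matched nothing
        printf '== %s ==\n' "${f##*/}"
        cat "$f"
    done
}
```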

  The performance impact of this option is negligible, so it
  is safe to enable this option with debug_level set to 0 for normal
  operation.

  CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
  the data path when data_debug_level is set to 1.  However, even with
  the output disabled, enabling this configuration option will affect
  performance, because it adds tests to the fast path.

References
==========

  Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
    http://ietf.org/rfc/rfc4391.txt

  IP over InfiniBand (IPoIB) Architecture (RFC 4392)
    http://ietf.org/rfc/rfc4392.txt

  IP over InfiniBand: Connected Mode (RFC 4755)
    http://ietf.org/rfc/rfc4755.txt