==================
IP over InfiniBand
==================

  The ib_ipoib driver is an implementation of the IP over InfiniBand
  protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
  working group.  It is a "native" implementation in the sense of
  setting the interface type to ARPHRD_INFINIBAND and the hardware
  address length to 20 (earlier proprietary implementations
  masqueraded to the kernel as ethernet interfaces).

Partitions and P_Keys
=====================

  When the IPoIB driver is loaded, it creates one interface for each
  port using the P_Key at index 0.  To create an interface with a
  different P_Key, write the desired P_Key into the main interface's
  /sys/class/net/<intf name>/create_child file.  For example::

    echo 0x8001 > /sys/class/net/ib0/create_child

  This will create an interface named ib0.8001 with P_Key 0x8001.  To
  remove a subinterface, use the "delete_child" file::

    echo 0x8001 > /sys/class/net/ib0/delete_child

  The P_Key for any interface is given by the "pkey" file, and the
  main interface for a subinterface is given by the "parent" file.

  Child interfaces can also be created and deleted using IPoIB's
  rtnl_link_ops; children created either way behave the same.
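
  The naming and sysfs files described above can be sketched as two
  small shell helpers.  Both function names and the SYSFS_NET variable
  are hypothetical and exist only for this illustration; the name
  derivation assumes the P_Key is written as "0x" followed by four hex
  digits, as in the example above.

```shell
# Hypothetical helper: build the subinterface name shown in the example
# above (ib0 + 0x8001 -> ib0.8001). Assumes a P_Key written as 0x plus
# four hex digits.
ipoib_child_name() {
    printf '%s.%s\n' "$1" "${2#0x}"
}

# Hypothetical helper: print the "pkey" and "parent" files of a
# subinterface. SYSFS_NET is an assumption of this sketch; it can be
# pointed at a scratch directory for a dry run.
ipoib_show() {
    dir=${SYSFS_NET:-/sys/class/net}/$1
    printf 'pkey=%s parent=%s\n' "$(cat "$dir/pkey")" "$(cat "$dir/parent")"
}
```

  For instance, ``ipoib_show "$(ipoib_child_name ib0 0x8001)"`` would
  read the two files of the subinterface created in the example above.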

Datagram vs Connected modes
===========================

  The IPoIB driver supports two modes of operation: datagram and
  connected.  The mode is set and read through an interface's
  /sys/class/net/<intf name>/mode file.

  In datagram mode, the IB UD (Unreliable Datagram) transport is used
  and so the interface MTU is equal to the IB L2 MTU minus the
  IPoIB encapsulation header (4 bytes).  For example, in a typical IB
  fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.

  In connected mode, the IB RC (Reliable Connected) transport is used.
  Connected mode takes advantage of the connected nature of the IB
  transport and allows an MTU up to the maximal IP packet size of 64K,
  which reduces the number of IP packets needed for handling large UDP
  datagrams, TCP segments, etc., and increases the performance for
  large messages.

  In connected mode, the interface's UD QP is still used for multicast
  and communication with peers that don't support connected mode.  In
  this case, RX emulation of ICMP PMTU packets is used to cause the
  networking stack to use the smaller UD MTU for these neighbours.

Stateless offloads
==================

  If the IB HW supports IPoIB stateless offloads, IPoIB advertises
  TCP/IP checksum and/or Large Send (LSO) offloading capability to the
  network stack.
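
  One way to see which of these capabilities an interface advertises is
  to inspect the output of ``ethtool -k``.  The helper below is only a
  sketch: the function name is hypothetical, and it assumes the
  "feature: state" line layout and the feature names (for example
  tx-checksumming and tcp-segmentation-offload) that ethtool prints.

```shell
# Hypothetical helper: read `ethtool -k` style output on stdin and print
# the state ("on" or "off") of one top-level feature line. Indented
# sub-feature lines do not match because their first field starts with
# a tab.
offload_state() {
    awk -v feat="$1" -F': ' '$1 == feat { split($2, s, " "); print s[1] }'
}
```

  Possible usage: ``ethtool -k ib0 | offload_state tx-checksumming``.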

  Large Receive (LRO) offloading is also implemented and may be turned
  on/off using ethtool calls.  Currently LRO is supported only for
  checksum offload capable devices.

  Stateless offloads are supported only in datagram mode.

Interrupt moderation
====================

  If the underlying IB device supports CQ event moderation, one can
  use ethtool to set interrupt mitigation parameters and thus reduce
  the overhead incurred by handling interrupts.  The main code path of
  IPoIB doesn't use events for TX completion signaling so only RX
  moderation is supported.

Debugging Information
=====================

  By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
  to 'y', tracing messages are compiled into the driver.  They are
  turned on by setting the module parameters debug_level and
  mcast_debug_level to 1.  These parameters can be controlled at
  runtime through files in /sys/module/ib_ipoib/.

  CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
  virtual filesystem.  By mounting this filesystem, for example with::

    mount -t debugfs none /sys/kernel/debug

  it is possible to get statistics about multicast groups from the
  files /sys/kernel/debug/ipoib/ib0_mcg and so on.
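
  Toggling the tracing parameters at runtime might look like the
  following sketch.  The function name and the PARAM_DIR variable are
  hypothetical; the default path assumes the usual
  /sys/module/<module>/parameters/ layout for module parameters.

```shell
# Hypothetical helper: write the same level to debug_level and
# mcast_debug_level. PARAM_DIR is an assumption of this sketch and can
# be pointed at a scratch directory for a dry run.
ipoib_set_debug() {
    level=$1
    dir=${PARAM_DIR:-/sys/module/ib_ipoib/parameters}
    for p in debug_level mcast_debug_level; do
        # Fail early if the parameter file is missing or read-only.
        [ -w "$dir/$p" ] || { echo "cannot write $dir/$p" >&2; return 1; }
        echo "$level" > "$dir/$p"
    done
}
```

  Calling ``ipoib_set_debug 1`` as root would turn the tracing messages
  on, and ``ipoib_set_debug 0`` would turn them off again.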

  The performance impact of CONFIG_INFINIBAND_IPOIB_DEBUG is
  negligible, so it is safe to enable it with debug_level set to 0 for
  normal operation.

  CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
  the data path when data_debug_level is set to 1.  However, even with
  the output disabled, enabling this configuration option will affect
  performance, because it adds tests to the fast path.

References
==========

  Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
    http://ietf.org/rfc/rfc4391.txt

  IP over InfiniBand (IPoIB) Architecture (RFC 4392)
    http://ietf.org/rfc/rfc4392.txt

  IP over InfiniBand: Connected Mode (RFC 4755)
    http://ietf.org/rfc/rfc4755.txt