****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server. Those are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.
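
The session/path/connection hierarchy described above can be illustrated with
a small user-space model. This is only a sketch and not the driver's code: the
class names, the `connected` flag and the round-robin selection are
hypothetical and were chosen for illustration (the driver's actual policy is
configured through the "mp_policy" entry mentioned earlier).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Path:
    """One physical link between client and server (hypothetical model)."""
    name: str
    nr_cpus: int
    connected: bool = True
    # One connection (QP) per client CPU, identified here by its CPU index.
    connections: List[int] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        self.connections = list(range(self.nr_cpus))


@dataclass
class Session:
    """A session groups all paths between one client and one server."""
    name: str
    paths: List[Path]
    _next: int = 0  # round-robin cursor (assumed policy, for illustration)

    def total_connections(self) -> int:
        """Number of QPs across all paths of the session."""
        return sum(len(p.connections) for p in self.paths)

    def pick_path(self) -> Optional[Path]:
        """Round-robin over the paths, skipping disconnected ones."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next % len(self.paths)]
            self._next += 1
            if path.connected:
                return path
        return None  # no healthy path left
```

With two paths on a client with four cpus, the model holds eight connections;
marking one path as disconnected makes every subsequent pick land on the
remaining path, which is the failover behaviour the paragraph describes.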

When processing an incoming write or read request, rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

Module parameter always_invalidate is introduced for the security problem
discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we
invalidate each rdma buffer before we hand it over to RNBD server and
then pass it to the block layer. A new rkey is generated and registered for the
buffer after it returns back from the block layer and RNBD server.
The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. This invalidation and
registration on each IO causes a performance drop of up to 20%.
A user of the driver may choose to load the modules with this mechanism
switched off (always_invalidate=N), if they understand and can take the risk
of a malicious client being able to corrupt memory of a server it is connected
to. This might be a reasonable option in a scenario where all the clients and
all the servers are located within a secure datacenter.


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include uuid of the session and uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, total number of connections per session
(as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve the situations where
client is trying to reconnect a path, while server is still destroying the old
one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CONN_RSP messages to the rdma_accept.
Apart from magic and
protocol version, the messages include error code, queue depth supported by
the server (number of memory chunks which are going to be allocated for that
session) and the maximum size of one io. The RTRS_MSG_NEW_RKEY_F flag is set
when always_invalidate=Y.

3. After all connections of a path are established client sends to server the
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field) which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                  <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                  <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                  -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                  -------------------> [RTRS_HB_MSG_ACK]

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Write (always_invalidate=Y) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with
the memory chunk and, when that finishes, passes the IO to the RNBD server
module.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer and finishes the IO, then posts
the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message.
This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

* Read (always_invalidate=Y) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk and,
when that finishes, passes the IO to the RNBD server module.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field.
The 32 bit field is used to specify the
outstanding inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer and finishes the IO, then posts
the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>