****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using an RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session consists of
multiple paths, each representing a separate physical link between client and
server. Those are used for load balancing and failover. Each path consists of
as many connections (QPs) as there are cpus on the client.

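The session/path/connection hierarchy described above can be pictured with a
small Python model. This is only an illustrative sketch and not code from the
RTRS sources; the class names, the "link0"/"link1" labels and the chunk count
are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    cpu: int  # client CPU this connection (QP) belongs to


@dataclass
class Path:
    link: str  # hypothetical label for one physical link
    connections: List[Connection] = field(default_factory=list)


@dataclass
class Session:
    name: str
    nr_chunks: int  # memory chunks reserved on the server for this client
    paths: List[Path] = field(default_factory=list)

    def add_path(self, link: str, nr_cpus: int) -> Path:
        # Each path gets as many connections as the client has CPUs.
        path = Path(link, [Connection(cpu) for cpu in range(nr_cpus)])
        self.paths.append(path)
        return path


sess = Session("demo", nr_chunks=512)
sess.add_path("link0", nr_cpus=4)
sess.add_path("link1", nr_cpus=4)
total_qps = sum(len(p.connections) for p in sess.paths)
print(len(sess.paths), total_qps)  # 2 8
```

With two paths and four client cpus the session in this model holds eight
connections in total.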
When processing an incoming write or read request, the rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to report the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

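How two such pieces of information fit into one 32 bit immediate value can be
sketched as follows. The bit widths, the type codes and the chunk size used
here are assumptions chosen for illustration; this README does not specify
the layout, so the driver headers remain the reference for the real one.

```python
TYPE_BITS, PAYL_BITS = 4, 28   # assumed split of the 32 bit immediate
ERRNO_BITS, ID_BITS = 9, 19    # assumed split of a response payload
IO_REQ, IO_RSP = 0, 1          # illustrative type codes


def mask(bits):
    return (1 << bits) - 1


def to_imm(imm_type, payload):
    return ((imm_type & mask(TYPE_BITS)) << PAYL_BITS) | (payload & mask(PAYL_BITS))


def req_imm(chunk_id, offset, chunk_shift):
    # Client to server: chunk index in the high bits, message offset below it.
    return to_imm(IO_REQ, (chunk_id << chunk_shift) | offset)


def parse_req_imm(imm, chunk_shift):
    payload = imm & mask(PAYL_BITS)
    return imm >> PAYL_BITS, payload >> chunk_shift, payload & mask(chunk_shift)


def rsp_imm(req_id, err):
    # Server to client: id of the acknowledged request plus the errno value.
    payload = ((abs(err) & mask(ERRNO_BITS)) << ID_BITS) | (req_id & mask(ID_BITS))
    return to_imm(IO_RSP, payload)


def parse_rsp_imm(imm):
    payload = imm & mask(PAYL_BITS)
    return imm >> PAYL_BITS, payload & mask(ID_BITS), -(payload >> ID_BITS)


CHUNK_SHIFT = 17  # log2 of an assumed 128 KiB chunk size
req = req_imm(chunk_id=5, offset=4096, chunk_shift=CHUNK_SHIFT)
rsp = rsp_imm(req_id=42, err=-5)
print(parse_req_imm(req, CHUNK_SHIFT))  # (0, 5, 4096)
print(parse_rsp_imm(rsp))               # (1, 42, -5)
```

Packing everything into the immediate means neither side needs an extra
message just to say where the payload is or which request has completed.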
The module parameter always_invalidate was introduced to address the security
problem discussed at LPC RDMA MC 2019. When always_invalidate=Y, the server
side invalidates each rdma buffer before it is handed over to the RNBD server
and then passed to the block layer. A new rkey is generated and registered for
the buffer after it returns from the block layer and the RNBD server. The new
rkey is sent back to the client along with the IO result. This procedure is
the default behaviour of the driver. The invalidation and registration on each
IO cause a performance drop of up to 20%. A user of the driver may choose to
load the modules with this mechanism switched off (always_invalidate=N) if
they understand and accept the risk of a malicious client being able to
corrupt the memory of a server it is connected to. This might be a reasonable
option in a scenario where all the clients and all the servers are located
within a secure datacenter.


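The invalidate and re-register cycle can be modelled in a few lines of Python.
This is a toy model under stated assumptions, not the driver's implementation:
the Chunk class, the counter that stands in for keys handed out by the HCA and
the serve_io() helper are all hypothetical.

```python
import itertools


class Chunk:
    """One server-side rdma buffer protected by an rkey."""

    _keys = itertools.count(1)  # stand-in for keys handed out by the HCA

    def __init__(self):
        self.rkey = next(self._keys)
        self.valid = True
        self.data = b""

    def remote_write(self, rkey, data):
        # Model of an rdma write: rejected unless the rkey is current and valid.
        if not self.valid or rkey != self.rkey:
            raise PermissionError("stale or invalid rkey")
        self.data = data

    def invalidate(self):
        self.valid = False

    def reregister(self):
        self.rkey = next(self._keys)
        self.valid = True
        return self.rkey


def serve_io(chunk, always_invalidate):
    """Process the IO held in chunk; return (result, rkey to use next)."""
    if always_invalidate:
        chunk.invalidate()        # the client can no longer touch the buffer
    result = len(chunk.data)      # stand-in for RNBD server and block layer
    if always_invalidate:
        return result, chunk.reregister()
    return result, chunk.rkey


chunk = Chunk()
old_rkey = chunk.rkey
chunk.remote_write(old_rkey, b"abc")
result, new_rkey = serve_io(chunk, always_invalidate=True)
try:
    chunk.remote_write(old_rkey, b"evil")
    stale_rejected = False
except PermissionError:
    stale_rejected = True
chunk.remote_write(new_rkey, b"ok")
```

In this model a client that keeps using the old rkey is rejected, which is the
protection that always_invalidate=N gives up in exchange for performance.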
Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve the situations where the
client is trying to reconnect a path while the server is still destroying the
old one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and
protocol version, the messages include the error code, the queue depth
supported by the server (number of memory chunks which are going to be
allocated for that session) and the maximum size of one io. The
RTRS_MSG_NEW_RKEY_F flag is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message to the server, containing the name of the session.
This message requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field) which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

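Steps 6 and 7 can be sketched as a small Python model of heartbeat timeout
detection and fail-over. It is a simplified illustration, not the driver
logic: the class and attribute names, the timeout value and the "first healthy
path" selection are assumptions, and the reconnect itself is only recorded.

```python
class Path:
    def __init__(self, name):
        self.name = name
        self.connected = True
        self.last_seen = 0.0  # time of the last heartbeat from the peer
        self.inflight = []    # requests sent on this path, not yet acknowledged


class Session:
    def __init__(self, paths, hb_timeout):
        self.paths = paths
        self.hb_timeout = hb_timeout
        self.reconnecting = []  # paths handed to the reconnect mechanism

    def fail_path(self, path):
        """Disconnect path and move its inflight IO to a healthy path, if any."""
        path.connected = False
        self.reconnecting.append(path)
        pending, path.inflight = path.inflight, []
        target = next((p for p in self.paths if p.connected), None)
        if target is None:
            return pending  # no healthy path left: these requests fail
        target.inflight.extend(pending)
        return []

    def check_heartbeats(self, now):
        failed = []
        for path in self.paths:
            if path.connected and now - path.last_seen > self.hb_timeout:
                failed.extend(self.fail_path(path))
        return failed


a, b = Path("a"), Path("b")
a.inflight = ["io1", "io2"]
b.last_seen = 8.0  # only path b has been heard from recently
sess = Session([a, b], hb_timeout=5.0)
failed = sess.check_heartbeats(now=10.0)
print(a.connected, b.inflight, failed)  # False ['io1', 'io2'] []
```

Path a misses its heartbeat deadline, so its two inflight requests move to
path b and path a is queued for reconnect.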
CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                   <------------------- RTRS_MSG_CONN_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                   <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                   -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                   -------------------> [RTRS_HB_MSG_ACK]

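The request/response part of the exchange (steps 1-4) can be walked through
with a toy Python model. Everything concrete in it is an assumption made for
illustration: the field names, the MAGIC and VERSION values, the fake buffer
addresses and rkeys, and the use of EINVAL for a mismatch. The reconnect
counter is carried in the request but not modelled.

```python
import errno
import uuid
from dataclasses import dataclass

MAGIC, VERSION = 0x1BBD, 2  # hypothetical values


@dataclass
class ConReq:
    magic: int
    version: int
    sess_uuid: uuid.UUID
    path_uuid: uuid.UUID
    cid: int        # id of this connection
    cid_num: int    # total number of connections requested
    recon_cnt: int  # reconnect counter (not modelled here)


class Server:
    def __init__(self, queue_depth, max_io_size, always_invalidate=True):
        self.queue_depth = queue_depth
        self.max_io_size = max_io_size
        self.always_invalidate = always_invalidate
        self.paths = {}  # (sess_uuid, path_uuid) -> set of accepted cids

    def accept(self, req):
        """Steps 1-2: answer one connection request."""
        if (req.magic, req.version) != (MAGIC, VERSION):
            return {"errno": -errno.EINVAL}
        key = (req.sess_uuid, req.path_uuid)
        self.paths.setdefault(key, set()).add(req.cid)
        return {"errno": 0, "queue_depth": self.queue_depth,
                "max_io_size": self.max_io_size,
                "new_rkey": self.always_invalidate}

    def info(self, key, cid_num):
        """Steps 3-4: return (address, rkey) pairs once the path is complete."""
        if len(self.paths.get(key, ())) != cid_num:
            return None
        return [(0x10000 + i * self.max_io_size, 100 + i)
                for i in range(self.queue_depth)]


srv = Server(queue_depth=4, max_io_size=128 * 1024)
sess_id, path_id, nr_cpus = uuid.uuid4(), uuid.uuid4(), 2
early = srv.info((sess_id, path_id), nr_cpus)  # None: nothing connected yet
rsps = [srv.accept(ConReq(MAGIC, VERSION, sess_id, path_id, cid, nr_cpus, 0))
        for cid in range(nr_cpus)]
chunks = srv.info((sess_id, path_id), nr_cpus)
print(len(chunks))  # 4
```

The info request only succeeds after every connection of the path has been
accepted, mirroring the order shown in the diagram.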
IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and rdma writes there the user data, user header and
the RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk
has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

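The layout of a write inside a chunk can be illustrated with Python's struct
module. This is a sketch under assumptions: the message is modelled as two
16 bit little-endian fields (type, user header size), MSG_WRITE is a made-up
type code, and the 4096 byte chunk is arbitrary. It shows why the offset
carried in the IMM field is enough for the server to locate all three parts.

```python
import struct

MSG_WRITE = 2    # hypothetical type code
MSG_FMT = "<HH"  # assumed layout: 16 bit type, 16 bit user header size


def build_write(usr_data, usr_hdr):
    """Client side: data first, then the user header, then the message."""
    msg = struct.pack(MSG_FMT, MSG_WRITE, len(usr_hdr))
    msg_off = len(usr_data) + len(usr_hdr)  # offset reported via the IMM field
    return usr_data + usr_hdr + msg, msg_off


def parse_write(chunk, msg_off):
    """Server side: start at the message and walk backwards."""
    msg_type, usr_len = struct.unpack_from(MSG_FMT, chunk, msg_off)
    if msg_type != MSG_WRITE:
        raise ValueError("not a write message")
    hdr_off = msg_off - usr_len
    return bytes(chunk[:hdr_off]), bytes(chunk[hdr_off:msg_off])


data, hdr = b"x" * 512, b"hdr!"
payload, msg_off = build_write(data, hdr)
chunk = bytearray(4096)            # one memory chunk on the server
chunk[:len(payload)] = payload     # stands in for the rdma write
print(msg_off)                     # 516
print(parse_write(chunk, msg_off) == (data, hdr))  # True
```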
* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and rdma writes there the user data, user header and
the RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk
has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk and, when that finishes, passes the IO to the RNBD server module.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), the size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32 bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

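A read request carries more than a write: flags and a list of client buffers
for the server to fill. The Python sketch below packs and unpacks such a
message. The field widths, the MSG_READ code, the F_NEED_INVAL flag bit and
the (address, key, length) descriptor shape are assumptions for illustration,
not the on-wire format of the driver.

```python
import struct

MSG_READ = 3       # hypothetical type code
F_NEED_INVAL = 1   # hypothetical flag: client asks for invalidation
HDR_FMT = "<HHHH"  # assumed: type, user header size, flags, descriptor count
DESC_FMT = "<QII"  # assumed: 64 bit address, 32 bit key, 32 bit length


def build_read(usr_hdr, descs, need_inval):
    """Client side: user header first, then the message with its descriptors."""
    flags = F_NEED_INVAL if need_inval else 0
    msg = struct.pack(HDR_FMT, MSG_READ, len(usr_hdr), flags, len(descs))
    for addr, key, length in descs:
        msg += struct.pack(DESC_FMT, addr, key, length)
    return usr_hdr + msg, len(usr_hdr)  # second value: offset of the message


def parse_read(buf, msg_off):
    """Server side: recover the user header, the flag and the descriptors."""
    msg_type, usr_len, flags, count = struct.unpack_from(HDR_FMT, buf, msg_off)
    if msg_type != MSG_READ:
        raise ValueError("not a read message")
    off = msg_off + struct.calcsize(HDR_FMT)
    descs = []
    for _ in range(count):
        descs.append(struct.unpack_from(DESC_FMT, buf, off))
        off += struct.calcsize(DESC_FMT)
    return buf[msg_off - usr_len:msg_off], bool(flags & F_NEED_INVAL), descs


hdr = b"read-hdr"
descs = [(0x1000, 0xABC, 4096), (0x3000, 0xABD, 512)]
buf, msg_off = build_read(hdr, descs, need_inval=True)
parsed = parse_read(buf, msg_off)
print(parsed == (hdr, True, descs))  # True
```

In this model the flag tells the server whether to use the response variant
with invalidation shown in the diagram above.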
* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), the size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk and,
when that finishes, passes the IO to the RNBD server module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32 bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey message,
it validates the message, updates the rkey for the rbuffer, finishes the IO
and then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>