.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
        |                 |               |                |
        +-----------+     |               |     +----------+
                    |     |               |     |
                +----------------------------------+
                |           Multiplexor            |
                +----------------------------------+
                    |   |           |           |  |
        +---------+   |           |           |  ------------+
        |             |           |           |              |
    +----------+  +----------+  +----------+  +----------+ +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
    +----------+  +----------+  +----------+  +----------+ +----------+
        |              |           |            |             |
    +----------+  +----------+  +----------+  +----------+ +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
    +----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations on different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection-specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages on the KCM socket.

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor, a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.
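
As an illustration of what such a parser computes, the following userspace C
sketch derives the total message length from a 4-byte big-endian length
prefix. The function name frame_length, the header layout, and the convention
of returning 0 when the header is still incomplete are assumptions made for
this example only; the real parser is a BPF program that runs in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of KCM): return the total length, header
 * plus payload, of a message framed with a 4-byte big-endian length
 * prefix. Returns 0 if fewer than 4 bytes are available, meaning the
 * length cannot be determined yet.
 */
static int64_t frame_length(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	if (avail < 4)
		return 0;

	/* Assemble the big-endian length field byte by byte. */
	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	/* The prefix counts only the payload, so add the header size. */
	return (int64_t)payload + 4;
}
```

For a header of 00 00 00 0a the sketch reports a 14-byte message: 10 bytes of
payload plus the 4-byte prefix.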

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.
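
A minimal sketch of sizing the receive buffer on a TCP socket before it is
attached is shown below. The helper names set_max_recv_msg and get_recv_buf
are hypothetical. The kernel may adjust the requested value, so the sketch
reads the effective size back rather than assuming the request was honoured
exactly.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: request a receive buffer of max_msg bytes on a TCP
 * socket. Returns 0 on success, -1 with errno set on failure.
 */
static int set_max_recv_msg(int tcpfd, int max_msg)
{
	return setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
			  &max_msg, sizeof(max_msg));
}

/*
 * Hypothetical helper: return the receive buffer size the kernel actually
 * applied, or -1 on failure.
 */
static int get_recv_buf(int tcpfd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF, &val, &len) < 0)
		return -1;
	return val;
}
```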

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete an error
(ETIMEDOUT) is posted on the socket.
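
The receive timeout could be configured along the following lines before the
TCP socket is attached. The helper name set_assembly_timeout and the choice
of whole seconds are assumptions for this sketch.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: set the receive timeout on a TCP socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_assembly_timeout(int tcpfd, long seconds)
{
	struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

	return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```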

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED
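
A sketch of the call with basic error reporting follows. The helper name
kcm_open is hypothetical, and the fallback definition of AF_KCM is only there
for C libraries whose headers predate it. The call fails if the running
kernel was built without KCM support.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <linux/kcm.h>		/* KCMPROTO_CONNECTED */

#ifndef AF_KCM
#define AF_KCM 41		/* value used by the Linux socket headers */
#endif

/*
 * Hypothetical helper: create a new multiplexor and its first KCM socket.
 * Returns the socket descriptor, or -1 if the call fails.
 */
static int kcm_open(void)
{
	int fd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);

	if (fd < 0)
		perror("socket(AF_KCM)");
	return fd;
}
```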

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;
  int err, newkcmfd = -1;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  /* On success the new KCM socket is returned in info.fd */
  if (!err)
    newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, e.g.::

  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  - fd: file descriptor for the TCP socket being attached
  - bpf_fd: file descriptor for the compiled BPF program that was loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;  /* non-zero disables receive, zero enables it again */

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
       return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and inserted into the epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).
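
One way to prepare a KCM socket for each processing thread is sketched
below, assuming the first KCM socket already exists. The helper name
kcm_sockets_for_workers is hypothetical; it relies only on the SIOCKCMCLONE
ioctl described earlier in this document.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sockios.h>	/* SIOCPROTOPRIVATE, base of the KCM ioctls */
#include <linux/kcm.h>		/* struct kcm_clone, SIOCKCMCLONE */

/*
 * Hypothetical helper: fill fds[0..n-1] with KCM sockets that share one
 * multiplexor. fds[0] is the original socket and the rest are clones.
 * Returns n on success. On failure, closes any clones already created
 * and returns -1.
 */
static int kcm_sockets_for_workers(int kcmfd, int *fds, int n)
{
	int i;

	if (n < 1)
		return -1;

	fds[0] = kcmfd;
	for (i = 1; i < n; i++) {
		struct kcm_clone info;

		memset(&info, 0, sizeof(info));
		if (ioctl(kcmfd, SIOCKCMCLONE, &info) < 0) {
			/* Undo the clones made so far; keep the original. */
			while (--i >= 1)
				close(fds[i]);
			return -1;
		}
		fds[i] = info.fd;
	}
	return n;
}
```

Each worker thread would then add its own descriptor to its own epoll set.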

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.
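
The first two methods can be sketched as follows. The helper names
send_batch_mmsg and send_batch_flag, the MAX_BATCH limit, and the use of
NUL-terminated strings as messages are assumptions for illustration; a real
application would send its own framed protocol messages. The third method is
not shown because it depends on the application's message format.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for sendmmsg() and struct mmsghdr */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000	/* value used by the Linux socket headers */
#endif

#define MAX_BATCH 16		/* arbitrary limit for this sketch */

/*
 * Method 1: pass all messages to the kernel in one sendmmsg() call.
 * At most MAX_BATCH messages are sent. Returns the number of messages
 * sent, or -1 on error.
 */
static int send_batch_mmsg(int fd, const char *const msgs[], unsigned int n)
{
	struct mmsghdr hdrs[MAX_BATCH];
	struct iovec iov[MAX_BATCH];
	unsigned int i;

	if (n > MAX_BATCH)
		n = MAX_BATCH;
	memset(hdrs, 0, sizeof(hdrs));
	for (i = 0; i < n; i++) {
		iov[i].iov_base = (void *)msgs[i];
		iov[i].iov_len = strlen(msgs[i]);
		hdrs[i].msg_hdr.msg_iov = &iov[i];
		hdrs[i].msg_hdr.msg_iovlen = 1;
	}
	return sendmmsg(fd, hdrs, n, 0);
}

/*
 * Method 2: one sendmsg() per message, with MSG_BATCH set on every
 * message except the last. Returns n on success, -1 on error.
 */
static int send_batch_flag(int fd, const char *const msgs[], unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		struct iovec iov = {
			.iov_base = (void *)msgs[i],
			.iov_len = strlen(msgs[i]),
		};
		struct msghdr hdr = { .msg_iov = &iov, .msg_iovlen = 1 };
		int flags = (i + 1 < n) ? MSG_BATCH : 0;

		if (sendmsg(fd, &hdr, flags) < 0)
			return -1;
	}
	return (int)n;
}
```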

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for POLLERR
events. If an error occurs on an attached TCP socket, KCM sets EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
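
A minimal sketch of such a monitor is given below. The helper names
watch_tcp_socket, drop_tcp_socket and monitor_once are hypothetical, and the
policy of always closing the socket is just the typical response described
above. Reconnecting and attaching a replacement socket is left to the
application.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/sockios.h>	/* SIOCPROTOPRIVATE, base of the KCM ioctls */
#include <linux/kcm.h>		/* struct kcm_unattach, SIOCKCMUNATTACH */

/* Register an attached TCP socket so that errors wake the monitor. */
static int watch_tcp_socket(int epfd, int tcpfd)
{
	struct epoll_event ev;

	memset(&ev, 0, sizeof(ev));
	ev.events = EPOLLERR;	/* always reported; set here for clarity */
	ev.data.fd = tcpfd;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/*
 * Read the pending error, unattach the socket from KCM, then close it.
 * Returns the error code that was pending on the socket (0 if none).
 */
static int drop_tcp_socket(int kcmfd, int tcpfd)
{
	struct kcm_unattach info;
	int soerr = 0;
	socklen_t len = sizeof(soerr);

	getsockopt(tcpfd, SOL_SOCKET, SO_ERROR, &soerr, &len);

	memset(&info, 0, sizeof(info));
	info.fd = tcpfd;
	ioctl(kcmfd, SIOCKCMUNATTACH, &info);	/* best effort */

	close(tcpfd);
	return soerr;
}

/* One iteration of the monitor loop: wait, then drop a failed socket. */
static void monitor_once(int epfd, int kcmfd, int timeout_ms)
{
	struct epoll_event ev;

	if (epoll_wait(epfd, &ev, 1, timeout_ms) == 1 &&
	    (ev.events & EPOLLERR))
		drop_tcp_socket(kcmfd, ev.data.fd);
}
```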

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case where there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).