.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
          |                 |               |                |
          +-----------+     |               |     +----------+
                      |     |               |     |
                   +----------------------------------+
                   |           Multiplexor            |
                   +----------------------------------+
                     |   |           |           |  |
           +---------+   |           |           |  ------------+
           |             |           |           |              |
    +----------+  +----------+  +----------+  +----------+ +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
    +----------+  +----------+  +----------+  +----------+ +----------+
          |              |           |            |             |
    +----------+  +----------+  +----------+  +----------+ +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
    +----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets
===========

The KCM
sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages on the KCM socket.

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
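
As a rough illustration of the two supported types, the sketch below wraps
the socket call in a small helper that rejects any other type before reaching
the kernel. The helper name kcm_open is hypothetical, and the fallback
definitions assume the Linux values AF_KCM = 41 and KCMPROTO_CONNECTED = 0
for systems whose headers do not provide them.

```c
#include <errno.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; assumed to match the Linux UAPI. */
#ifndef AF_KCM
#define AF_KCM 41
#endif
#ifndef KCMPROTO_CONNECTED
#define KCMPROTO_CONNECTED 0
#endif

/*
 * kcm_open() is a hypothetical helper, not part of any KCM API.
 * Returns a new KCM socket file descriptor, or -1 with errno set.
 */
int kcm_open(int type)
{
    /* KCM only supports datagram and seqpacket sockets. */
    if (type != SOCK_DGRAM && type != SOCK_SEQPACKET) {
        errno = EINVAL;
        return -1;
    }

    /* Fails (for example with EAFNOSUPPORT) if the kernel lacks KCM. */
    return socket(AF_KCM, type, KCMPROTO_CONNECTED);
}
```

A caller would typically invoke kcm_open(SOCK_DGRAM) once per multiplexor and
check the result for -1 before using the descriptor.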

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor, a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).
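
The unattach-then-close response described above could be sketched as
follows. This is only an illustration: kcm_drop_transport is a hypothetical
helper name, and the sketch assumes the kernel UAPI headers linux/kcm.h and
linux/sockios.h are installed. Reconnecting is left to the caller.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sockios.h>  /* SIOCPROTOPRIVATE, base of the KCM ioctls */
#include <linux/kcm.h>      /* struct kcm_unattach, SIOCKCMUNATTACH */

/*
 * kcm_drop_transport() is a hypothetical helper, not part of any KCM API.
 * It unattaches tcpfd from the multiplexor behind kcmfd and then closes it.
 * Returns 0 on success. Returns -1 with errno set if the unattach ioctl
 * fails (tcpfd is then left open) or if close() fails.
 */
int kcm_drop_transport(int kcmfd, int tcpfd)
{
    struct kcm_unattach info;

    memset(&info, 0, sizeof(info));
    info.fd = tcpfd;

    /* Step 1: tell KCM to stop using the failed transport socket. */
    if (ioctl(kcmfd, SIOCKCMUNATTACH, &info) < 0)
        return -1;

    /* Step 2: the stream is assumed unrecoverable, so close it. */
    return close(tcpfd);
}
```

Leaving the socket open when the ioctl fails is a design choice of this
sketch, so that the caller can inspect errno and decide how to proceed.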

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket.
This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
      int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  /* On success the kernel stores the new KCM socket in info.fd */
  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
      newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, for example::

  /* From linux/kcm.h */
  struct kcm_attach {
      int fd;
      int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;            /* TCP socket to attach */
  info.bpf_fd = bpf_prog_fd;  /* loaded BPF program used for delineation */

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  - fd: file descriptor for the TCP socket being attached
  - bpf_fd: file descriptor for the compiled BPF program that was loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward.
An "unattach" ioctl is done with the kcm_unattach structure as the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
      int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;  /* TCP socket to unattach */

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;  /* non-zero disables receive */

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend.
For example, the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
      /* Length word at offset 0, plus 4 bytes for the length word itself */
      return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized).
In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket.
The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for POLLERR
events. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
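
One possible way to read such statistics is the TCP_INFO socket option on the
retained TCP file descriptor. The sketch below is an assumption about how an
application might do this, not something KCM itself prescribes. The helper
name tcp_total_retrans is hypothetical, and it relies on the glibc
definition of struct tcp_info in netinet/tcp.h.

```c
#define _GNU_SOURCE         /* expose struct tcp_info in netinet/tcp.h */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>     /* IPPROTO_TCP */
#include <netinet/tcp.h>    /* TCP_INFO, struct tcp_info */

/*
 * tcp_total_retrans() is a hypothetical helper, not part of any KCM API.
 * It stores the total retransmission count of tcpfd in *out.
 * Returns 0 on success, -1 with errno set on failure.
 */
int tcp_total_retrans(int tcpfd, unsigned int *out)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    memset(&ti, 0, sizeof(ti));

    /* Ask the kernel for the connection statistics of this socket. */
    if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
        return -1;

    *out = ti.tcpi_total_retrans;
    return 0;
}
```

A monitoring thread could call this periodically for each attached TCP socket
and compare successive values to spot a connection with rising
retransmissions.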