============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than about 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds their ulimit on locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.
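
Before turning to the notification format, the setup and transmission
steps above can be combined into a small helper that only requests copy
avoidance for large writes. This is a minimal sketch, not taken from the
kernel sources: the names zc_enable, zc_send_flags, zc_send and
ZC_THRESHOLD are hypothetical, and the threshold is an assumption based
on the rough "around 10 KB" figure mentioned earlier. A real application
would want to measure its own break-even point.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallbacks for older libc headers. The values are those of the
 * asm-generic kernel headers and may differ on a few architectures. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical cut-off (assumption): below this size, plain copying
 * is assumed to be cheaper than page pinning plus notifications. */
#define ZC_THRESHOLD (10 * 1024)

/* Signal intent once per socket. Returns 0 on success, -1 on error. */
int zc_enable(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Choose send flags by payload size: small writes take the copy path. */
int zc_send_flags(size_t len)
{
	return len >= ZC_THRESHOLD ? MSG_ZEROCOPY : 0;
}

/* Send one buffer. If MSG_ZEROCOPY was used, the caller must leave
 * buf unmodified until the matching completion notification arrives. */
ssize_t zc_send(int fd, const void *buf, size_t len)
{
	return send(fd, buf, len, zc_send_flags(len));
}
```

Because calls with and without the flag can be mixed on one socket, the
per-call decision in zc_send_flags() needs no extra socket state.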

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	/* POLLERR is reported in revents even though events is zero */
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
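
The advice to read without blocking every couple of send calls could
look roughly like the loop below. It is a sketch under stated
assumptions: zc_drain and its handle callback are hypothetical names,
and the 128-byte control buffer is an assumed size that is meant to be
large enough for one extended error plus the address that follows it.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Read all notifications currently queued on the error queue of fd.
 * Never blocks: MSG_ERRQUEUE reads return EAGAIN on an empty queue.
 * handle (hypothetical, may be NULL) is called once per message.
 * Returns the number of messages read, or -1 on an unexpected error. */
int zc_drain(int fd, void (*handle)(struct msghdr *msg))
{
	/* The union keeps the buffer suitably aligned for struct cmsghdr. */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	int count = 0;

	for (;;) {
		struct msghdr msg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return count;	/* queue is empty */
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (handle)
			handle(&msg);
		count++;
	}
}
```

A sender might call zc_drain() after every few send calls, passing a
parser such as the read_notification() function used above, and fall
back to poll only when it runs out of buffers to reuse.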

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, so that they do not block send and recv
calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding notifications can be read at once using the
recvmmsg call. This is often not needed. In each message the kernel
returns not a single value, but a range. It coalesces consecutive
notifications while one is outstanding for reception on the error
queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
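
To make the range semantics concrete, the following user-space model
treats a notification as an inclusive range of 32-bit send counters
that may wrap. It is an illustration only, not the kernel code: struct
zc_range and the three helpers are hypothetical names, and
zc_range_extend() is a simplified rendering of the coalescing rule
described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range of send-call counters; hi may be below lo after wrap. */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
};

/* Number of send calls covered. Unsigned arithmetic handles the wrap.
 * A range spanning all 2^32 values would report 0. */
uint32_t zc_range_len(struct zc_range r)
{
	return r.hi - r.lo + 1;
}

/* True if the send call with counter value id falls inside r. */
bool zc_range_contains(struct zc_range r, uint32_t id)
{
	return (uint32_t)(id - r.lo) <= (uint32_t)(r.hi - r.lo);
}

/* Model of coalescing: if next starts right after tail, grow tail and
 * report that next can be dropped. Otherwise leave tail untouched. */
bool zc_range_extend(struct zc_range *tail, struct zc_range next)
{
	if (next.lo != (uint32_t)(tail->hi + 1))
		return false;
	tail->hi = next.hi;
	return true;
}
```

A process that tracks its in-flight buffers by counter value can use
zc_range_contains() to decide which buffers a notification releases.
Since notifications may arrive out of order, it should not assume that
every range starts where the previous one ended.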


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	/* msg points to the struct msghdr filled in by recvmsg() */
	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	cm = CMSG_FIRSTHDR(msg);
	/* IPv4 case: reject the message if either field does not match */
	if (!cm ||
	    cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.