============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
As a result, converting an existing application to MSG_ZEROCOPY is not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information, see that paper and talk or
the excellent reporting over at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds their ulimit on locked pages.
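
One possible way to handle the ENOBUFS case is to retry the same send
without the flag and accept the copy. The following is a minimal
sketch, not taken from the kernel sources: the helper name
send_zc_or_copy and its used_zc out-parameter are hypothetical, and it
assumes SO_ZEROCOPY has already been set on the socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000	/* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send and fall back to a regular
 * copying send if the kernel reports ENOBUFS. Assumes SO_ZEROCOPY was
 * set on fd. *used_zc is set to 1 only if the zerocopy attempt sent
 * data, in which case the caller must leave buf untouched until the
 * completion notification arrives.
 */
static ssize_t send_zc_or_copy(int fd, const void *buf, size_t len,
			       int *used_zc)
{
	ssize_t ret;

	*used_zc = 0;
	ret = send(fd, buf, len, MSG_ZEROCOPY);
	if (ret >= 0) {
		/* A zero-length send does not produce a notification. */
		*used_zc = ret > 0;
		return ret;
	}
	if (errno != ENOBUFS)
		return -1;

	/* Zerocopy was refused, so send a copy instead. */
	return send(fd, buf, len, 0);
}
```

Whether falling back is preferable to reporting the error depends on the
application; a persistent ENOBUFS may point at a limit worth raising.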


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.
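
A per-call choice can be as simple as a size check. The sketch below is
illustrative only: the helper send_flags_for and the ZC_THRESHOLD
constant are hypothetical, and the cutoff merely reuses the "around
10 KB" figure mentioned earlier. The real break-even point depends on
the workload and hardware and should be measured.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000	/* fallback for older libc headers */
#endif

/* Assumed cutoff, not a kernel constant; tune per workload. */
#define ZC_THRESHOLD	(10 * 1024)

/* Hypothetical helper: send flags for one call, chosen by payload size. */
static int send_flags_for(size_t len)
{
	return len >= ZC_THRESHOLD ? MSG_ZEROCOPY : 0;
}
```

A caller would then write send(fd, buf, len, send_flags_for(len)) and
only wait for a notification when the flag was actually passed.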


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
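
A process that wants to match notifications to buffers can mirror this
counter in user space. The sketch below shows one way to do so. The
struct and function names are hypothetical, and it assumes that the
first successful zerocopy send on a socket is assigned the value 0.

```c
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical process-side mirror of the per-socket counter. next_id
 * is the value expected for the next successful zerocopy send
 * (assumed to start at 0 for a new socket).
 */
struct zc_counter {
	uint32_t next_id;
};

/*
 * Account for one send call made with MSG_ZEROCOPY. ret is the return
 * value of that call. Returns 1 and stores the assigned value in *id
 * if the call consumed a counter value, 0 otherwise.
 */
static int zc_counter_on_send(struct zc_counter *c, ssize_t ret,
			      uint32_t *id)
{
	if (ret <= 0)		/* failure or zero length: no increment */
		return 0;
	*id = c->next_id++;	/* unsigned arithmetic wraps modulo 2^32 */
	return 1;
}

/* Number of send calls covered by the inclusive range [lo, hi]. */
static uint32_t zc_range_len(uint32_t lo, uint32_t hi)
{
	return hi - lo + 1;	/* stays correct across wraparound */
}
```

Using uint32_t throughout lets the mirror wrap in step with the counter
without any special casing.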


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
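
The non-blocking approach can be sketched as a loop that reads until
the error queue is empty. This is an illustration under assumptions,
not the selftest code: the helper name drain_errqueue and the
128-byte control buffer size are hypothetical, and parsing of each
message is left out.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: consume every notification currently queued,
 * without waiting for more. Returns the number of error queue messages
 * read, or -1 on an unexpected recvmsg failure.
 */
static int drain_errqueue(int fd)
{
	char control[128];	/* assumed large enough for one notification */
	struct msghdr msg;
	int n = 0;

	for (;;) {
		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return n;	/* queue is empty */
			return -1;
		}
		/* A real caller would parse msg here. */
		n++;
	}
}
```

Calling such a helper once every few sends keeps the error queue short
while avoiding a poll per send.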

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
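
Because ranges may arrive out of order, a process should not assume
that each range starts where the previous one ended. One way to cope
is to index in-flight buffers by counter value and release whatever a
range covers. The sketch below is hypothetical throughout: the names,
the fixed table of 64 slots, and the use of free() (which assumes
buffers came from malloc) are illustrative choices.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed cap on in-flight zerocopy sends; must divide 2^32 evenly. */
#define ZC_SLOTS	64

/* Hypothetical table: buf[id % ZC_SLOTS] is in flight until notified. */
struct zc_inflight {
	void *buf[ZC_SLOTS];
};

/* Remember buf under counter value id. Returns -1 if the slot is busy. */
static int zc_track(struct zc_inflight *t, uint32_t id, void *buf)
{
	if (t->buf[id % ZC_SLOTS])
		return -1;
	t->buf[id % ZC_SLOTS] = buf;
	return 0;
}

/*
 * Release every tracked buffer in the inclusive range [lo, hi], which
 * may wrap around. Returns the number of buffers released.
 */
static unsigned int zc_complete(struct zc_inflight *t, uint32_t lo,
				uint32_t hi)
{
	unsigned int released = 0;
	uint32_t id = lo;

	for (;;) {
		void **slot = &t->buf[id % ZC_SLOTS];

		if (*slot) {
			free(*slot);
			*slot = NULL;
			released++;
		}
		if (id == hi)
			break;
		id++;		/* wraps from UINT32_MAX to 0 */
	}
	return released;
}
```

With this layout, a late notification for an earlier value simply
releases its own slot, regardless of what has been completed since.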


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	cm = CMSG_FIRSTHDR(msg);
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. Therefore, a zerocopy
completion notification is not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.