.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets enable the possibility for
XDP programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX. Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another and reused right
away. This again avoids copying data.

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. The user space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
FILL ring is used by the application to send down addr for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
COMPLETION ring, on the other hand, contains frame addr that the
kernel has transmitted completely and can now be used again by user
space, for either TX or RX. Thus, the frame addrs appearing in the
COMPLETION ring are addrs that were previously transmitted using the
TX ring. In summary, the RX and FILL rings are used for the RX path
and the TX and COMPLETION rings are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.

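As an illustrative sketch (not taken from the kernel sources), the
bind parameters can be pictured with a structure that mirrors struct
sockaddr_xdp from the uapi header <linux/if_xdp.h>; the ifindex and
queue id values used below are hypothetical:

```c
#include <stdint.h>
#include <string.h>

#ifndef AF_XDP
#define AF_XDP 44  /* address family number, as in <sys/socket.h> */
#endif

/* Mirrors struct sockaddr_xdp from <linux/if_xdp.h>; the kernel types
 * __u16/__u32 are shown here as uint16_t/uint32_t. */
struct sockaddr_xdp {
    uint16_t sxdp_family;          /* AF_XDP */
    uint16_t sxdp_flags;           /* e.g. XDP_SHARED_UMEM */
    uint32_t sxdp_ifindex;         /* netdev to bind to */
    uint32_t sxdp_queue_id;        /* queue id on that netdev */
    uint32_t sxdp_shared_umem_fd;  /* fd of socket A when sharing its UMEM */
};

/* Fill in the address later handed to bind(); the caller picks the
 * (here hypothetical) ifindex and queue id. */
static void xsk_fill_bind_addr(struct sockaddr_xdp *sxdp,
                               uint32_t ifindex, uint32_t queue_id)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
}
```

The filled structure would then be passed as bind(fd, (struct sockaddr
*)&sxdp, sizeof(sxdp)) on a socket created with socket(AF_XDP,
SOCK_RAW, 0).
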
The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK socket. The new process will
then receive frame addr references in its own RX ring that point to
this shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share this with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs?
There is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number. If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed that uses SKBs
together with the generic XDP support and copies out the data to user
space, a fallback mode that works for any network device. On the other
hand, if the driver has support for XDP, it will be used by the AF_XDP
code to provide better performance, but there is still a copy of the
data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up. These objects and their options are explained in the
following sections.

For an overview on how AF_XDP works, you can also take a look at the
Linux Plumbers paper from 2018 on the subject:
http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
at AF_XDP. Nearly everything has changed since then. Jonathan Corbet
has also written an excellent article on LWN, "Accelerating networking
with AF_XDP". It can be found at https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id via the bind()
system call.

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: FILL, COMPLETION, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one FILL ring, one COMPLETION ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by the struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed out by the struct xdp_ring
consumer member, and then increases the consumer index.

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).

The size of the rings must be a power of two.

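The power-of-two requirement can be checked with the usual bit trick;
this is just an illustrative sketch, as the kernel performs its own
validation when the setsockopt is issued:

```c
#include <stdbool.h>

/* A valid ring size is a non-zero power of two, so it has exactly one
 * bit set: size & (size - 1) clears the lowest set bit and must leave
 * zero. */
static bool ring_size_is_valid(unsigned int size)
{
    return size != 0 && (size & (size - 1)) == 0;
}
```
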
UMEM Fill Ring
~~~~~~~~~~~~~~

The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that, if
running the application with aligned chunk mode, the kernel will mask
the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
the addr will be masked off, meaning that 2048, 2050 and 3000 all
refer to the same chunk. If the user application is run in the
unaligned chunks mode, then the incoming addr will be left untouched.

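The masking described above can be sketched as follows (illustrative
only; the kernel applies this internally):

```c
#include <stdint.h>

/* Clear the log2(chunk_size) least significant bits of addr, so every
 * addr inside a chunk maps to the start of that chunk. chunk_size
 * must be a power of two. */
static uint64_t chunk_align(uint64_t addr, uint64_t chunk_size)
{
    return addr & ~(chunk_size - 1);
}
```

With a 2k chunk size, chunk_align() maps 2048, 2050 and 3000 to the
same chunk start, 2048, matching the example above.
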

UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The COMPLETION ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the FILL ring, UMEM indices are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.


RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).

If no frames have been passed to the kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.

TX Ring
~~~~~~~

The TX ring is used to send frames. The struct xdp_desc descriptor is
filled in (addr and len) and passed into the ring.

To start the transfer a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.

Libbpf
======

Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
contains two types of functions: those that can be used to make the
setup of AF_XDP sockets easier and ones that can be used in the data
plane to access the rings safely and quickly. To see an example on how
to use this API, please take a look at the sample application in
samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
plane operations.

We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.

XSKMAP / BPF_MAP_TYPE_XSKMAP
============================

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
that is used in conjunction with bpf_redirect_map() to pass the
ingress frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application (samples/bpf/) for an example.

Configuration Flags and Socket Options
======================================

These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.

XDP_COPY and XDP_ZERO_COPY bind flags
-------------------------------------

When you bind to a socket, the kernel will first try to use zero-copy
mode. If zero-copy is not supported, it will fall back on using copy
mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
socket into zero-copy mode or fail.

XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

Starting with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way. Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.

Which socket will a packet then arrive on? This is decided by the XDP
program. Put all the sockets in the XSKMAP and just indicate which
index in the array you would like to send each packet to. A simple
round-robin example of distributing packets is shown below:

.. code-block:: c

   #include <linux/bpf.h>
   #include "bpf_helpers.h"

   #define MAX_SOCKS 16

   struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, MAX_SOCKS);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
   } xsks_map SEC(".maps");

   static unsigned int rr;

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
        rr = (rr + 1) & (MAX_SOCKS - 1);

        return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
   }

Note that since there is only a single set of FILL and COMPLETION
rings, and they are single producer, single consumer rings, you need
to make sure that multiple processes or threads do not use these rings
concurrently. There are no synchronization primitives in the
libbpf code that protect multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built-in one in libbpf that will route the traffic for you.

The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

This option adds support for a new flag called need_wakeup that is
present in the FILL ring and the TX ring, the rings for which user
space is a producer. When this option is set in the bind call, the
need_wakeup flag will be set if the kernel needs to be explicitly
woken up by a syscall to continue processing packets. If the flag is
zero, no syscall is needed.

If the flag is set on the FILL ring, the application needs to call
poll() to be able to continue to receive packets on the RX ring. This
can happen, for example, when the kernel has detected that there are no
more buffers on the FILL ring and no buffers left on the RX HW ring of
the NIC. In this case, interrupts are turned off as the NIC cannot
receive any packets (as there are no buffers to put them in), and the
need_wakeup flag is set so that user space can put buffers on the
FILL ring and then call poll() so that the kernel driver can put these
buffers on the HW ring and start to receive packets.

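In raw (non-libbpf) terms, the check boils down to testing the
XDP_RING_NEED_WAKEUP bit in the ring's flags word. The sketch below
mirrors that flag value from <linux/if_xdp.h> and uses a simplified
stand-in for the mmapped ring structure:

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef XDP_RING_NEED_WAKEUP
#define XDP_RING_NEED_WAKEUP (1 << 0)  /* as in <linux/if_xdp.h> */
#endif

/* Simplified producer-side ring: only the kernel-written flags word
 * of the mmapped ring is shown. */
struct prod_ring {
    uint32_t *flags;
};

/* True if the kernel must be woken up (poll() for the FILL ring,
 * sendto()/poll() for the TX ring) before it processes more packets. */
static bool ring_needs_wakeup(const struct prod_ring *ring)
{
    return (*ring->flags & XDP_RING_NEED_WAKEUP) != 0;
}
```

libbpf wraps this same check as xsk_ring_prod__needs_wakeup().
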
If the flag is set for the TX ring, it means that the application
needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().

An example of how to use this flag can be found in
samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
would look like this for the TX path:

.. code-block:: c

   if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
      sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

I.e., only use the syscall if the flag is set.

We recommend that you always enable this mode as it usually leads to
better performance, especially if you run the application and the
driver on the same core, but also if you use different cores for the
application and the kernel driver, as it reduces the number of
syscalls needed for the TX path.

XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
------------------------------------------------------

These setsockopts set the number of descriptors that the RX, TX,
FILL, and COMPLETION rings respectively should have. It is mandatory
to set the size of at least one of the RX and TX rings. If you set
both, you will be able to both receive and send traffic from your
application, but if you only want to do one of them, you can save
resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.

In libbpf, you can create Rx-only and Tx-only sockets by supplying
NULL to the rx and tx arguments, respectively, to the
xsk_socket__create function.

If you create a Tx-only socket, we recommend that you do not put any
packets on the FILL ring. If you do, drivers might think you are
going to receive something when you in fact will not, and this can
negatively impact performance.

XDP_UMEM_REG setsockopt
-----------------------

This setsockopt registers a UMEM to a socket. This is the area that
contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM
is divided into. It can only be 2K or 4K at the moment. If you have a
UMEM area that is 128K and a chunk size of 2K, this means that you
will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
area and that your largest packet size can be 2K.

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.

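The registration parameters can be pictured with a structure that
mirrors struct xdp_umem_reg from <linux/if_xdp.h>; the sizes used in
the test below are the hypothetical 128K/2K example from above:

```c
#include <stdint.h>

/* Mirrors struct xdp_umem_reg from <linux/if_xdp.h> (kernel types
 * __u64/__u32 shown as uint64_t/uint32_t). */
struct xdp_umem_reg {
    uint64_t addr;        /* start address of the UMEM area */
    uint64_t len;         /* length of the UMEM area in bytes */
    uint32_t chunk_size;  /* 2K or 4K */
    uint32_t headroom;    /* bytes reserved at the start of each buffer */
    uint32_t flags;       /* UMEM flags */
};

/* Maximum number of packets the UMEM can hold at once. */
static uint64_t umem_num_chunks(const struct xdp_umem_reg *reg)
{
    return reg->len / reg->chunk_size;
}
```

For a 128K area with 2K chunks, umem_num_chunks() gives 128K / 2K =
64. The actual registration is done by passing the uapi structure to
setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg)).
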
XDP_STATISTICS getsockopt
-------------------------

Gets drop statistics of a socket that can be useful for debugging
purposes. The supported statistics are shown below:

.. code-block:: c

   struct xdp_statistics {
         __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
         __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
         __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
   };

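A sketch of fetching these counters (error handling elided; it assumes
an already-created and bound XSK fd, and the SOL_XDP/XDP_STATISTICS
values are those from the uapi headers):

```c
#include <stdint.h>
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283       /* as in <sys/socket.h> */
#endif
#ifndef XDP_STATISTICS
#define XDP_STATISTICS 7  /* as in <linux/if_xdp.h> */
#endif

/* Mirrors struct xdp_statistics shown above (__u64 as uint64_t). */
struct xdp_statistics {
    uint64_t rx_dropped;
    uint64_t rx_invalid_descs;
    uint64_t tx_invalid_descs;
};

/* Read the drop counters of a bound XSK; returns 0 on success. */
static int xsk_get_stats(int fd, struct xdp_statistics *stats)
{
    socklen_t optlen = sizeof(*stats);

    return getsockopt(fd, SOL_XDP, XDP_STATISTICS, stats, &optlen);
}
```
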
XDP_OPTIONS getsockopt
----------------------

Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY, which tells you if zero-copy is on or not.

Usage
=====

In order to use AF_XDP sockets, two parts are needed: the
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following:

.. code-block:: c

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }

A simple but not so performant ring dequeue and enqueue could look
like this:

483*4882a593Smuzhiyun.. code-block:: c
484*4882a593Smuzhiyun
485*4882a593Smuzhiyun    // struct xdp_rxtx_ring {
486*4882a593Smuzhiyun    // 	__u32 *producer;
487*4882a593Smuzhiyun    // 	__u32 *consumer;
488*4882a593Smuzhiyun    // 	struct xdp_desc *desc;
489*4882a593Smuzhiyun    // };
490*4882a593Smuzhiyun
491*4882a593Smuzhiyun    // struct xdp_umem_ring {
492*4882a593Smuzhiyun    // 	__u32 *producer;
493*4882a593Smuzhiyun    // 	__u32 *consumer;
494*4882a593Smuzhiyun    // 	__u64 *desc;
495*4882a593Smuzhiyun    // };
496*4882a593Smuzhiyun
497*4882a593Smuzhiyun    // typedef struct xdp_rxtx_ring RING;
498*4882a593Smuzhiyun    // typedef struct xdp_umem_ring RING;
499*4882a593Smuzhiyun
500*4882a593Smuzhiyun    // typedef struct xdp_desc RING_TYPE;
501*4882a593Smuzhiyun    // typedef __u64 RING_TYPE;
502*4882a593Smuzhiyun
503*4882a593Smuzhiyun    int dequeue_one(RING *ring, RING_TYPE *item)
504*4882a593Smuzhiyun    {
505*4882a593Smuzhiyun        __u32 entries = *ring->producer - *ring->consumer;
506*4882a593Smuzhiyun
507*4882a593Smuzhiyun        if (entries == 0)
508*4882a593Smuzhiyun            return -1;
509*4882a593Smuzhiyun
510*4882a593Smuzhiyun        // read-barrier!
511*4882a593Smuzhiyun
512*4882a593Smuzhiyun        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
513*4882a593Smuzhiyun        (*ring->consumer)++;
514*4882a593Smuzhiyun        return 0;
515*4882a593Smuzhiyun    }
516*4882a593Smuzhiyun
517*4882a593Smuzhiyun    int enqueue_one(RING *ring, const RING_TYPE *item)
518*4882a593Smuzhiyun    {
519*4882a593Smuzhiyun        u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);
520*4882a593Smuzhiyun
521*4882a593Smuzhiyun        if (free_entries == 0)
522*4882a593Smuzhiyun            return -1;
523*4882a593Smuzhiyun
524*4882a593Smuzhiyun        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;
525*4882a593Smuzhiyun
526*4882a593Smuzhiyun        // write-barrier!
527*4882a593Smuzhiyun
528*4882a593Smuzhiyun        (*ring->producer)++;
529*4882a593Smuzhiyun        return 0;
530*4882a593Smuzhiyun    }
531*4882a593Smuzhiyun
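The two barrier comments above matter in practice. As a minimal sketch (not the libbpf implementation), here is how such a single-producer/single-consumer ring can be written with C11-style acquire/release atomics, assuming a power-of-two RING_SIZE, locally owned producer/consumer counters, and a concrete 64-bit descriptor type:

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 4  /* must be a power of two */

struct ring {
    uint32_t producer;        /* written by producer, read by consumer */
    uint32_t consumer;        /* written by consumer, read by producer */
    uint64_t desc[RING_SIZE];
};

/* Consumer side: acquire-load the producer index so the descriptor
 * read below cannot be reordered before it (the "read-barrier"). */
static int dequeue_one(struct ring *r, uint64_t *item)
{
    uint32_t prod = __atomic_load_n(&r->producer, __ATOMIC_ACQUIRE);

    if (prod == r->consumer)
        return -1;                       /* ring empty */

    *item = r->desc[r->consumer & (RING_SIZE - 1)];
    /* Release the slot so the producer sees the read as completed. */
    __atomic_store_n(&r->consumer, r->consumer + 1, __ATOMIC_RELEASE);
    return 0;
}

/* Producer side: release-store the producer index so the descriptor
 * write above becomes visible first (the "write-barrier"). */
static int enqueue_one(struct ring *r, uint64_t item)
{
    uint32_t cons = __atomic_load_n(&r->consumer, __ATOMIC_ACQUIRE);

    if (r->producer - cons == RING_SIZE)
        return -1;                       /* ring full */

    r->desc[r->producer & (RING_SIZE - 1)] = item;
    __atomic_store_n(&r->producer, r->producer + 1, __ATOMIC_RELEASE);
    return 0;
}
```

Note that the unsigned wrap-around of the 32-bit indices makes the full/empty tests work without ever masking the counters themselves; only the array accesses are masked.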
But please use the libbpf functions as they are optimized and ready
to use. They will make your life easier.

Sample application
==================

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with private UMEMs. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
on which we will enable AF_XDP. Here, we use ethtool for this::

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using::

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

This sample application uses libbpf to make the setup and usage of
AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
really used to make something more advanced, take a look at the libbpf
code in tools/lib/bpf/xsk.[ch].
FAQ
===

Q: I am not seeing any traffic on the socket. What am I doing wrong?

A: When a netdev of a physical NIC is initialized, Linux usually
   allocates one RX and TX queue pair per core. So on an 8-core system,
   queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
   bind call or the xsk_socket__create libbpf function call, you
   specify a specific queue id to bind to and it is only the traffic
   towards that queue you are going to get on your socket. So in the
   example above, if you bind to queue 0, you are NOT going to get any
   traffic that is distributed to queues 1 through 7. If you are
   lucky, you will see the traffic, but usually it will end up on one
   of the queues you have not bound to.

   There are a number of ways to solve the problem of getting the
   traffic you want to the queue id you bound to. If you want to see
   all the traffic, you can force the netdev to only have 1 queue, queue
   id 0, and then bind to queue 0. You can use ethtool to do this::

     sudo ethtool -L <interface> combined 1

   If you want to only see part of the traffic, you can program the
   NIC through ethtool to filter out your traffic to a single queue id
   that you can bind your XDP socket to. Here is one example in which
   UDP traffic to and from port 4242 is sent to queue 2::

     sudo ethtool -N <interface> rx-flow-hash udp4 fn
     sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
     4242 action 2

   A number of other ways are possible, all depending on the
   capabilities of your NIC.
Q: Can I use the XSKMAP to implement a switch between different umems
   in copy mode?

A: The short answer is no, that is not supported at the moment. The
   XSKMAP can only be used to switch traffic coming in on queue id X
   to sockets bound to the same queue id X. The XSKMAP can contain
   sockets bound to different queue ids, for example X and Y, but only
   traffic coming in from queue id Y can be directed to sockets bound
   to the same queue id Y. In zero-copy mode, you should use the
   switch, or other distribution mechanism, in your NIC to direct
   traffic to the correct queue id and socket.
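This queue-id matching rule can be illustrated with a small user-space model (hypothetical types, not kernel code): a redirect only succeeds when the packet's receive queue id equals the queue id the target socket was bound to.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_QUEUES 8

/* Hypothetical model of an AF_XDP socket bound to one queue id. */
struct xsk_model {
    uint32_t bound_queue;
};

/* Model of an XSKMAP: slot i is meant to hold the socket for queue i. */
struct xskmap_model {
    struct xsk_model *xsk[MAX_QUEUES];
};

/* Mimics the kernel's check: a frame received on rx_queue can only be
 * redirected to a socket bound to that same queue id. Returns 0 on
 * success and -1 when the packet would be dropped. */
static int redirect_model(struct xskmap_model *map, uint32_t rx_queue)
{
    struct xsk_model *xsk;

    if (rx_queue >= MAX_QUEUES)
        return -1;
    xsk = map->xsk[rx_queue];
    if (!xsk || xsk->bound_queue != rx_queue)
        return -1;
    return 0;
}
```

In other words, placing a socket bound to queue X into map slot Y buys you nothing; traffic from queue Y will still be dropped.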

Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
   more than one ring at the same time. If, for example, you feed the
   same buffer into the FILL ring and the TX ring at the same time, the
   NIC might receive data into the buffer at the same time it is
   sending it. This will cause some packets to become corrupted. The
   same thing goes for feeding the same buffer into the FILL rings
   belonging to different queue ids or netdevs bound with the
   XDP_SHARED_UMEM flag.
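A common way to avoid this is to track frame ownership in user space: hand each UMEM frame address to exactly one ring at a time, and only recycle it once the kernel has given it back through the RX or COMPLETION ring. A minimal sketch of such a free stack, with hypothetical frame-size and frame-count values:

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_SIZE  2048   /* hypothetical UMEM chunk size */
#define NUM_FRAMES  16     /* hypothetical number of frames */

/* LIFO stack of frame addresses currently owned by user space,
 * i.e. not sitting in the FILL, RX, TX or COMPLETION ring. */
struct frame_allocator {
    uint64_t free[NUM_FRAMES];
    size_t   nfree;
};

static void frame_allocator_init(struct frame_allocator *fa)
{
    for (size_t i = 0; i < NUM_FRAMES; i++)
        fa->free[i] = i * FRAME_SIZE;
    fa->nfree = NUM_FRAMES;
}

/* Take a frame before posting it to exactly one ring. */
static int frame_alloc(struct frame_allocator *fa, uint64_t *addr)
{
    if (fa->nfree == 0)
        return -1;
    *addr = fa->free[--fa->nfree];
    return 0;
}

/* Return a frame once the kernel has handed it back. */
static void frame_free(struct frame_allocator *fa, uint64_t addr)
{
    fa->free[fa->nfree++] = addr;
}
```

Since a frame address can only be obtained by popping it off the stack, it cannot end up in two rings at once as long as every ring submission is preceded by frame_alloc() and every completion followed by frame_free().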

Credits
=======

- Björn Töpel (AF_XDP core)
- Magnus Karlsson (AF_XDP core)
- Alexander Duyck
- Alexei Starovoitov
- Daniel Borkmann
- Jesper Dangaard Brouer
- John Fastabend
- Jonathan Corbet (LWN coverage)
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn