.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets enable the possibility for
XDP programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX. Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another packet and reused
right away. This again avoids copying data.

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. The user space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
FILL ring is used by the application to send down addrs for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
COMPLETION ring, on the other hand, contains frame addrs that the
kernel has transmitted completely and that can now be used again by
user space, for either TX or RX. Thus, the frame addrs appearing in
the COMPLETION ring are addrs that were previously transmitted using
the TX ring. In summary, the RX and FILL rings are used for the RX
path and the TX and COMPLETION rings are used for the TX path.
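
As a minimal sketch of the setup described above, using the raw uapi
from <linux/if_xdp.h>, socket creation, UMEM registration and ring
sizing could look like this. Error handling is elided and the frame
count, chunk size and ring sizes are illustrative assumptions, not
requirements:

.. code-block:: c

   #include <linux/if_xdp.h>
   #include <sys/socket.h>
   #include <sys/mman.h>

   #define NUM_FRAMES 4096
   #define FRAME_SIZE 2048
   #define RING_SIZE  2048 /* must be a power of two */

   static int create_xsk(void)
   {
       int fd = socket(AF_XDP, SOCK_RAW, 0);
       int size = RING_SIZE;

       /* The UMEM area itself; plain anonymous memory is fine. */
       void *bufs = mmap(NULL, NUM_FRAMES * FRAME_SIZE,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

       struct xdp_umem_reg mr = {
           .addr = (__u64)(unsigned long)bufs,
           .len = NUM_FRAMES * FRAME_SIZE,
           .chunk_size = FRAME_SIZE,
           .headroom = 0,
       };
       setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

       /* Size the four rings; at least one of RX/TX is mandatory. */
       setsockopt(fd, SOL_XDP, XDP_RX_RING, &size, sizeof(size));
       setsockopt(fd, SOL_XDP, XDP_TX_RING, &size, sizeof(size));
       setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &size, sizeof(size));
       setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &size,
                  sizeof(size));

       return fd;
   }
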
The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.

The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK socket. The new process will
then receive frame addr references in its own RX ring that point to
this shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share this with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs?
There is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number. If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
together with the generic XDP support and copies out the data to user
space; this is a fallback mode that works for any network device. On
the other hand, if the driver has support for XDP, it will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up. These objects and their options are explained in the
following sections.

For an overview on how AF_XDP works, you can also take a look at the
Linux Plumbers paper from 2018 on the subject:
http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf.
Do NOT consult the paper from 2017 on "AF_PACKET v4", the first
attempt at AF_XDP. Nearly everything has changed since then. Jonathan
Corbet has also written an excellent article on LWN, "Accelerating
networking with AF_XDP". It can be found at
https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id via the bind()
system call.

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: FILL, COMPLETION, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one FILL ring, one COMPLETION ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by the struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed out by the struct xdp_ring
consumer member, and then increases the consumer index.

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).

The size of the rings needs to be a power of two.
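
As a hedged sketch of this mapping step, showing only the RX ring and
eliding error handling: the offsets of the producer index, consumer
index and descriptor array inside each mapping are queried with the
XDP_MMAP_OFFSETS getsockopt. RING_SIZE is assumed to match the size
given to the XDP_RX_RING setsockopt:

.. code-block:: c

   #include <linux/if_xdp.h>
   #include <sys/mman.h>

   struct xdp_mmap_offsets off;
   socklen_t optlen = sizeof(off);

   getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);

   void *rx_map = mmap(NULL,
                       off.rx.desc + RING_SIZE * sizeof(struct xdp_desc),
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                       fd, XDP_PGOFF_RX_RING);

   /* The producer/consumer indices and the descriptor array all live
    * inside this one mapping, at kernel-provided offsets. */
   __u32 *rx_producer = (__u32 *)((char *)rx_map + off.rx.producer);
   __u32 *rx_consumer = (__u32 *)((char *)rx_map + off.rx.consumer);
   struct xdp_desc *rx_descs =
       (struct xdp_desc *)((char *)rx_map + off.rx.desc);
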
UMEM Fill Ring
~~~~~~~~~~~~~~

The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that, if
running the application in aligned chunk mode, the kernel will mask
the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSBs of
the addr will be masked off, meaning that 2048, 2050 and 3000 all
refer to the same chunk. If the user application is run in unaligned
chunks mode, then the incoming addr will be left untouched.


UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The COMPLETION ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the FILL ring, UMEM addrs are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.


RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).

If no frames have been passed to the kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.

TX Ring
~~~~~~~

The TX ring is used to send frames. A struct xdp_desc descriptor is
filled (addr and len) and passed into the ring.

To start the transfer a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.
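
A hedged sketch of producing one TX descriptor and kicking the
kernel. Here tx_producer and tx_descs are assumed to have been mapped
as in the RX ring example earlier, and frame_addr/frame_len describe a
UMEM chunk that user space currently owns:

.. code-block:: c

   /* A real implementation must first check that the ring has free
    * entries and issue a write barrier before publishing the new
    * producer index. */
   __u32 idx = *tx_producer & (RING_SIZE - 1);

   tx_descs[idx].addr = frame_addr; /* UMEM offset of the frame */
   tx_descs[idx].len = frame_len;

   /* write barrier needed here */
   (*tx_producer)++;

   /* Tell the kernel there is something to send. */
   sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
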
Libbpf
======

Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
contains two types of functions: those that can be used to make the
setup of AF_XDP sockets easier and ones that can be used in the data
plane to access the rings safely and quickly. To see an example of how
to use this API, please take a look at the sample application in
samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
plane operations.

We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.

XSKMAP / BPF_MAP_TYPE_XSKMAP
============================

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
that is used in conjunction with bpf_redirect_map() to pass the
ingress frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application (samples/bpf/) for an example.

Configuration Flags and Socket Options
======================================

These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.

XDP_COPY and XDP_ZERO_COPY bind flags
-------------------------------------

When you bind a socket, the kernel will first try to use zero-copy.
If zero-copy is not supported, it will fall back on using copy mode,
i.e. copying all packets out to user space. But if you would like to
force a certain mode, you can use the following flags. If you pass the
XDP_COPY flag to the bind call, the kernel will force the socket into
copy mode. If it cannot use copy mode, the bind call will fail with an
error. Conversely, the XDP_ZERO_COPY flag will force the socket into
zero-copy mode or fail.
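
A hedged sketch of such a bind call; the interface name and queue id
are arbitrary examples:

.. code-block:: c

   /* Bind an XSK to queue 0 of eth0, forcing zero-copy mode. If the
    * driver lacks zero-copy support, bind() fails and the application
    * can retry with XDP_COPY instead. */
   #include <linux/if_xdp.h>
   #include <net/if.h>

   struct sockaddr_xdp sxdp = {
       .sxdp_family = AF_XDP,
       .sxdp_ifindex = if_nametoindex("eth0"),
       .sxdp_queue_id = 0,
       .sxdp_flags = XDP_ZERO_COPY,
   };

   if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)))
       perror("zero-copy bind failed");
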
XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

Starting with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way. Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.

Which socket will a packet then arrive on? This is decided by the XDP
program. Put all the sockets in the XSK_MAP and just indicate which
index in the array you would like to send each packet to. A simple
round-robin example of distributing packets is shown below:

.. code-block:: c

   #include <linux/bpf.h>
   #include "bpf_helpers.h"

   #define MAX_SOCKS 16

   struct {
       __uint(type, BPF_MAP_TYPE_XSKMAP);
       __uint(max_entries, MAX_SOCKS);
       __uint(key_size, sizeof(int));
       __uint(value_size, sizeof(int));
   } xsks_map SEC(".maps");

   static unsigned int rr;

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       rr = (rr + 1) & (MAX_SOCKS - 1);

       return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
   }

Note that since there is only a single set of FILL and COMPLETION
rings, and they are single producer, single consumer rings, you need
to make sure that multiple processes or threads do not use these rings
concurrently. There are no synchronization primitives in the
libbpf code that protect multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built in one in libbpf that will route the traffic for you. A sketch
of such a call is shown below.
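
This is a hedged sketch only, assuming a UMEM and its FILL/COMPLETION
rings have already been created with xsk_umem__create(); the names
umem, rx and tx are placeholders, not a complete program:

.. code-block:: c

   /* Create a second socket on an existing UMEM and inhibit libbpf's
    * built-in XDP program so that a custom one (such as the
    * round-robin program above) can be loaded instead. */
   struct xsk_socket_config cfg = {
       .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
       .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
       .libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
   };
   struct xsk_socket *xsk;
   struct xsk_ring_cons rx;
   struct xsk_ring_prod tx;

   int err = xsk_socket__create(&xsk, "eth0", 0, umem, &rx, &tx, &cfg);
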
The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

This option adds support for a new flag called need_wakeup that is
present in the FILL ring and the TX ring, the rings for which user
space is a producer. When this option is set in the bind call, the
need_wakeup flag will be set if the kernel needs to be explicitly
woken up by a syscall to continue processing packets. If the flag is
zero, no syscall is needed.

If the flag is set on the FILL ring, the application needs to call
poll() to be able to continue to receive packets on the RX ring. This
can happen, for example, when the kernel has detected that there are no
more buffers on the FILL ring and no buffers left on the RX HW ring of
the NIC. In this case, interrupts are turned off as the NIC cannot
receive any packets (as there are no buffers to put them in), and the
need_wakeup flag is set so that user space can put buffers on the
FILL ring and then call poll() so that the kernel driver can put these
buffers on the HW ring and start to receive packets.
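
Using the libbpf helpers, the RX path with this flag could look like
the following hedged sketch, where fd is the XSK file descriptor and
fq is assumed to be the mapped FILL ring (a struct xsk_ring_prod):

.. code-block:: c

   /* After refilling the FILL ring, only poll() if the kernel has
    * asked to be woken up. */
   #include <poll.h>

   struct pollfd pfd = { .fd = fd, .events = POLLIN };

   if (xsk_ring_prod__needs_wakeup(&fq))
       poll(&pfd, 1, 1000 /* ms */);
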
If the flag is set for the TX ring, it means that the application
needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().

An example of how to use this flag can be found in
samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
would look like this for the TX path:

.. code-block:: c

   if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
       sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

I.e., only use the syscall if the flag is set.

We recommend that you always enable this mode as it usually leads to
better performance especially if you run the application and the
driver on the same core, but also if you use different cores for the
application and the kernel driver, as it reduces the number of
syscalls needed for the TX path.

XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
------------------------------------------------------

These setsockopts set the number of descriptors that the RX, TX,
FILL, and COMPLETION rings respectively should have. It is mandatory
to set the size of at least one of the RX and TX rings. If you set
both, you will be able to both receive and send traffic from your
application, but if you only want to do one of them, you can save
resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.

In libbpf, you can create Rx-only and Tx-only sockets by supplying
NULL to the rx and tx arguments, respectively, to the
xsk_socket__create function.

If you create a Tx-only socket, we recommend that you do not put any
packets on the fill ring. If you do this, drivers might think you are
going to receive something when you in fact will not, and this can
negatively impact performance.
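
As a hedged illustration of the Rx-only case mentioned above, with
umem being a placeholder for an already created UMEM:

.. code-block:: c

   /* An Rx-only socket: passing NULL for the tx argument means no TX
    * ring is created, saving those resources. A NULL config selects
    * libbpf's defaults. */
   struct xsk_socket *xsk;
   struct xsk_ring_cons rx;

   int err = xsk_socket__create(&xsk, "eth0", 0, umem, &rx, NULL, NULL);
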
XDP_UMEM_REG setsockopt
-----------------------

This setsockopt registers a UMEM to a socket. This is the area that
contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM
is divided into. It can only be 2K or 4K at the moment. If you have a
UMEM area that is 128K and a chunk size of 2K, this means that you
will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
area and that your largest packet size can be 2K.

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.

XDP_STATISTICS getsockopt
-------------------------

Gets drop statistics of a socket that can be useful for debug
purposes. The supported statistics are shown below:

.. code-block:: c

   struct xdp_statistics {
       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
   };

XDP_OPTIONS getsockopt
----------------------

Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.

Usage
=====

In order to use AF_XDP sockets two parts are needed. The
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following:

.. code-block:: c

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }

A simple but not so performant ring dequeue and enqueue could look
like this:

.. code-block:: c

   // struct xdp_rxtx_ring {
   //     __u32 *producer;
   //     __u32 *consumer;
   //     struct xdp_desc *desc;
   // };

   // struct xdp_umem_ring {
   //     __u32 *producer;
   //     __u32 *consumer;
   //     __u64 *desc;
   // };

   // typedef struct xdp_rxtx_ring RING;
   // typedef struct xdp_umem_ring RING;

   // typedef struct xdp_desc RING_TYPE;
   // typedef __u64 RING_TYPE;

   int dequeue_one(RING *ring, RING_TYPE *item)
   {
       __u32 entries = *ring->producer - *ring->consumer;

       if (entries == 0)
           return -1;

       // read-barrier!

       *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
       (*ring->consumer)++;
       return 0;
   }

   int enqueue_one(RING *ring, const RING_TYPE *item)
   {
       __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

       if (free_entries == 0)
           return -1;

       ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

       // write-barrier!

       (*ring->producer)++;
       return 0;
   }

But please use the libbpf functions as they are optimized and ready to
use, as in the sketch below. They will make your life easier.
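
As a hedged sketch of the libbpf data-plane equivalent, where rx (a
struct xsk_ring_cons), fq (a struct xsk_ring_prod) and BATCH_SIZE are
placeholders: a receive loop that recycles buffers back to the FILL
ring could look like this:

.. code-block:: c

   /* Batch-receive descriptors from the RX ring with the libbpf
    * helpers and return their addrs to the FILL ring. A real loop
    * must also handle xsk_ring_prod__reserve() reserving fewer
    * entries than requested. */
   #define BATCH_SIZE 64

   __u32 idx_rx = 0, idx_fq = 0;
   unsigned int i, rcvd;

   rcvd = xsk_ring_cons__peek(&rx, BATCH_SIZE, &idx_rx);
   if (rcvd) {
       xsk_ring_prod__reserve(&fq, rcvd, &idx_fq);
       for (i = 0; i < rcvd; i++) {
           const struct xdp_desc *desc =
               xsk_ring_cons__rx_desc(&rx, idx_rx++);

           /* ... process the packet at UMEM offset desc->addr ... */

           *xsk_ring_prod__fill_addr(&fq, idx_fq++) = desc->addr;
       }
       xsk_ring_prod__submit(&fq, rcvd);
       xsk_ring_cons__release(&rx, rcvd);
   }
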
Sample application
==================

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with private UMEMs. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
which we will enable AF_XDP on. Here, we use ethtool for this::

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using::

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

This sample application uses libbpf to make the setup and usage of
AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
really used to make something more advanced, take a look at the libbpf
code in tools/lib/bpf/xsk.[ch].

FAQ
=======

Q: I am not seeing any traffic on the socket. What am I doing wrong?

A: When a netdev of a physical NIC is initialized, Linux usually
   allocates one RX and TX queue pair per core. So on an 8 core
   system, queue ids 0 to 7 will be allocated, one per core. In the
   AF_XDP bind call or the xsk_socket__create libbpf function call,
   you specify a specific queue id to bind to and it is only the
   traffic towards that queue you are going to get on your socket. So
   in the example above, if you bind to queue 0, you are NOT going to
   get any traffic that is distributed to queues 1 through 7. If you
   are lucky, you will see the traffic, but usually it will end up on
   one of the queues you have not bound to.

   There are a number of ways to solve the problem of getting the
   traffic you want to the queue id you bound to. If you want to see
   all the traffic, you can force the netdev to only have 1 queue,
   queue id 0, and then bind to queue 0. You can use ethtool to do
   this::

     sudo ethtool -L <interface> combined 1

   If you want to only see part of the traffic, you can program the
   NIC through ethtool to filter out your traffic to a single queue id
   that you can bind your XDP socket to. Here is one example in which
   UDP traffic to and from port 4242 are sent to queue 2::

     sudo ethtool -N <interface> rx-flow-hash udp4 fn
     sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
       4242 action 2

   A number of other ways are possible all up to the capabilities of
   the NIC you have.

Q: Can I use the XSKMAP to implement a switch between different UMEMs
   in copy mode?

A: The short answer is no, that is not supported at the moment. The
   XSKMAP can only be used to switch traffic coming in on queue id X
   to sockets bound to the same queue id X. The XSKMAP can contain
   sockets bound to different queue ids, for example X and Y, but only
   traffic coming in from queue id Y can be directed to sockets bound
   to the same queue id Y. In zero-copy mode, you should use the
   switch, or other distribution mechanism, in your NIC to direct
   traffic to the correct queue id and socket.

Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
   more than one ring at the same time.
   If you for example feed the same buffer into the FILL ring and the
   TX ring at the same time, the NIC might receive data into the
   buffer at the same time as it is sending it. This will cause some
   packets to become corrupted. The same thing goes for feeding the
   same buffer into the FILL rings belonging to different queue ids or
   netdevs bound with the XDP_SHARED_UMEM flag.

Credits
=======

- Björn Töpel (AF_XDP core)
- Magnus Karlsson (AF_XDP core)
- Alexander Duyck
- Alexei Starovoitov
- Daniel Borkmann
- Jesper Dangaard Brouer
- John Fastabend
- Jonathan Corbet (LWN coverage)
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn