.. SPDX-License-Identifier: GPL-2.0

===========
Packet MMAP
===========

Abstract
========

This file documents the mmap() facility available with the PACKET
socket interface on 2.4/2.6/3.x kernels. This type of socket is used to

i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or perform any other task that needs raw
    access to a network interface.

A howto can be found at:

    https://sites.google.com/site/packetmmap/

Please send your comments to
    - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    - Johann Baudy

Why use PACKET_MMAP
===================

In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet; it requires two if you want to get the packet's timestamp
(like libpcap always does).

On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a size
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call.
Concerning transmission, multiple packets can be sent through one system call
to get the highest bandwidth. Using a shared buffer between the kernel and the
user also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI; also make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
supported by the devices of your network. CPU IRQ pinning of your network
interface card can also be an advantage.

How to use mmap() to improve capture process
============================================

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0.

How to use mmap() directly to improve capture process
=====================================================

From the system call standpoint, the use of PACKET_MMAP involves
the following process::


    [setup]     socket() -------> creation of the capture socket
                setsockopt() ---> allocation of the circular buffer (ring)
                                  option: PACKET_RX_RING
                mmap() ---------> mapping of the allocated buffer to the
                                  user process

    [capture]   poll() ---------> to wait for incoming packets

    [shutdown]  close() --------> destruction of the capture socket and
                                  deallocation of all associated
                                  resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP::

    int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

As without PACKET_MMAP, it is possible to use one socket
for capture and transmission. This can be done by mapping the
allocated RX and TX buffer ring with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".

The next sections describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer into the user process and
the use of this buffer.

How to use mmap() directly to improve transmission process
==========================================================

The transmission process is similar to capture, as shown below::

    [setup]         socket() -------> creation of the transmission socket
                    setsockopt() ---> allocation of the circular buffer (ring)
                                      option: PACKET_TX_RING
                    bind() ---------> bind transmission socket with a network interface
                    mmap() ---------> mapping of the allocated buffer to the
                                      user process

    [transmission]  poll() ---------> wait for free packets (optional)
                    send() ---------> send all packets that are set as ready in
                                      the ring
                                      The flag MSG_DONTWAIT can be used to return
                                      before end of transfer.

    [shutdown]      close() --------> destruction of the transmission socket and
                                      deallocation of all associated resources.

Socket creation and destruction is also straightforward, and is done
the same way as for capturing, described in the previous paragraph::

    int fd = socket(PF_PACKET, mode, 0);

The protocol can optionally be 0 in case we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.

As with capture, each frame contains two parts::

     --------------------
    | struct tpacket_hdr | Header. It contains the status
    |                    | of this frame
    |--------------------|
    | data buffer        |
    .                    .  Data that will be sent over the network interface.
    .                    .
     --------------------

bind() associates the socket with your network interface through the
sll_ifindex parameter of struct sockaddr_ll.

Initialization example::

    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));

    /* get interface index of eth0 */
    ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex =  s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

A complete tutorial is available at: https://sites.google.com/site/packetmmap/

By default, the user should put data at::

    frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at::

    frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

If you wish to put user data at a custom offset from the beginning of
the frame (for payload alignment with SOCK_RAW mode for instance) you
can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
to make this work it must be enabled previously with setsockopt()
and the PACKET_TX_HAS_OFF option.
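
The two offset expressions above describe the same location, which can be
checked in a few lines of C. The following is only a sketch for TPACKET_V1:
the helper names (tx_default_data_offset, tx_data_ptr,
tx_enable_custom_offset) are made up for illustration, and the int-sized
option value passed with PACKET_TX_HAS_OFF is an assumption about the
expected argument width.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Default offset of the user data inside a TPACKET_V1 TX frame
 * (hypothetical helper, mirrors the formula given in the text). */
static size_t tx_default_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Pointer to the user data of the frame starting at frame_base. */
static void *tx_data_ptr(void *frame_base)
{
    return (char *)frame_base + tx_default_data_offset();
}

/* Ask the kernel to honour tp_net/tp_mac as custom data offsets.
 * Returns 0 on success, -1 with errno set otherwise.
 * Assumption: the option takes an int-sized boolean. */
static int tx_enable_custom_offset(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_PACKET, PACKET_TX_HAS_OFF, &one, sizeof(one));
}
```

Calling tx_enable_custom_offset() right after socket() and before the
PACKET_TX_RING setsockopt() is the conservative order, since ring related
options are usually refused once a ring exists.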

PACKET_MMAP settings
====================

PACKET_MMAP is set up from user level code with a call like:

 - Capture process::

     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

 - Transmission process::

     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter,
which must have the following structure::

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr.
Note that tp_frame_nr is a redundant parameter because::

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true::

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values::

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure::

            block #1                 block #2
    +---------+---------+    +---------+---------+
    | frame 1 | frame 2 |    | frame 3 | frame 4 |
    +---------+---------+    +---------+---------+

            block #3                 block #4
    +---------+---------+    +---------+---------+
    | frame 5 | frame 6 |    | frame 7 | frame 8 |
    +---------+---------+    +---------+---------+

A frame can be of any size with the only condition that it fits in a block. A
block can only hold an integer number of frames, or in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
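
Because tp_frame_nr follows from the other three fields, it can be derived
instead of being typed in by hand. The helper below is a minimal sketch of
that idea; ring_req_init is a hypothetical name, and it only checks the
conditions discussed so far (a frame must fit in a block, and the frame
count must be consistent), not every check the kernel performs.

```c
#include <limits.h>
#include <linux/if_packet.h>

/* Fill *req for a ring of block_nr blocks and derive tp_frame_nr.
 * Returns 0 on success, -1 if the sizes cannot describe a ring. */
static int ring_req_init(struct tpacket_req *req, unsigned int block_size,
                         unsigned int frame_size, unsigned int block_nr)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size || block_nr == 0)
        return -1;              /* a frame has to fit inside one block */

    frames_per_block = block_size / frame_size;
    if (block_nr > UINT_MAX / frames_per_block)
        return -1;              /* tp_frame_nr would overflow */

    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = frames_per_block * block_nr;
    return 0;
}
```

With the values of the example above, ring_req_init(&req, 4096, 2048, 4)
yields tp_frame_nr = 8, and req can then be passed to setsockopt() with
PACKET_RX_RING or PACKET_TX_RING.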

PACKET_MMAP setting constraints
===============================

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture. For information on these kernel versions
see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt

Block size limit
----------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as::

   PAGE_SIZE << MAX_ORDER

   In an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.
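
The arithmetic can be reproduced with a couple of lines of C. This is just an
illustration of the formula above; the function names are hypothetical, and
the MAX_ORDER value has to be supplied by the caller because it is a kernel
build-time constant.

```c
#include <stdint.h>
#include <unistd.h>

/* Upper bound on a block, following the formula PAGE_SIZE << MAX_ORDER. */
static uint64_t max_block_bytes(uint64_t page_size, unsigned int max_order)
{
    return page_size << max_order;
}

/* Same bound, using the page size of the running system.
 * max_order is an input: 10 and 11 are the i386 values quoted in the text. */
static uint64_t max_block_bytes_here(unsigned int max_order)
{
    return max_block_bytes((uint64_t)sysconf(_SC_PAGESIZE), max_order);
}
```

For the i386 figures quoted above, max_block_bytes(4096, 10) gives 4 MiB and
max_block_bytes(4096, 11) gives 8 MiB.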

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.

Block number limit
------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated::

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes.
The predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is::

     131072/4 = 32768 blocks

PACKET_MMAP buffer size calculator
==================================

Definitions:

==============  ================================================================
<size-max>      is the maximum size allocable with kmalloc
                (see /proc/slabinfo)
<pointer size>  depends on the architecture -- ``sizeof(void *)``
<page size>     depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>     is the value defined with MAX_ORDER
<frame size>    it's an upper bound of frame's capture size (more on this later)
==============  ================================================================

from these definitions we will derive::

    <block number> = <size-max>/<pointer size>
    <block size> = <pagesize> << <max-order>

so, the max buffer size is::

    <block number> * <block size>

and the number of frames is::

    <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for a 2.6 kernel and an
i386 architecture::


    <size-max> = 131072 bytes
    <pointer size> = 4 bytes
    <pagesize> = 4096 bytes
    <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield::

    <block number> = 131072/4 = 32768 blocks
    <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; on i386 the
kernel's memory size is limited to 1GiB.

No memory allocation is freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally the limits can be reached.

Other constraints
-----------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr used in PACKET_MMAP to hold the link level
frame's meta information like the timestamp. So what we draw here as a frame
is really the following (from include/linux/if_packet.h)::

    /*
      Frame structure:

      - Start.
        Frame must be aligned to TPACKET_ALIGNMENT=16
      - struct tpacket_hdr
      - pad to TPACKET_ALIGNMENT=16
      - struct sockaddr_ll
      - Gap, chosen so that packet data (Start+tp_net) aligns to
        TPACKET_ALIGNMENT=16
      - Start+tp_mac: [ Optional MAC header ]
      - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
      - Pad to align to TPACKET_ALIGNMENT=16
    */

The following conditions are checked in packet_set_ring:

   - tp_block_size must be a multiple of PAGE_SIZE (1)
   - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   - tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.

Mapping and use of the circular buffer (ring)
---------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several
physically discontiguous blocks of memory, they are contiguous in user space,
hence just one call to mmap is needed::

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size frames will be
contiguously spaced by tp_frame_size bytes.
If not, after every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two
blocks.

To use one socket for capture and transmission, the mapping of both the
RX and TX buffer ring has to be done with one call to mmap::

    ...
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
    ...
    rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    tx_ring = rx_ring + size;

RX must be the first as the kernel maps the TX ring memory right
after the RX one.

At the beginning of each frame there is a status field (see
struct tpacket_hdr).
If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

Capture process
^^^^^^^^^^^^^^^

From include/linux/if_packet.h::

     #define TP_STATUS_COPY          (1 << 1)
     #define TP_STATUS_LOSING        (1 << 2)
     #define TP_STATUS_CSUMNOTREADY  (1 << 3)
     #define TP_STATUS_CSUM_VALID    (1 << 7)

======================  =======================================================
TP_STATUS_COPY          This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING        indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY  currently it's used for outgoing IP packets whose
                        checksum will be done in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

TP_STATUS_CSUM_VALID    This flag indicates that at least the transport
                        header checksum of the packet has been already
                        validated on the kernel side. If the flag is not set
                        then we are free to check the checksum by ourselves
                        provided that TP_STATUS_CSUMNOTREADY is also not set.
======================  =======================================================

For convenience there are also the following defines::

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read the user must zero the status field, so the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring::

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
introduce a race condition.
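
The ownership hand-off described above can be sketched as a small loop over
the mapped ring. This is an illustration for TPACKET_V1 with SOCK_RAW, not a
reference implementation: rx_frame_at, rx_drain and the handle callback are
hypothetical names, ring is assumed to be the address returned by mmap(), and
the __sync_synchronize() compiler builtin is used here as a conservative
barrier before the frame is given back.

```c
#include <stddef.h>
#include <linux/if_packet.h>

/* Address of frame idx, given the geometry that was passed to
 * PACKET_RX_RING. Works whether or not frames fill a block exactly. */
static struct tpacket_hdr *rx_frame_at(void *ring,
                                       const struct tpacket_req *req,
                                       unsigned int idx)
{
    unsigned int frames_per_block = req->tp_block_size / req->tp_frame_size;
    size_t off = (size_t)(idx / frames_per_block) * req->tp_block_size +
                 (size_t)(idx % frames_per_block) * req->tp_frame_size;

    return (struct tpacket_hdr *)((char *)ring + off);
}

/* Consume the frames owned by user space, starting at *next, and hand
 * each one back to the kernel. Returns the number of frames consumed;
 * the caller would poll() when this returns 0. */
static unsigned int rx_drain(void *ring, const struct tpacket_req *req,
                             unsigned int *next,
                             void (*handle)(const unsigned char *data,
                                            unsigned int len))
{
    unsigned int done = 0;

    while (done < req->tp_frame_nr) {
        struct tpacket_hdr *hdr = rx_frame_at(ring, req, *next);

        if (!(hdr->tp_status & TP_STATUS_USER))
            break;                      /* still owned by the kernel */
        if (handle)                     /* tp_mac: start of the link header */
            handle((const unsigned char *)hdr + hdr->tp_mac,
                   hdr->tp_snaplen);
        __sync_synchronize();           /* finish reads before releasing */
        hdr->tp_status = TP_STATUS_KERNEL;
        *next = (*next + 1) % req->tp_frame_nr;
        done++;
    }
    return done;
}
```

The frame index is kept by the caller so that the next call resumes where the
previous one stopped, which matches the circular use of the buffer.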

Transmission process
^^^^^^^^^^^^^^^^^^^^

The following defines are used for transmission::

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills the data buffer of an available frame, sets tp_len to
the current data buffer size and sets its status field to
TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is
ready to transmit, it calls send(). Then all buffers with status equal to
TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel updates
the status of each sent frame to TP_STATUS_SENDING until the end of transfer.

At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.

::

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to wait until a buffer becomes available, for
example while the status of the next frame is still TP_STATUS_SENDING::

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);

What TPACKET versions are available and when to use them?
=========================================================

::

    int val = tpacket_version;
    socklen_t len = sizeof(val);

    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
    getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
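
A slightly fuller sketch of selecting a version is shown below. It is not
part of the original text: tpacket_select_version is a hypothetical helper,
and it relies on the PACKET_HDRLEN socket option from linux/if_packet.h,
which takes the version as input and returns the size of the matching frame
header. The version is assumed to be selected before any ring is created.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Select a TPACKET version on fd.
 * Returns the frame header length for that version, or -1 on error. */
static int tpacket_select_version(int fd, int version)
{
    int val = version;
    socklen_t len = sizeof(val);

    /* PACKET_HDRLEN: version goes in, header length comes back. */
    if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
        return -1;              /* version unknown to this kernel */

    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION,
                   &version, sizeof(version)) < 0)
        return -1;

    return val;
}
```

A caller could try TPACKET_V3 first and fall back to TPACKET_V2 and then
TPACKET_V1 when the helper returns -1, keeping the returned header length for
later offset computations.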

TPACKET_V1:
    - Default if not otherwise specified by setsockopt(2)
    - RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
    - Made 64 bit clean due to unsigned long usage in TPACKET_V1
      structures, thus this also works on 64 bit kernel with 32 bit
      userspace and the like
    - Timestamp resolution in nanoseconds instead of microseconds
    - RX_RING, TX_RING available
    - VLAN metadata information available for packets
      (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
      in the tpacket2_hdr structure:

        - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
          that the tp_vlan_tci field has valid VLAN TCI value
        - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
          indicates that the tp_vlan_tpid field has valid VLAN TPID value

    - How to switch to TPACKET_V2:

        1. Replace struct tpacket_hdr by struct tpacket2_hdr
        2. Query header len and save
        3. Set protocol version to 2, set up ring as usual
        4. For getting the sockaddr_ll,
           use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
           ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``

TPACKET_V2 --> TPACKET_V3:
    - Flexible buffer implementation for RX_RING:

        1. Blocks can be configured with non-static frame-size
        2. Read/poll is at a block-level (as opposed to packet-level)
        3.
Added poll timeout to avoid indefinite user-space wait 585*4882a593Smuzhiyun on idle links 586*4882a593Smuzhiyun 4. Added user-configurable knobs: 587*4882a593Smuzhiyun 588*4882a593Smuzhiyun 4.1 block::timeout 589*4882a593Smuzhiyun 4.2 tpkt_hdr::sk_rxhash 590*4882a593Smuzhiyun 591*4882a593Smuzhiyun - RX Hash data available in user space 592*4882a593Smuzhiyun - TX_RING semantics are conceptually similar to TPACKET_V2; 593*4882a593Smuzhiyun use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN 594*4882a593Smuzhiyun instead of TPACKET2_HDRLEN. In the current implementation, 595*4882a593Smuzhiyun the tp_next_offset field in the tpacket3_hdr MUST be set to 596*4882a593Smuzhiyun zero, indicating that the ring does not hold variable sized frames. 597*4882a593Smuzhiyun Packets with non-zero values of tp_next_offset will be dropped. 598*4882a593Smuzhiyun 599*4882a593SmuzhiyunAF_PACKET fanout mode 600*4882a593Smuzhiyun===================== 601*4882a593Smuzhiyun 602*4882a593SmuzhiyunIn the AF_PACKET fanout mode, packet reception can be load balanced among 603*4882a593Smuzhiyunprocesses. This also works in combination with mmap(2) on packet sockets. 604*4882a593Smuzhiyun 605*4882a593SmuzhiyunCurrently implemented fanout policies are: 606*4882a593Smuzhiyun 607*4882a593Smuzhiyun - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash 608*4882a593Smuzhiyun - PACKET_FANOUT_LB: schedule to socket by round-robin 609*4882a593Smuzhiyun - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on 610*4882a593Smuzhiyun - PACKET_FANOUT_RND: schedule to socket by random selection 611*4882a593Smuzhiyun - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another 612*4882a593Smuzhiyun - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping 613*4882a593Smuzhiyun 614*4882a593SmuzhiyunMinimal example code by David S. 
Miller (try things like "./test eth0 hash", 615*4882a593Smuzhiyun"./test eth0 lb", etc.):: 616*4882a593Smuzhiyun 617*4882a593Smuzhiyun #include <stddef.h> 618*4882a593Smuzhiyun #include <stdlib.h> 619*4882a593Smuzhiyun #include <stdio.h> 620*4882a593Smuzhiyun #include <string.h> 621*4882a593Smuzhiyun 622*4882a593Smuzhiyun #include <sys/types.h> 623*4882a593Smuzhiyun #include <sys/wait.h> 624*4882a593Smuzhiyun #include <sys/socket.h> 625*4882a593Smuzhiyun #include <sys/ioctl.h> 626*4882a593Smuzhiyun 627*4882a593Smuzhiyun #include <unistd.h> 628*4882a593Smuzhiyun 629*4882a593Smuzhiyun #include <linux/if_ether.h> 630*4882a593Smuzhiyun #include <linux/if_packet.h> 631*4882a593Smuzhiyun 632*4882a593Smuzhiyun #include <net/if.h> 633*4882a593Smuzhiyun 634*4882a593Smuzhiyun static const char *device_name; 635*4882a593Smuzhiyun static int fanout_type; 636*4882a593Smuzhiyun static int fanout_id; 637*4882a593Smuzhiyun 638*4882a593Smuzhiyun #ifndef PACKET_FANOUT 639*4882a593Smuzhiyun # define PACKET_FANOUT 18 640*4882a593Smuzhiyun # define PACKET_FANOUT_HASH 0 641*4882a593Smuzhiyun # define PACKET_FANOUT_LB 1 642*4882a593Smuzhiyun #endif 643*4882a593Smuzhiyun 644*4882a593Smuzhiyun static int setup_socket(void) 645*4882a593Smuzhiyun { 646*4882a593Smuzhiyun int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); 647*4882a593Smuzhiyun struct sockaddr_ll ll; 648*4882a593Smuzhiyun struct ifreq ifr; 649*4882a593Smuzhiyun int fanout_arg; 650*4882a593Smuzhiyun 651*4882a593Smuzhiyun if (fd < 0) { 652*4882a593Smuzhiyun perror("socket"); 653*4882a593Smuzhiyun return EXIT_FAILURE; 654*4882a593Smuzhiyun } 655*4882a593Smuzhiyun 656*4882a593Smuzhiyun memset(&ifr, 0, sizeof(ifr)); 657*4882a593Smuzhiyun strcpy(ifr.ifr_name, device_name); 658*4882a593Smuzhiyun err = ioctl(fd, SIOCGIFINDEX, &ifr); 659*4882a593Smuzhiyun if (err < 0) { 660*4882a593Smuzhiyun perror("SIOCGIFINDEX"); 661*4882a593Smuzhiyun return EXIT_FAILURE; 662*4882a593Smuzhiyun } 663*4882a593Smuzhiyun 664*4882a593Smuzhiyun 
memset(&ll, 0, sizeof(ll)); 665*4882a593Smuzhiyun ll.sll_family = AF_PACKET; 666*4882a593Smuzhiyun ll.sll_ifindex = ifr.ifr_ifindex; 667*4882a593Smuzhiyun err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); 668*4882a593Smuzhiyun if (err < 0) { 669*4882a593Smuzhiyun perror("bind"); 670*4882a593Smuzhiyun return EXIT_FAILURE; 671*4882a593Smuzhiyun } 672*4882a593Smuzhiyun 673*4882a593Smuzhiyun fanout_arg = (fanout_id | (fanout_type << 16)); 674*4882a593Smuzhiyun err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, 675*4882a593Smuzhiyun &fanout_arg, sizeof(fanout_arg)); 676*4882a593Smuzhiyun if (err) { 677*4882a593Smuzhiyun perror("setsockopt"); 678*4882a593Smuzhiyun return EXIT_FAILURE; 679*4882a593Smuzhiyun } 680*4882a593Smuzhiyun 681*4882a593Smuzhiyun return fd; 682*4882a593Smuzhiyun } 683*4882a593Smuzhiyun 684*4882a593Smuzhiyun static void fanout_thread(void) 685*4882a593Smuzhiyun { 686*4882a593Smuzhiyun int fd = setup_socket(); 687*4882a593Smuzhiyun int limit = 10000; 688*4882a593Smuzhiyun 689*4882a593Smuzhiyun if (fd < 0) 690*4882a593Smuzhiyun exit(fd); 691*4882a593Smuzhiyun 692*4882a593Smuzhiyun while (limit-- > 0) { 693*4882a593Smuzhiyun char buf[1600]; 694*4882a593Smuzhiyun int err; 695*4882a593Smuzhiyun 696*4882a593Smuzhiyun err = read(fd, buf, sizeof(buf)); 697*4882a593Smuzhiyun if (err < 0) { 698*4882a593Smuzhiyun perror("read"); 699*4882a593Smuzhiyun exit(EXIT_FAILURE); 700*4882a593Smuzhiyun } 701*4882a593Smuzhiyun if ((limit % 10) == 0) 702*4882a593Smuzhiyun fprintf(stdout, "(%d) \n", getpid()); 703*4882a593Smuzhiyun } 704*4882a593Smuzhiyun 705*4882a593Smuzhiyun fprintf(stdout, "%d: Received 10000 packets\n", getpid()); 706*4882a593Smuzhiyun 707*4882a593Smuzhiyun close(fd); 708*4882a593Smuzhiyun exit(0); 709*4882a593Smuzhiyun } 710*4882a593Smuzhiyun 711*4882a593Smuzhiyun int main(int argc, char **argp) 712*4882a593Smuzhiyun { 713*4882a593Smuzhiyun int fd, err; 714*4882a593Smuzhiyun int i; 715*4882a593Smuzhiyun 716*4882a593Smuzhiyun if (argc != 3) { 
717*4882a593Smuzhiyun fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); 718*4882a593Smuzhiyun return EXIT_FAILURE; 719*4882a593Smuzhiyun } 720*4882a593Smuzhiyun 721*4882a593Smuzhiyun if (!strcmp(argp[2], "hash")) 722*4882a593Smuzhiyun fanout_type = PACKET_FANOUT_HASH; 723*4882a593Smuzhiyun else if (!strcmp(argp[2], "lb")) 724*4882a593Smuzhiyun fanout_type = PACKET_FANOUT_LB; 725*4882a593Smuzhiyun else { 726*4882a593Smuzhiyun fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); 727*4882a593Smuzhiyun exit(EXIT_FAILURE); 728*4882a593Smuzhiyun } 729*4882a593Smuzhiyun 730*4882a593Smuzhiyun device_name = argp[1]; 731*4882a593Smuzhiyun fanout_id = getpid() & 0xffff; 732*4882a593Smuzhiyun 733*4882a593Smuzhiyun for (i = 0; i < 4; i++) { 734*4882a593Smuzhiyun pid_t pid = fork(); 735*4882a593Smuzhiyun 736*4882a593Smuzhiyun switch (pid) { 737*4882a593Smuzhiyun case 0: 738*4882a593Smuzhiyun fanout_thread(); 739*4882a593Smuzhiyun 740*4882a593Smuzhiyun case -1: 741*4882a593Smuzhiyun perror("fork"); 742*4882a593Smuzhiyun exit(EXIT_FAILURE); 743*4882a593Smuzhiyun } 744*4882a593Smuzhiyun } 745*4882a593Smuzhiyun 746*4882a593Smuzhiyun for (i = 0; i < 4; i++) { 747*4882a593Smuzhiyun int status; 748*4882a593Smuzhiyun 749*4882a593Smuzhiyun wait(&status); 750*4882a593Smuzhiyun } 751*4882a593Smuzhiyun 752*4882a593Smuzhiyun return 0; 753*4882a593Smuzhiyun } 754*4882a593Smuzhiyun 755*4882a593SmuzhiyunAF_PACKET TPACKET_V3 example 756*4882a593Smuzhiyun============================ 757*4882a593Smuzhiyun 758*4882a593SmuzhiyunAF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame 759*4882a593Smuzhiyunsizes by doing it's own memory management. It is based on blocks where polling 760*4882a593Smuzhiyunworks on a per block basis instead of per ring as in TPACKET_V2 and predecessor. 
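
Because the ring is organised in blocks, its overall geometry follows from
the block size, the block count and the nominal frame size. The sketch below
only illustrates that arithmetic with the values used by the example program
in this section (4 MiB blocks, 64 blocks, 2 KiB frames). The type and function
names are hypothetical, and the sketch assumes the frame size divides the
block size evenly:

```c
#include <stddef.h>

/* Hypothetical helper type, for illustration only. */
struct ring_geometry {
    unsigned int frames_per_block;
    unsigned int frame_nr;	/* candidate value for tp_frame_nr */
    size_t map_len;		/* length of the mapping passed to mmap() */
};

/* Derive the ring geometry from block size, block count and frame size. */
static struct ring_geometry compute_ring_geometry(unsigned int block_size,
                                                  unsigned int block_nr,
                                                  unsigned int frame_size)
{
    struct ring_geometry g;

    g.frames_per_block = block_size / frame_size;
    g.frame_nr = g.frames_per_block * block_nr;
    g.map_len = (size_t) block_size * block_nr;
    return g;
}
```

With the example's values this gives 2048 frames per block, 131072 frames in
total and a 256 MiB mapping, which matches the ``tp_frame_nr`` expression
``(blocksiz * blocknum) / framesiz`` used in the example program.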

It is said that TPACKET_V3 brings the following benefits:

 * ~15% - 20% reduction in CPU-usage
 * ~20% increase in packet capture rate
 * ~2x increase in packet density
 * Port aggregation analysis
 * Non static frame size to capture entire packet payload

So it seems to be a good candidate to be used with packet fanout.

Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::

    /* Written from scratch, but kernel-to-user space API usage
     * dissected from lolpcap:
     *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
     *  License: GPL, version 2.0
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <poll.h>
    #include <unistd.h>
    #include <signal.h>
    #include <inttypes.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>

    #ifndef likely
    # define likely(x)		__builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    # define unlikely(x)		__builtin_expect(!!(x), 0)
    #endif

    struct block_desc {
        uint32_t version;
        uint32_t offset_to_priv;
        struct tpacket_hdr_v1 h1;
    };

    struct ring {
        struct iovec *rd;
        uint8_t *map;
        struct tpacket_req3 req;
    };

    static unsigned long packets_total = 0, bytes_total = 0;
    /* Written from the signal handler, hence volatile sig_atomic_t. */
    static volatile sig_atomic_t sigint = 0;

    static void sighandler(int num)
    {
        sigint = 1;
    }

    static int setup_socket(struct ring *ring, char *netdev)
    {
        int err, fd, v = TPACKET_V3;
        unsigned int i;
        struct sockaddr_ll ll;
        unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
        unsigned int blocknum = 64;

        fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket");
            exit(1);
        }

        /* The version must be selected before the ring is requested. */
        err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
        if (err < 0) {
            perror("setsockopt");
            exit(1);
        }

        memset(&ring->req, 0, sizeof(ring->req));
        ring->req.tp_block_size = blocksiz;
        ring->req.tp_frame_size = framesiz;
        ring->req.tp_block_nr = blocknum;
        ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
        ring->req.tp_retire_blk_tov = 60;
        ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;

        err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
                         sizeof(ring->req));
        if (err < 0) {
            perror("setsockopt");
            exit(1);
        }

        ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
        if (ring->map == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        /* One iovec per block, pointing into the shared mapping. */
        ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
        assert(ring->rd);
        for (i = 0; i < ring->req.tp_block_nr; ++i) {
            ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
            ring->rd[i].iov_len = ring->req.tp_block_size;
        }

        memset(&ll, 0, sizeof(ll));
        ll.sll_family = AF_PACKET;
        ll.sll_protocol = htons(ETH_P_ALL);
        ll.sll_ifindex = if_nametoindex(netdev);
        ll.sll_hatype = 0;
        ll.sll_pkttype = 0;
        ll.sll_halen = 0;

        err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
        if (err < 0) {
            perror("bind");
            exit(1);
        }

        return fd;
    }

    static void display(struct tpacket3_hdr *ppd)
    {
        struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
        struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);

        if (eth->h_proto == htons(ETH_P_IP)) {
            struct sockaddr_in ss, sd;
            char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];

            memset(&ss, 0, sizeof(ss));
            ss.sin_family = AF_INET;
            ss.sin_addr.s_addr = ip->saddr;
            getnameinfo((struct sockaddr *) &ss, sizeof(ss),
                        sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);

            memset(&sd, 0, sizeof(sd));
            sd.sin_family = AF_INET;
            sd.sin_addr.s_addr = ip->daddr;
            getnameinfo((struct sockaddr *) &sd, sizeof(sd),
                        dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);

            printf("%s -> %s, ", sbuff, dbuff);
        }

        printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
    }

    static void walk_block(struct block_desc *pbd, const int block_num)
    {
        int num_pkts = pbd->h1.num_pkts, i;
        unsigned long bytes = 0;
        struct tpacket3_hdr *ppd;

        ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
                                       pbd->h1.offset_to_first_pkt);
        for (i = 0; i < num_pkts; ++i) {
            bytes += ppd->tp_snaplen;
            display(ppd);

            /* Frames are variable sized: follow the per-packet offset. */
            ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
                                           ppd->tp_next_offset);
        }

        packets_total += num_pkts;
        bytes_total += bytes;
    }

    /* Hand the block back to the kernel. */
    static void flush_block(struct block_desc *pbd)
    {
        pbd->h1.block_status = TP_STATUS_KERNEL;
    }

    static void teardown_socket(struct ring *ring, int fd)
    {
        munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
        free(ring->rd);
        close(fd);
    }

    int main(int argc, char **argp)
    {
        int fd, err;
        socklen_t len;
        struct ring ring;
        struct pollfd pfd;
        unsigned int block_num = 0, blocks = 64;
        struct block_desc *pbd;
        struct tpacket_stats_v3 stats;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
            return EXIT_FAILURE;
        }

        signal(SIGINT, sighandler);

        memset(&ring, 0, sizeof(ring));
        fd = setup_socket(&ring, argp[argc - 1]);
        assert(fd > 0);

        memset(&pfd, 0, sizeof(pfd));
        pfd.fd = fd;
        pfd.events = POLLIN | POLLERR;
        pfd.revents = 0;

        while (likely(!sigint)) {
            pbd = (struct block_desc *) ring.rd[block_num].iov_base;

            /* Block still owned by the kernel: wait until it is retired. */
            if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
                poll(&pfd, 1, -1);
                continue;
            }

            walk_block(pbd, block_num);
            flush_block(pbd);
            block_num = (block_num + 1) % blocks;
        }

        len = sizeof(stats);
        err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
        if (err < 0) {
            perror("getsockopt");
            exit(1);
        }

        fflush(stdout);
        printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
               stats.tp_packets, bytes_total, stats.tp_drops,
               stats.tp_freeze_q_cnt);

        teardown_socket(&ring, fd);
        return 0;
    }

PACKET_QDISC_BYPASS
===================

If there is a requirement to load the network with many packets in a similar
fashion as pktgen does, you might set the following option after socket
creation::

    int one = 1;
    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

This has the side effect that packets sent through PF_PACKET will bypass the
kernel's qdisc layer and are forcibly pushed to the driver directly. This
means that packets are not buffered, tc disciplines are ignored, increased
loss can occur, and such packets are no longer visible to other PF_PACKET
sockets. So, you have been warned; generally, this can be useful for stress
testing various components of a system.

By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
on PF_PACKET sockets.

PACKET_TIMESTAMP
================

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
NIC is capable of timestamping packets in hardware, you can request those
hardware timestamps to be used. Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.rst).

PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::

    int req = SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

For the mmap(2)ed ring buffers, such timestamps are stored in the
``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
To determine what kind of timestamp has been reported, the tp_status field
is binary OR'ed with the following possible bits ...

::

    TP_STATUS_TS_RAW_HARDWARE
    TP_STATUS_TS_SOFTWARE

... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
software fallback was invoked *within* PF_PACKET's processing code (less
precise).

Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto() e.g. in blocking mode, iii) wait for the status of the
relevant frames to be updated, that is, for the frames to be handed back to
the application, iv) walk through the frames to pick up the individual hw/sw
timestamps.

Only (!) if transmit timestamping is enabled are these bits combined (binary
OR) with TP_STATUS_AVAILABLE, so you must check for that in your application
(e.g. ``!(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))``
in a first step to see if the frame belongs to the application, and then
extract the type of timestamp from tp_status in a second step)!

If you don't care about them and thus leave transmit timestamping disabled,
checking for TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If
in the TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and
``tp_{n,u}sec`` members do not contain a valid value. For TX_RINGs, by default
no timestamp is generated!

See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
for more information on hardware timestamps.
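
The two-step check on a TX_RING frame's tp_status described above can be
sketched as a small classification function. This is only an illustration:
the enum and the function name ``classify_tx_status`` are hypothetical, while
the ``TP_STATUS_*`` constants are assumed to come from <linux/if_packet.h>:

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical result type, for illustration only. */
enum tx_ts_kind {
    TX_TS_BUSY,		/* frame still belongs to the kernel */
    TX_TS_NONE,		/* frame is ours, no timestamp reported */
    TX_TS_SOFTWARE,	/* frame is ours, software timestamp */
    TX_TS_RAW_HARDWARE,	/* frame is ours, raw hardware timestamp */
};

static enum tx_ts_kind classify_tx_status(uint32_t tp_status)
{
    /* Step 1: does the frame belong to the application again? */
    if (tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
        return TX_TS_BUSY;

    /* Step 2: which kind of timestamp, if any, was OR'ed into the status? */
    if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
        return TX_TS_RAW_HARDWARE;
    if (tp_status & TP_STATUS_TS_SOFTWARE)
        return TX_TS_SOFTWARE;

    return TX_TS_NONE;
}
```

An application walking the TX_RING after sendto() would read tp_sec and
``tp_{n,u}sec`` only for frames classified as carrying a software or hardware
timestamp, since those members hold no valid value otherwise.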

Miscellaneous bits
==================

- Packet sockets work well together with Linux socket filters, thus you also
  might want to have a look at Documentation/networking/filter.rst

THANKS
======

   Jesse Brandeburg, for fixing my grammatical/spelling errors