.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the software flowtable infrastructure available in
Netfilter since Linux kernel 4.16.

Overview
--------

Initial packets follow the classic forwarding path. Once the flow enters the
established state according to the conntrack semantics (i.e. traffic has been
seen in both directions), you can decide to offload the flow to the flowtable
from the forward chain via the 'flow offload' action available in nftables.

Packets that find an entry in the flowtable (i.e. a flowtable hit) are sent to
the output netdevice via neigh_xmit(), hence they bypass the classic forwarding
path (the visible effect is that you do not see these packets from any of the
netfilter hooks coming after the ingress). In case of a flowtable miss, the
packet follows the classic forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
7-tuple selectors: source and destination addresses, layer 3 and layer 4
protocols, source and destination ports, and the input interface (useful in
case there are several conntrack zones in place).

Flowtables are populated via the 'flow offload' nftables action, so the user can
selectively specify which flows are placed into the flowtable.
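
As a minimal sketch of such selective offloading, the forward chain below only
offloads TCP flows towards port 443, while all other traffic stays on the
classic forwarding path. The table, flowtable and chain names, the devices and
the port are illustrative assumptions, not part of the original text::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            # Assumed selector: only flows to TCP port 443 are offloaded.
            tcp dport 443 flow offload @f
        }
    }
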
Hence, packets follow the classic forwarding path unless the user explicitly
directs them to use this alternative forwarding path via nftables policy.

This is represented in Fig.1, which describes the classic forwarding path
including the Netfilter hooks and the flowtable fastpath bypass.

::

                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                'flow offload' rule                        |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that matched the initial packets that went
through the classic forwarding path.
The TTL is decremented before calling neigh_xmit(). Fragmented traffic is
passed up to follow the classic forwarding path: the transport selectors are
missing from fragments, so a flowtable lookup is not possible.

Example configuration
---------------------

Enabling the flowtable bypass is relatively easy. You only need to create a
flowtable and add one rule to your forward chain::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            ip protocol tcp flow offload @f
            counter packets 0 bytes 0
        }
    }

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline. This is convenient in case you already have an
nftables ingress chain (make sure the flowtable priority is lower than that of
the nftables ingress chain, so that the flowtable runs first in the pipeline).

The 'flow offload' action from the forward chain 'y' adds an entry to the
flowtable for the TCP syn-ack packet coming in the reply direction.
Once the flow is offloaded, you will observe that the counter rule in the
example above does not get updated for the packets that are forwarded through
the forwarding bypass.

More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_,
and it also gives a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html